Revamped failure models #228

Merged

@DanteNiewenhuis (Contributor) commented May 3, 2024

Summary

Completely remade the Failure Model system.

There are now three types of Failure Models:

  • Trace-based: failure models driven by traces that describe when failures occur. Failure traces are defined in Parquet files.
  • Sample-based: failure models defined by three sampling distributions: the time between failures, the duration of each failure, and the number of hosts affected by each failure.
  • Prefab: predefined sample-based failure models based on research.
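
To make the sample-based description concrete, here is a minimal Python sketch of how three sampling distributions could drive failure generation. It is not the code from this PR: `FailureEvent`, `generate_failures`, and all parameter names are hypothetical, and the sketch assumes that the time between failures is measured from the end of the previous failure, which the PR text does not specify.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical names throughout; none of these come from the project's API.
Sampler = Callable[[random.Random], float]


@dataclass(frozen=True)
class FailureEvent:
    start: float             # time at which the failure begins (seconds)
    duration: float          # how long the affected hosts stay down (seconds)
    hosts: Tuple[int, ...]   # indices of the hosts that fail


def generate_failures(
    num_hosts: int,
    horizon: float,
    inter_arrival: Sampler,
    duration: Sampler,
    host_count: Sampler,
    seed: int = 0,
) -> List[FailureEvent]:
    """Draw failure events until the simulated time reaches `horizon`."""
    rng = random.Random(seed)
    events: List[FailureEvent] = []
    now = inter_arrival(rng)
    while now < horizon:
        length = duration(rng)
        # Clamp the sampled host count to the range [1, num_hosts].
        k = max(1, min(num_hosts, round(host_count(rng))))
        hosts = tuple(sorted(rng.sample(range(num_hosts), k)))
        events.append(FailureEvent(now, length, hosts))
        # Assumption: the next gap starts once the current failure has ended.
        end = now + length
        now = end + inter_arrival(rng)
    return events


if __name__ == "__main__":
    week = 7 * 24 * 3600.0
    events = generate_failures(
        num_hosts=8,
        horizon=week,
        inter_arrival=lambda r: r.expovariate(1 / (24 * 3600.0)),  # mean 1 day
        duration=lambda r: r.expovariate(1 / 1800.0),              # mean 30 min
        host_count=lambda r: r.uniform(1, 3),
        seed=42,
    )
    for event in events[:3]:
        print(event)
```

Under these assumptions, a prefab model would amount to a fixed choice of the three samplers, and a trace-based model would read equivalent (start, duration, hosts) records from a file instead of sampling them.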

An initial version of checkpointing is implemented, but it currently works well only with SimTraceWorkloads.

More extensive documentation will be provided on the website.

Implementation Notes ⚒️

External Dependencies 🍀

N / A

Breaking API Changes ⚠️

I have tried to keep most of the existing API, but some changes were still made.
This means that any code that uses Failure Models could be affected.


Started incorporating the failure models

Added support for failure traces and different models

Failure traces can now be loaded with files. Failing a host causes an error.

small update

Started fixing failure injection

@DanteNiewenhuis linked an issue May 3, 2024 that may be closed by this pull request
@DanteNiewenhuis merged commit ad20465 into atlarge-research:master May 7, 2024
4 checks passed
@DanteNiewenhuis deleted the failure_fix branch May 7, 2024 10:33
This was linked to issues May 7, 2024
Successfully merging this pull request may close these issues.

  • Add support for fault traces
  • Fix fault injection
  • Fix host failures
2 participants