
AO Network Stability #651

Open
6 of 10 tasks
twilson63 opened this issue Apr 29, 2024 · 0 comments · Fixed by #743
twilson63 commented Apr 29, 2024

The current implementation of the AO Computer is still struggling with stability. Many of the issues relate to its dependence on the Arweave GraphQL gateway and its optimistic cache. Here are the major pain point areas:

  1. Spawning new processes - GraphQL timeouts or 404s occur when creating a new process and trying to locate the appropriate SU. If the process id does not appear in the GraphQL results quickly, the network is unable to discover the location of the process.

  2. Checkpoints - AO checkpoints are used to pull the most recent state, so when a process is reset it does not have to rebuild from the beginning. In the current implementation, a restarting CU writes checkpoints on exit and reads them on start (among other edge cases that trigger a checkpoint). However, when the CU must query the gateway for checkpoints during a restart, the gateway often times out or returns 404s, so the CU cannot load the checkpoint memory into its cache and has to re-evaluate from an earlier checkpoint or, worse, from a cold start.

  3. Cron - The cron monitor is unreliable and not persistent. It should be made persistent and tolerant of errors.

  4. MU Issues

  5. SU Issues

  6. CU Issues
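For the process-discovery failures in item 1, one mitigation is to retry the gateway lookup with exponential backoff instead of failing on the first timeout or 404, giving the optimistic cache time to index the new process. A minimal sketch; `queryFn`, the retry counts, and the delays are assumptions for illustration, not the actual ao implementation:

```javascript
// Retry a flaky async lookup (e.g. a GraphQL query locating a process's SU)
// with exponential backoff. Returns the first non-null result, or throws the
// last error once all retries are exhausted.
async function retryWithBackoff (queryFn, { retries = 5, baseMs = 200 } = {}) {
  let lastErr
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const result = await queryFn()
      if (result != null) return result // e.g. the SU location was found
    } catch (err) {
      lastErr = err // timeout or 404: fall through to backoff and retry
    }
    // exponential backoff: baseMs, 2*baseMs, 4*baseMs, ...
    await new Promise(resolve => setTimeout(resolve, baseMs * 2 ** attempt))
  }
  throw lastErr ?? new Error('process not found after retries')
}
```

A caller would wrap its gateway query in `queryFn` and treat exhausted retries as "process not yet discoverable" rather than a hard failure.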

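The checkpoint problem in item 2 suggests an ordered fallback for a restarting CU: prefer a locally persisted checkpoint (no network), fall back to the gateway, and only cold-start when both fail. A sketch under assumed interfaces; `localStore` and `gateway` are hypothetical, not the actual CU code:

```javascript
// Resolve the best available checkpoint for a process on CU restart.
// Returns { source, checkpoint } where source records which tier succeeded.
async function loadLatestCheckpoint (processId, localStore, gateway) {
  // 1. Local checkpoint written on the previous shutdown, no network needed.
  const local = await localStore.get(processId)
  if (local) return { source: 'local', checkpoint: local }

  // 2. Gateway query, tolerating the timeouts/404s seen during restarts.
  try {
    const remote = await gateway.queryCheckpoint(processId)
    if (remote) return { source: 'gateway', checkpoint: remote }
  } catch (_err) {
    // gateway timeout or 404: fall through to cold start
  }

  // 3. Nothing found: the caller must re-evaluate from scratch.
  return { source: 'coldstart', checkpoint: null }
}
```

Persisting checkpoints locally on exit means a flaky gateway during restart degrades to a slower path rather than a cold start.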
Affected Use Cases

  • New AOS processes fail
  • Downtime for core processes (cred, trinity, etc.) when they get into this state, because it takes them so long to evaluate from zero.

One last note concerns 429s: when the CUs or MUs make too many requests they get rate-limited, which could also be contributing to the instability.
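One way to make the units better citizens after a 429 is to honor a Retry-After header when the gateway sends one (per RFC 9110), and otherwise back off exponentially with jitter. A hedged sketch; the parameter defaults are assumptions, not measured values from the ao services:

```javascript
// Compute how long (in ms) to wait before retrying after an HTTP 429.
// Honors a numeric Retry-After header when present; otherwise uses
// exponential backoff with random jitter, capped at maxMs.
function delayAfter429 (headers, attempt, baseMs = 500, maxMs = 30000) {
  const retryAfter = headers['retry-after']
  if (retryAfter != null) {
    const seconds = Number(retryAfter)
    if (!Number.isNaN(seconds)) return Math.min(seconds * 1000, maxMs)
  }
  const backoff = baseMs * 2 ** attempt   // 500, 1000, 2000, ...
  const jitter = Math.random() * baseMs   // spread retries across clients
  return Math.min(backoff + jitter, maxMs)
}
```

The jitter matters here: if every CU/MU retries on the same schedule after a rate-limit window, they will all hit the limit again together.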

@twilson63 twilson63 transferred this issue from permaweb/aos Apr 29, 2024
@jfrain99 jfrain99 self-assigned this May 28, 2024
@jfrain99 jfrain99 linked a pull request May 29, 2024 that will close this issue