Audit Logs should be able to handle pressure #9870

Open
radeusgd opened this issue May 7, 2024 · 1 comment
Labels
-libs Libraries: New libraries to be implemented l-cloud-integration Enso Cloud integration work p-low Low priority

Comments

@radeusgd
Member

radeusgd commented May 7, 2024

For efficiency, the logs are sent on a background thread.

A problem arises if logs are recorded faster than the background thread can process them, especially since they are currently sent one by one.

On my machine, the round-trip for sending a single log message to the cloud usually takes around 100 ms, but can spike to as much as 2 s per message. This means that under good conditions the maximum throughput is around 10 log events per second. That seems adequate for our current use cases, but it could easily be overwhelmed by a big operation that runs many queries in quick succession. If the operation runs for a short time, the pending logs are queued and sent with some delay. The problem starts if the operation keeps running at too high a throughput for a longer time: more and more log messages pile up in the queue and the system may be unable to keep up.
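A quick back-of-the-envelope calculation shows how fast the backlog grows once the send capacity is exceeded. This sketch uses the ~100 ms round-trip figure from above; the 50 events/s arrival rate is a hypothetical busy workload, not a measured number:

```java
public class BacklogEstimate {
    public static void main(String[] args) {
        int roundTripMillis = 100;                       // observed average round-trip
        int capacityPerSecond = 1000 / roundTripMillis;  // ~10 messages/s send capacity
        int arrivalPerSecond = 50;                       // hypothetical busy workload

        // Every second, this many messages are enqueued but not sent:
        int backlogGrowthPerSecond = arrivalPerSecond - capacityPerSecond; // 40

        // After a 60 s burst, this many messages are still waiting:
        int backlogAfterOneMinute = backlogGrowthPerSecond * 60;
        System.out.println(backlogAfterOneMinute); // prints 2400
    }
}
```

At that rate the queue alone would need four extra minutes to drain after the burst ends, which is why unbounded queuing is risky here.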

We have two values that we need to weigh:

  1. efficiency
  2. reliability of log events

The messages are sent in the background for efficiency, but if the system is overwhelmed, reliability may suffer.

Solutions to consider:

  • Block the main thread once too many log messages are pending.
    • This would allow the background thread to 'unclog' by processing the pending messages, ensuring that not too many wait in the queue at once.
    • IMO this is a reasonable compromise between efficiency and reliability: in usual scenarios the log messages are still sent asynchronously and do not slow down execution at all, but if the throughput limit is exceeded, recording starts blocking so that the background thread has a chance to keep up.
  • Batch the sent messages.
    • Currently the messages are sent one by one. This is not very efficient, as the ~100 ms is mostly round-trip latency. Increasing the payload a bit, e.g. by sending ~10-100 messages in one batch, should yield a similar latency while achieving a much larger throughput.
    • We would still send messages as they come, with no added delay. Only if, after processing the current message, more than one message is already pending (the messages were enqueued faster than we processed them), would we send multiple messages in a single batch to ensure more efficient processing.
    • This would require changes in the Cloud to allow the /logs endpoint to accept multiple messages in a single request.
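The two ideas compose naturally into one mechanism: a bounded blocking queue where the producer blocks only on overflow (backpressure) and the background thread drains whatever has accumulated into one batch. A minimal Java sketch using `java.util.concurrent` — the class name, capacity constants, and the batched send are illustrative assumptions, not Enso's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of a bounded, batching audit-log sender. */
public class AuditLogQueueSketch {
    private static final int QUEUE_CAPACITY = 1000; // backpressure threshold
    private static final int MAX_BATCH = 100;       // messages per request

    private final BlockingQueue<String> pending =
        new ArrayBlockingQueue<>(QUEUE_CAPACITY);

    /** Called from the main thread. Blocks only when QUEUE_CAPACITY
     *  messages are already pending, letting the sender catch up. */
    public void enqueue(String message) throws InterruptedException {
        pending.put(message);
    }

    /** Waits for at least one message, then drains anything else that
     *  accumulated so it all goes out in a single request. */
    public List<String> nextBatch() throws InterruptedException {
        List<String> batch = new ArrayList<>(MAX_BATCH);
        batch.add(pending.take());             // block until one message arrives
        pending.drainTo(batch, MAX_BATCH - 1); // grab whatever else is queued
        return batch;
    }

    /** Background thread loop. */
    public void senderLoop() throws InterruptedException {
        while (true) {
            sendBatch(nextBatch()); // one round-trip per batch
        }
    }

    private void sendBatch(List<String> batch) {
        // A POST to the (hypothetical) batched /logs endpoint would go here.
    }
}
```

When the producer is slow, `nextBatch` returns single-element batches, so no latency is added in the common case; only under load do batches grow, which is exactly when the extra throughput is needed.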
@radeusgd radeusgd added p-low Low priority -libs Libraries: New libraries to be implemented l-cloud-integration Enso Cloud integration work labels May 7, 2024
@radeusgd
Member Author

radeusgd commented May 7, 2024

Setting as low priority because, while this is a real problem, current workloads are unlikely to encounter it often.

mergify bot pushed a commit that referenced this issue May 11, 2024
- Closes #9599
- Implemented API for sending audit logs to the cloud on a background thread.
- If the Postgres connection is opened through a datalink, its internal JDBC connection is replaced by a wrapper that reports executed queries to the audit log.
- Also introduces `EnsoMeta` - a helper Java class that can be used in our helper libraries to access Enso types.
- I have replaced the common pattern scattered throughout the codebase with calls to this 'library' to avoid repetitive code.
- Refactored `Table.display` to share code between in-memory and DB - it was needed because the function stopped working for `DB_Table` after making the `Table` constructor `private`.
- Clearer error when reading a SQLite database from a remote file (tells the user to download it first).
- Follow up - correlate asset id of the data link:
#9869
- Follow up - include project name (once bug is fixed):
#9875
- Some problems/improvements of the audit log:
- The audit log system is not yet ready for high throughput of logs
#9870
- The logs may be lost if `System.exit` is used
#9871
Projects
Status: 📤 Backlog