Audit Logs should be able to handle pressure #9870

Open
radeusgd opened this issue May 7, 2024 · 1 comment
Labels
-libs Libraries: New libraries to be implemented l-cloud-integration Enso Cloud integration work p-low Low priority

Comments

@radeusgd
Member

radeusgd commented May 7, 2024

For efficiency, the logs are sent on a background thread.

A problem arises if logs are recorded faster than the background thread can process them, especially since they are currently sent one by one.

On my machine, the round-trip for sending a single log message to the cloud usually takes around 100 ms, but can spike to as much as 2 s per message. This means that under good conditions the maximum throughput is around 10 log events per second. That seems adequate for our current use cases, but it could easily be overwhelmed by a big operation that runs many queries in quick succession. If the operation runs for a short time, the pending logs are queued and sent with some delay. The problem starts if the operation keeps running at too high a throughput for a longer time: more and more log messages pile up in the queue and the system may be unable to keep up.
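A quick back-of-the-envelope calculation shows how fast the backlog grows once the send capacity is exceeded. This sketch uses the ~100 ms round-trip figure from above; the 50 events/s arrival rate is a hypothetical busy workload, not a measured number:

```java
public class BacklogEstimate {
    public static void main(String[] args) {
        int roundTripMillis = 100;                       // observed average round-trip
        int capacityPerSecond = 1000 / roundTripMillis;  // ~10 messages/s send capacity
        int arrivalPerSecond = 50;                       // hypothetical busy workload

        // Every second, this many messages are enqueued but not sent:
        int backlogGrowthPerSecond = arrivalPerSecond - capacityPerSecond; // 40

        // After a 60 s burst, this many messages are still waiting:
        int backlogAfterOneMinute = backlogGrowthPerSecond * 60;
        System.out.println(backlogAfterOneMinute); // prints 2400
    }
}
```

At that rate the queue alone would need four extra minutes to drain after the burst ends, which is why unbounded queuing is risky here.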

We have two values that we need to weigh:

  1. efficiency
  2. reliability of log events

The messages are sent in the background for efficiency, but if the system is overwhelmed, reliability may suffer.

Solutions to consider:

  • Block the main thread once too many log messages are pending.
    • This would allow the background thread to 'unclog' by processing the pending messages, ensuring that not too many wait in the queue at once.
    • IMO this is a reasonable compromise between efficiency and reliability: in usual scenarios the log messages are still sent asynchronously and do not slow down execution at all, but if the throughput limit is exceeded, recording starts blocking so that the background thread has a chance to keep up.
  • Batch the sent messages.
    • Currently the messages are sent one by one. This is not very efficient, as the ~100 ms is mostly round-trip latency. Increasing the payload a bit, e.g. by sending ~10-100 messages in one batch, should yield a similar latency while achieving a much larger throughput.
    • We would still send messages as they come, with no added delay. Only if, after processing the current message, more than one message is already pending (the messages were enqueued faster than we processed them), would we send multiple messages in a single batch to ensure more efficient processing.
    • This would require changes in the Cloud to allow the /logs endpoint to accept multiple messages in a single request.
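The two ideas compose naturally into one mechanism: a bounded blocking queue where the producer blocks only on overflow (backpressure) and the background thread drains whatever has accumulated into one batch. A minimal Java sketch using `java.util.concurrent` — the class name, capacity constants, and the batched send are illustrative assumptions, not Enso's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of a bounded, batching audit-log sender. */
public class AuditLogQueueSketch {
    private static final int QUEUE_CAPACITY = 1000; // backpressure threshold
    private static final int MAX_BATCH = 100;       // messages per request

    private final BlockingQueue<String> pending =
        new ArrayBlockingQueue<>(QUEUE_CAPACITY);

    /** Called from the main thread. Blocks only when QUEUE_CAPACITY
     *  messages are already pending, letting the sender catch up. */
    public void enqueue(String message) throws InterruptedException {
        pending.put(message);
    }

    /** Waits for at least one message, then drains anything else that
     *  accumulated so it all goes out in a single request. */
    public List<String> nextBatch() throws InterruptedException {
        List<String> batch = new ArrayList<>(MAX_BATCH);
        batch.add(pending.take());             // block until one message arrives
        pending.drainTo(batch, MAX_BATCH - 1); // grab whatever else is queued
        return batch;
    }

    /** Background thread loop. */
    public void senderLoop() throws InterruptedException {
        while (true) {
            sendBatch(nextBatch()); // one round-trip per batch
        }
    }

    private void sendBatch(List<String> batch) {
        // A POST to the (hypothetical) batched /logs endpoint would go here.
    }
}
```

When the producer is slow, `nextBatch` returns single-element batches, so no latency is added in the common case; only under load do batches grow, which is exactly when the extra throughput is needed.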
@radeusgd radeusgd added p-low Low priority -libs Libraries: New libraries to be implemented l-cloud-integration Enso Cloud integration work labels May 7, 2024
@radeusgd
Member Author

radeusgd commented May 7, 2024

Setting as low priority because, while this is a real problem, current workloads are unlikely to encounter it often.

mergify bot pushed a commit that referenced this issue May 11, 2024
- Closes #9599
- Implemented API for sending audit logs to the cloud on a background thread.
- If the Postgres connection is opened through a datalink, its internal JDBC connection is replaced by a wrapper that reports executed queries to the audit log.
- Also introduces `EnsoMeta` - a helper Java class that can be used in our helper libraries to access Enso types.
- I have replaced the common pattern scattered throughout the codebase with calls to this 'library' to avoid repetitive code.
- Refactored `Table.display` to share code between in-memory and DB - it was needed because the function stopped working for `DB_Table` after making the `Table` constructor `private`.
- Clearer error when reading a SQLite database from a remote file (tells the user to download it first).
- Follow up - correlate asset id of the data link:
#9869
- Follow up - include project name (once bug is fixed):
#9875
- Some problems/improvements of the audit log:
- The audit log system is not yet ready for high throughput of logs
#9870
- The logs may be lost if `System.exit` is used
#9871
Projects
Status: 📤 Backlog