cmd/ursrv: Improve data storage and aggregation (fixes #9212) #9305

Draft · er-pa wants to merge 20 commits into main

Conversation

@er-pa (Member) commented Dec 26, 2023

Purpose

Currently, ursrv stores all incoming usage reports directly in a PostgreSQL database. Once a day, only part of the data from these reports is aggregated and stored in three separate tables. Apart from that aggregation, all of the previous day's reports are loaded and parsed in the caching layer. This isn't very efficient, and it forces us to keep storing all the usage reports (currently approx. 136k per day).

With this change, most of the intensive work moves into the aggregation step, which runs once a day. The aggregation now compiles a single aggregated report that combines the data previously produced by the caching mechanism for the previous day's reports with the aggregated data (the three separate tables) for that same day. This new kind of report is stored in a blob store (either locally on disk or in S3-compatible storage). The usage reports themselves are still stored as-is (in the new storage) and are not cleaned up by anything in this change. The idea is that some sort of external retention setting can be used to clean out outdated usage reports, whenever and however that is preferred, but the goal is to probably not keep more than two days' worth of usage reports around.
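
To illustrate the shape of this, a minimal sketch of the blob-store abstraction (the interface and names here are placeholders, not necessarily the actual code in this PR):

package blobstore

import (
	"context"
	"io"
)

// Store is the minimal surface the aggregator and the server need: put a
// serialized report, read it back, and list keys by prefix. A local-disk
// implementation and an S3-compatible implementation would both satisfy it.
type Store interface {
	Put(ctx context.Context, key string, data io.Reader) error
	Get(ctx context.Context, key string) (io.ReadCloser, error)
	List(ctx context.Context, prefix string) ([]string, error)
}

An aggregated report for a given day would then live under a key such as aggregated/2023-12-26.json (the key layout, too, is just an example).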

On the web side itself very little changed, really only some variable names. Most of the logic remained the same as well, mostly because it has proven to work and I didn't see much need to change it significantly. However, a lot of the functionality moved from the serve side to the aggregate side. The serve side now practically only contains logic to format the data into something the front-end can handle easily. Reports from very old Syncthing instances are now simply ignored, instead of having odd work-arounds here and there.

The caching logic changed too. The cache holds the latest aggregated report (from the day before) and data extracted from a whole series of daily aggregated reports for use in graphs (versions, performance and blockstats). The report is updated whenever a new one is found; the graph data is updated whenever there is a new report or when an inconsistency is found (the number of reports in the store versus the number of cached data points), as these two should end up being equal.
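
As a rough sketch of that refresh rule (names are illustrative only, not the actual code):

package serve

import "time"

// shouldRebuildGraphs reports whether the cached graph data needs to be
// regenerated: either a newer aggregated report has appeared, or the number
// of aggregated reports in the store no longer matches the number of cached
// data points.
func shouldRebuildGraphs(latestInStore, latestCached time.Time, reportsInStore, cachedPoints int) bool {
	if latestInStore.After(latestCached) {
		return true // a new aggregated report was published
	}
	return reportsInStore != cachedPoints // store and cache are out of sync
}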

Migrate

Temporary migration support was added (a --migrate option to the aggregate command). It's quite contained, so it should be easy to remove from the code later, together with any remaining database-related content. This migration method reads all the usage reports from the old postgres db, compiles a (new) aggregated report per day and stores it in the new location. The continuous aggregation won't start while doing this.
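
Roughly, the per-day migration loop looks like the sketch below. The table and column names ("reports", "received", "report") and the newAggregation/addJSON helpers are placeholders rather than the actual schema and API, and Store is the blob-store interface sketched earlier:

package main

import (
	"bytes"
	"context"
	"database/sql"
	"encoding/json"
	"time"
)

// migrate walks the old postgres database one day at a time, compiles an
// aggregated report for each day and writes it to the new blob store.
func migrate(ctx context.Context, db *sql.DB, store Store, from, to time.Time) error {
	for day := from; !day.After(to); day = day.AddDate(0, 0, 1) {
		rows, err := db.QueryContext(ctx,
			`SELECT report FROM reports WHERE received >= $1 AND received < $2`,
			day, day.AddDate(0, 0, 1))
		if err != nil {
			return err
		}
		agg := newAggregation(day) // hypothetical per-day accumulator
		for rows.Next() {
			var raw []byte
			if err := rows.Scan(&raw); err != nil {
				rows.Close()
				return err
			}
			agg.addJSON(raw) // hypothetical: parse and fold in one usage report
		}
		rows.Close()
		if err := rows.Err(); err != nil {
			return err
		}
		buf, err := json.Marshal(agg)
		if err != nil {
			return err
		}
		key := "aggregated/" + day.Format("2006-01-02") + ".json"
		if err := store.Put(ctx, key, bytes.NewReader(buf)); err != nil {
			return err
		}
	}
	return nil
}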

Old data is -not- adjusted or changed. With this in mind it is probably possible to fire up the new version alongside the old one, listening on different ports, run the migration and see if it all makes sense and works as expected.

Testing

That's a bit tricky. I did fire up a local ursrv and pointed some local Syncthing instances at it; this all seemed to work as expected. Needless to say, it's a bit difficult to test more comprehensively.

(The reason this is still a draft is that I wanted to run the checks, and I'll run a couple more tests of the recent changes.)

@calmh (Member) commented Dec 27, 2023

Old data is -not- adjusted or changed. With this in mind it is probably possible to fire up the new version alongside the old one, listening on different ports, run the migration and see if it all makes sense and works as expected.

Yes; we should do precisely this. Perhaps we can even rig up a thing to post incoming reports to both old and new and let them run concurrently while finalising/tweaking the PR, and then shift over smoothly.
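
For example, a tiny fan-out proxy in front of the two instances could do it; a sketch, with placeholder URLs and ports:

package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
)

// Forward every incoming usage report to both the old and the new ursrv.
// The target URLs and the endpoint path are placeholders.
var targets = []string{
	"http://old-ursrv:8080/newdata",
	"http://new-ursrv:8081/newdata",
}

func fanout(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	for _, url := range targets {
		resp, err := http.Post(url, r.Header.Get("Content-Type"), bytes.NewReader(body))
		if err != nil {
			log.Printf("forwarding to %s failed: %v", url, err)
			continue
		}
		resp.Body.Close()
	}
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/", fanout)
	log.Fatal(http.ListenAndServe(":8080", nil))
}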

(I have lots of code comments! We'll get to those in time.)

@AudriusButkevicius (Member) commented Dec 27, 2023

Half of the code is adding new storage stuff. What's wrong with storing things in the storage engine we already have, i.e., postgres? Why do we need to pull more machinery into this?

@calmh (Member) commented Dec 27, 2023

I would like to move to something that runs more easily as a stateless container, with something cloud-based for storage. S3-style blob stores are a dime a dozen, and we wouldn't need to maintain a PostgreSQL server.

@calmh (Member) commented Jan 13, 2024

I did some hacks and started running this. The migration is not great at the moment: it generates "empty" aggregated JSON files (i.e., all values null and zero), possibly because it ignores the old versions that generated those reports. (It also takes about an hour per day of data, after multiple hours to find the minimum date to start from... We should either optimise that query to use an index, or add the start date as a command line parameter and default it to 2014-06-11, which is what the query results in :p)

In any case, migration looks like it would take months to complete with the current method, so we may need to think about how to do it faster.

20140613.json

@calmh (Member) commented Jan 16, 2024

Upon further consideration, I think the current aggregation is too lossy. It's fine for its current purpose: displaying the data we're currently interested in, in the GUI. However, it's rather limiting to say that we will never want to do any other form of analysis of the data. Instead, let's store a higher-fidelity aggregation. Starting with contract.Report, we already have a couple of convenience methods to flatten it (FieldPointers and FieldNames). I would like to store an aggregated report that has data for each and every one of these fields:

  • For int and float64 fields, we should store min, max, median, average, 5%ile, 95%ile, sum, and the count of non-zero values (a computation sketch follows below)
  • For map-of-int fields, the above per field
  • For boolean fields, count of true
  • For string fields, a histogram

Now, not all of the metrics will be relevant/sensible for each field (e.g., the sum of NumCPU does not measure anything we care about, and for many of the feature-enabled metrics we mostly care about count-of-nonzero vs total count), but if we do the same thing everywhere we get sensible handling by default for any new fields that may be introduced, something we're otherwise likely to forget.
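
For the numeric fields, the statistics listed above could be computed with something like the sketch below (nearest-rank percentiles; the exact method and all names are just an example):

package aggregate

import "sort"

// floatStatistic mirrors the FloatStatistic message proposed below.
type floatStatistic struct {
	Count         int64   // count of non-zero reports
	Min, Max, Med float64
	Fp, Nfp       float64 // 5th and 95th percentile
	Sum           float64 // divide by Count for the average
}

func computeFloatStatistic(values []float64) floatStatistic {
	// Keep only the non-zero values, matching "count of non-zero".
	nz := make([]float64, 0, len(values))
	for _, v := range values {
		if v != 0 {
			nz = append(nz, v)
		}
	}
	var s floatStatistic
	if len(nz) == 0 {
		return s
	}
	sort.Float64s(nz)
	s.Count = int64(len(nz))
	s.Min = nz[0]
	s.Max = nz[len(nz)-1]
	s.Med = nz[len(nz)/2]
	s.Fp = nz[len(nz)*5/100]   // nearest-rank 5th percentile
	s.Nfp = nz[len(nz)*95/100] // nearest-rank 95th percentile
	for _, v := range nz {
		s.Sum += v
	}
	return s
}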

I would express this aggregated report as a protobuf message, something like this:

message Aggregation {
  int64 date = 1;
  int64 count = 2;
  repeated Statistic statistics = 3;
}

message Statistic {
  string key = 1;
  oneof statistic {
    FloatStatistic float = 2; // e.g., for HashPerf
    IntegerStatistic integer = 3; // e.g., for NumCPU
    Histogram histogram = 4; // e.g., for Version, or Location...
    IntegerStatisticMap map = 5; // e.g., for FilesystemType
  }
}

// Map fields can't appear directly inside a oneof (and field numbers must be
// unique within the message), so the maps are wrapped in small messages.
message Histogram {
  map<string, int64> values = 1;
}

message IntegerStatisticMap {
  map<string, IntegerStatistic> values = 1;
}

message FloatStatistic {
  int64 count = 1; // count of non-zero reports
  double min = 2;
  double max = 3;
  double med = 4;
  double nfp = 5; // 95th percentile
  double fp = 6; // fifth percentile
  double sum = 7; // divide by count for average
}

message IntegerStatistic {
  // Same as FloatStatistic, but the values are int64.
  int64 count = 1; // count of non-zero reports
  int64 min = 2;
  int64 max = 3;
  int64 med = 4;
  int64 nfp = 5; // 95th percentile
  int64 fp = 6; // fifth percentile
  int64 sum = 7; // divide by count for average
}

That way, most data would be preserved in a usable format for graphing and tabulation. The actual API format towards the GUI might well remain as-is, of course, or be changed if that's more convenient.

Generating this should be doable without hardcoding, by iterating over FieldNames and FieldPointers and switching on the actual type of each value.
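
A sketch of that iteration, assuming FieldNames and FieldPointers return matching slices of names and pointers to the report's fields (the accumulator type and its methods are hypothetical):

package aggregate

import "github.com/syncthing/syncthing/lib/ur/contract"

// accumulator is a hypothetical per-field collector feeding the
// Float/Integer/histogram statistics from the proposed message.
type accumulator struct {
	ints   []int64
	floats []float64
	hist   map[string]int64
}

func (a *accumulator) addInt(v int64)     { a.ints = append(a.ints, v) }
func (a *accumulator) addFloat(v float64) { a.floats = append(a.floats, v) }
func (a *accumulator) addHistogram(v string) {
	if a.hist == nil {
		a.hist = make(map[string]int64)
	}
	a.hist[v]++
}

// addReport folds one usage report into the per-field accumulators,
// dispatching on the concrete type of each field rather than hardcoding
// field names.
func addReport(acc map[string]*accumulator, rep *contract.Report) {
	names := rep.FieldNames()
	ptrs := rep.FieldPointers()
	for i, name := range names {
		a := acc[name]
		if a == nil {
			a = &accumulator{}
			acc[name] = a
		}
		switch v := ptrs[i].(type) {
		case *int:
			a.addInt(int64(*v)) // IntegerStatistic
		case *float64:
			a.addFloat(*v) // FloatStatistic
		case *bool:
			if *v {
				a.addInt(1) // count of true
			}
		case *string:
			a.addHistogram(*v) // histogram
		case *map[string]int:
			// A per-key IntegerStatistic (the map case in the message)
			// would be accumulated here.
		default:
			// New field types would need explicit handling here.
		}
	}
}

Any field type not covered by the switch then becomes an explicit reminder to extend the aggregation, rather than something that silently falls through.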
