specification: user-defined statistics #723

Open
wkalt opened this issue Nov 15, 2022 · 4 comments
Labels
feature New feature or request

Comments

@wkalt
Contributor

wkalt commented Nov 15, 2022

Related to #384

It would be nice if there were a way for writers to define and use custom statistics in their mcap files and have them surfaced by the "info" subcommand. I think this could be implemented with a new record type like "custom statistic", which could minimally contain "name" and "value" fields. Better IMO would be to provide a little more structure, like supporting "channel statistic" and "file statistic" variants, or even "channel statistic", "file statistic", "attachment statistic", "metadata statistic", "chunk statistic". The channel statistics would have channel IDs associated with them, file statistics would be whole-file, and metadata/attachment/chunk statistics could contain references to the relevant record offset. I am imagining all these records would be written to the summary section somewhere, I guess in a new statistics section. The "info" command could then display file and channel statistics, and the list attachments/chunks/metadata/channels commands could show the relevant type of statistic as well.
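As a rough illustration of the split-record idea, here is a minimal Python sketch. Every name in it (`FileStatistic`, `ChannelStatistic`, `OffsetStatistic`, `summarize`) is hypothetical and not part of the MCAP spec or any MCAP library, and using a float for "value" is an assumption. It only shows how a reader might group such records before an "info"-style display.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStatistic:
    """Hypothetical whole-file statistic."""
    name: str
    value: float  # value type is an assumption


@dataclass(frozen=True)
class ChannelStatistic:
    """Hypothetical statistic tied to one channel ID."""
    channel_id: int
    name: str
    value: float


@dataclass(frozen=True)
class OffsetStatistic:
    """Hypothetical attachment/metadata/chunk statistic.

    Refers to the relevant record by its byte offset in the file.
    """
    kind: str  # "attachment", "metadata" or "chunk"
    record_offset: int
    name: str
    value: float


def summarize(stats):
    """Group file and channel statistics for display.

    Offset-based statistics are ignored here; they would belong to the
    list attachments/chunks/metadata views instead.
    """
    file_stats = {}
    channel_stats = defaultdict(dict)
    for s in stats:
        if isinstance(s, FileStatistic):
            file_stats[s.name] = s.value
        elif isinstance(s, ChannelStatistic):
            channel_stats[s.channel_id][s.name] = s.value
    return file_stats, dict(channel_stats)
```

For example, `summarize([FileStatistic("max_speed", 3.5), ChannelStatistic(1, "dropped", 2.0)])` yields one dict of file statistics and one dict of per-channel statistics keyed by channel ID.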

Some additional things to think about:

  • what about within-chunk channel statistics?
@wkalt wkalt added the feature New feature or request label Nov 15, 2022
@james-rms
Collaborator

  1. This seems like a reasonably clean way to head off a category of statistic addition requests.
  2. One awkward point here is that we have no floating-point value type in the MCAP spec as it currently stands.
  3. It's also a little awkward that we need to add a new opcode for every concept that one might want to have a statistic about: Channel, Chunk, File, Attachment, Metadata, etc.

@wkalt
Contributor Author

wkalt commented Nov 17, 2022

regarding 2 - I think we would probably just introduce one. The representation in use in file formats I have encountered is IEEE 754.

about 3, I agree - however I'm not sure the alternatives are better. I can imagine two ways to handle it. We could go with a scheme that looks like this:

Statistic
Type: channel | attachment | chunk | metadata | ...
Name: string
Value: float64
Reference: channelID | attachment offset | chunk offset | ??? | ...

or split out different record types as the OP suggests. With the single-record scheme, I don't know how we would encode "chunk channel statistics" if that is something we would want to provide, and the implied dispatch on user-supplied strings to interpret the "reference" field seems kind of brittle to me. But I think at the end of the day you could make either approach work.
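To make the single-record scheme concrete, here is a minimal Python sketch of how its fields might be laid out in bytes. This is not spec text: the integer type tags, the field order, and the uint64 width of the reference are all assumptions made for illustration, and the record's opcode and length header are left out. It does follow the little-endian, uint32-length-prefixed string convention MCAP already uses, and stores the value as an IEEE 754 float64.

```python
import struct
from enum import IntEnum


class StatType(IntEnum):
    """Hypothetical tag values; nothing here is assigned by the spec."""
    CHANNEL = 1
    ATTACHMENT = 2
    CHUNK = 3
    METADATA = 4


def encode_statistic(stat_type, name, value, reference):
    """Serialize the record body: type, name, value, reference."""
    name_bytes = name.encode("utf-8")
    return (
        struct.pack("<B", stat_type)             # 1-byte type tag
        + struct.pack("<I", len(name_bytes))     # uint32 length prefix
        + name_bytes
        + struct.pack("<d", value)               # IEEE 754 float64
        + struct.pack("<Q", reference)           # channel ID or record offset
    )


def decode_statistic(buf):
    """Inverse of encode_statistic."""
    stat_type = StatType(buf[0])
    (name_len,) = struct.unpack_from("<I", buf, 1)
    name_end = 5 + name_len
    name = buf[5:name_end].decode("utf-8")
    value, reference = struct.unpack_from("<dQ", buf, name_end)
    return stat_type, name, value, reference
```

How the reference is interpreted depends entirely on the type tag, which is the dispatch concern raised above; a fixed integer tag at least avoids dispatching on user-supplied strings.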

@wkalt
Contributor Author

wkalt commented Oct 22, 2023

Another interesting application for custom statistics would be accelerating text search using trigrams. Consider that in typical robotics data, text values within specific fields are extremely conserved: they will generally derive from a finite number of error strings that exist in code somewhere. Typical searches will be for rare strings like "ERROR", not common strings like "success".

Imagine a post-processing step that creates a chunk or file-level statistic for each text field contained within the chunk or file. The value of the statistic is a bit vector, maybe 8 or 12 bytes long. During post-processing, string values for each string field are decomposed into trigrams. The trigrams are hashed into the vector and combined with a bitwise OR.

To execute a search, the search term is decomposed into trigrams and hashed into a vector in the same manner. This vector is then checked for overlap with the index vector. If all (or sufficiently many) flipped bits in the search vector are flipped in the index vector, the file/chunk must be examined. Otherwise it can be skipped.
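The indexing and search steps above can be sketched in Python as follows. This is only a sketch under assumptions: the 96-bit (12-byte) vector, the use of CRC32 as the hash, and all function names are illustrative choices. Matching here is case-sensitive and requires every bit of the search vector to be set in the index vector.

```python
import zlib

VECTOR_BITS = 96  # 12 bytes; an assumed size


def trigrams(text):
    """All overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def signature(text, bits=VECTOR_BITS):
    """Hash each trigram to one bit position and OR the bits together."""
    vec = 0
    for t in trigrams(text):
        vec |= 1 << (zlib.crc32(t.encode("utf-8")) % bits)
    return vec


def build_index(values, bits=VECTOR_BITS):
    """Post-processing step: OR together the signatures of every string
    value seen for one field in one chunk or file."""
    index = 0
    for v in values:
        index |= signature(v, bits)
    return index


def may_contain(index, term, bits=VECTOR_BITS):
    """True if the chunk/file must be examined, False if it can be skipped."""
    query = signature(term, bits)
    return (index & query) == query
```

The filter can report false positives, since unrelated trigrams may hash to the same bit, but never false negatives: a `False` result means no indexed value contains the term. Terms shorter than three characters produce an empty search vector, so they always require examination.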

The same technique could be used for generic text search (without specification of fields at all), but a longer vector would be required.

@wkalt
Contributor Author

wkalt commented Oct 22, 2023

Much of that idea is lifted from https://www.postgresql.org/docs/current/pgtrgm.html


2 participants