
proposal: secondary indexes #918

Draft · wants to merge 1 commit into main
Conversation

@james-rms (Collaborator) commented Jul 3, 2023

Adds new record types which facilitate fast message lookup by timestamps other than log_time. These closely mirror the existing log_time index structure, but can be based on any user-defined timestamp value.

This PR is currently only for review. Before merging, we would:

  • Implement secondary index writing in the Go client
  • Implement secondary index reading in the Go client
  • Add an option to add a secondary index by publish time or ROS header stamp to the mcap CLI

Background

Roboticists need detailed time information in order to understand exactly what happened, and when, on their robots. For a given frame or message, one or more of these may be recorded:

  1. Log time: when the message was received by the recorder
  2. Publish time: when the message was published by the originator
  3. Event timestamp: messages describing events that happened before publishing may include a timestamp recording when the event occurred (which can be a significant time before the message was published).
  4. Window start and/or end time: Some messages contain information about a window of time (point cloud scans, for example).

Since most robotics frameworks are designed to allow for distributed compute and recording, we assume that within an MCAP file, some messages will be out of order under every timestamp ordering.

Justification

Right now the MCAP format includes an optional index on log_time. However, a roboticist may want to use any of these other timestamp categories to view messages within a given time window, or in a particular time order. Some example use-cases are:

  1. Locating all messages derived from a single camera frame. The camera frame's event timestamp would be the same as the event timestamp for all downstream data messages. (searches by event timestamp)
  2. Locating all point cloud scan frame(s) from multiple LiDARs that overlap with a given event. (searches by event timestamp, window start and window end time)
  3. Analyzing jitter in the publish rate from a given node (reads in publish time order)

Without index information in the file, the above use-cases can be served either by performing a windowed sort (which is how Studio currently supports ordering by header timestamp) or by reading the entire file. A windowed sort can produce incorrect results if an out-of-order message falls outside the window.
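For concreteness, here is a minimal Go sketch (with hypothetical types, not Studio's actual implementation) of the windowed-sort approach: a fixed-size min-heap emits messages in approximate stamp order, and a message displaced further than the window surfaces late, i.e. out of order.

```go
package main

import (
	"container/heap"
	"fmt"
)

// msg is a hypothetical message carrying only its sort key (e.g. header stamp).
type msg struct{ stamp uint64 }

// msgHeap is a min-heap of messages ordered by stamp.
type msgHeap []msg

func (h msgHeap) Len() int           { return len(h) }
func (h msgHeap) Less(i, j int) bool { return h[i].stamp < h[j].stamp }
func (h msgHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *msgHeap) Push(x any)        { *h = append(*h, x.(msg)) }
func (h *msgHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// windowedSort emits messages in approximate stamp order using a fixed-size
// heap. A message displaced by more than `window` positions is emitted late,
// producing an out-of-order result.
func windowedSort(in []msg, window int) []msg {
	h := &msgHeap{}
	var out []msg
	for _, m := range in {
		heap.Push(h, m)
		if h.Len() > window {
			out = append(out, heap.Pop(h).(msg))
		}
	}
	for h.Len() > 0 {
		out = append(out, heap.Pop(h).(msg))
	}
	return out
}

func main() {
	// Stamp 1 arrives three positions late; a window of 2 emits it out of order.
	in := []msg{{5}, {6}, {7}, {1}, {8}}
	for _, m := range windowedSort(in, 2) {
		fmt.Print(m.stamp, " ") // prints: 5 1 6 7 8
	}
	fmt.Println()
}
```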

Paths not taken

There were a few options I considered and rejected during development, listed here for discussion.

  • index arbitrary key types, not just timestamps: My initial stab at this allowed any type to be an index key, not just timestamps. However, the index structure we use here and for log_time assumes that chunk key ranges are almost-disjoint. We do a top-level search through chunk indexes to find the few chunks that overlap with our time range, then a bottom-level search through message indices to find specific messages. If the indexed keys had a random distribution across the file, all chunk ranges would overlap completely, and the top-level search would lose effectiveness. Message timestamps of all kinds tend to be roughly in increasing order across a file, which provides the needed almost-disjoint chunk ranges. I may still add advice to the specification regarding index effectiveness for random timestamps.
    • we could try to support keys with arbitrary types and distributions, but we'd need to build a more sophisticated search index, and I don't see an obvious way to do that without requiring that the writer store the entire index in memory until the summary section is written.
    • we may still want to extend this proposal to allow indexing by message sequence count. That's not a timestamp, but it should still produce almost-disjoint chunk ranges.
  • merging Secondary Chunk Index information into the existing Chunk Index record: Right now we don't have a great story for extending the specification of existing records, so I chose to add all new data into new record types.
  • include secondary key start/end in (or close to) the Chunk record: It might be nice for completeness to include information about secondary key start/end in the chunk record, or in a separate record just before the chunk. This would allow linear readers to skip parsing chunks they don't need. This is still a path we could go down, but I consider it a niche use-case that would be more complex to implement than it's worth.
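As an illustration of the two-level search described in the first bullet, here is a hedged Go sketch using simplified, hypothetical index structs (not the wire format): a top-level pass over chunk key ranges narrows the file to a few chunks, then a bottom-level pass over one chunk's message index locates individual messages.

```go
package main

import "fmt"

// chunkIndex is a simplified, hypothetical summary of one chunk's key range.
type chunkIndex struct {
	chunkOffset            uint64
	earliestKey, latestKey uint64
}

// messageIndexEntry is one (key, offset) pair from a chunk's message index.
type messageIndexEntry struct {
	key    uint64
	offset uint64 // offset within the uncompressed chunk data
}

// chunksOverlapping is the top-level search: keep chunks whose
// [earliestKey, latestKey] range intersects [start, end]. Because timestamps
// are roughly increasing across the file, chunk ranges are almost-disjoint
// and this filter discards most chunks.
func chunksOverlapping(idx []chunkIndex, start, end uint64) []chunkIndex {
	var out []chunkIndex
	for _, ci := range idx {
		if ci.latestKey >= start && ci.earliestKey <= end {
			out = append(out, ci)
		}
	}
	return out
}

// messagesInRange is the bottom-level search: within one chunk's message
// index, keep entries whose key falls in [start, end].
func messagesInRange(idx []messageIndexEntry, start, end uint64) []messageIndexEntry {
	var out []messageIndexEntry
	for _, e := range idx {
		if e.key >= start && e.key <= end {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	chunks := []chunkIndex{
		{chunkOffset: 0, earliestKey: 0, latestKey: 99},
		{chunkOffset: 4096, earliestKey: 95, latestKey: 210}, // slight overlap is fine
		{chunkOffset: 8192, earliestKey: 205, latestKey: 300},
	}
	// Only the middle chunk overlaps [100, 200].
	hits := chunksOverlapping(chunks, 100, 200)
	fmt.Println(len(hits)) // prints: 1
}
```

If chunk key ranges overlapped heavily instead, `chunksOverlapping` would return nearly every chunk and the top-level pass would save nothing, which is the failure mode described above for randomly distributed keys.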

@james-rms (Collaborator, Author) commented:

cc @jhurliman

@@ -179,6 +183,30 @@ The message encoding and schema must match that of the Channel record correspond
| 8 | publish_time | Timestamp | Time at which the message was published. If not available, must be set to the log time. |
| N | data | Bytes | Message data, to be decoded according to the schema of the channel. |

### Secondary Index Key (op=0x10)
Contributor:
I think we should come up with a more future-proof word than "secondary", in case we decide in a V2 to merge the primary indexes into this same record type.

> before reading into the Data section. This means that the Secondary Index Key in the Data section
> is not normally used. However, if an MCAP is truncated and the summary section is lost, having the
> Secondary Index Key appear before any Secondary Message Index records allows the MCAP to be fully
> recovered.
Contributor:

Why is it necessary to have this record type, vs. the parser just inferring from the existence of the other two record types that the index is in use? I can see that dropping this would require moving the "name" field into the secondary chunk index record, but that doesn't seem like the biggest thing we stick in those records anyway.

Today the parser knows a file is indexed via the presence of the index records - we don't need a third record type for that right?

Contributor:

I guess I should follow up with a concrete suggestion - what about

SecondaryMessageIndex
  - name
  - channel_id
  - records

SecondaryChunkIndex
  - name
  - chunk_start_offset
  - first_key
  - last_key
  - message_index_offsets
  - metadata (??? - will elaborate in another comment)

then in the summary offset section, we'd have a new group pointing at "SecondaryChunkIndex". The "name" key is stored in both locations to allow a partially-written file to still have index data recovered, which is a purpose your third record type also supplies. The cost is the duplication of the "name" field in the SecondaryMessageIndex records. It would be good to get some data on how much this costs us - my assumption is that the effect would generally be swamped by the size of "records", thus it doesn't justify the extra record type.

Contributor:

forget the "metadata" part - I think it would be better implemented with a "chunk info" record as described in another comment.

| Bytes | Name               | Type                              | Description                                                                                                      |
| ----- | ------------------ | --------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| 2 | channel_id | uint16 | Channel ID. |
| 2 | secondary_index_id | uint16 | Secondary Index ID. |
| 4 + N | records | `Array<Tuple<Timestamp, uint64>>` | Array of timestamp and offset for each record. Offset is relative to the start of the uncompressed chunk data. |
Contributor:

should be `key, offset`, I think
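For illustration, here is a hedged Go sketch of how the record body above might be serialized, assuming MCAP's usual conventions (little-endian integers, a uint64 content-length after the opcode, and a uint32 byte-length prefix on arrays). The opcode used here is purely illustrative, not from the spec.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// indexEntry is one (key, offset) tuple from the records array.
type indexEntry struct {
	key    uint64 // e.g. a secondary timestamp
	offset uint64 // offset into the uncompressed chunk data
}

// encodeSecondaryMessageIndex serializes the record layout discussed above:
// channel_id, secondary_index_id, then the records array prefixed with its
// byte length (each tuple is 16 bytes). Errors from binary.Write to a
// bytes.Buffer are always nil, so they are ignored in this sketch.
func encodeSecondaryMessageIndex(channelID, secondaryIndexID uint16, records []indexEntry) []byte {
	body := new(bytes.Buffer)
	binary.Write(body, binary.LittleEndian, channelID)
	binary.Write(body, binary.LittleEndian, secondaryIndexID)
	binary.Write(body, binary.LittleEndian, uint32(len(records)*16)) // array byte length
	for _, e := range records {
		binary.Write(body, binary.LittleEndian, e.key)
		binary.Write(body, binary.LittleEndian, e.offset)
	}
	out := new(bytes.Buffer)
	out.WriteByte(0x11) // illustrative opcode, not assigned by the spec
	binary.Write(out, binary.LittleEndian, uint64(body.Len()))
	out.Write(body.Bytes())
	return out.Bytes()
}

func main() {
	rec := encodeSecondaryMessageIndex(1, 2, []indexEntry{{key: 100, offset: 0}, {key: 250, offset: 512}})
	// 1 opcode + 8 length + 2 + 2 + 4 + 2*16 record bytes = 49
	fmt.Println(len(rec)) // prints: 49
}
```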

@wkalt (Contributor) commented Jul 10, 2023

I think rather than conceptualizing this as "secondary indexes", it would be better to think about it as "custom index support" with the intention of it eventually absorbing the existing chunk/message index scheme as well.

I think once we have this "secondary index scheme", what we'd probably want to do is split a "chunk info" record out of the existing "chunk index" record, to contain stuff like compression stats and chunk start/end that aren't really relevant to message indexes. Then your "secondary chunk index" could, I think, be used as the only kind of chunk index.

If we want to, we could also add to "chunk info" some "metadata" or "typed statistics" kind of field, to allow custom index authors to display custom chunk statistics from the info command.

[Summary Offset 0x01]
[Summary Offset 0x05]
[Summary Offset 0x07]
[Summary Offset 0x08]
Contributor:

I think these summary offsets seem wrong, 0x01 is the header and that's not in the summary section.

I think they should be 0x03, 0x04, 0x10, 0x08, 0x12, 0x0A, 0x0B

@wkalt (Contributor) commented Jul 10, 2023

> Adds new record types which facilitate fast message lookup by timestamps other than log_time. These closely mirror the existing log_time index structure, but can be based on any user-defined timestamp value.
>
> This PR is currently only for review. Before merging, we would:
>
> * [ ] Implement secondary index writing in the Go client
> * [ ] Implement secondary index reading in the Go client
> * [ ] Add an option to add a secondary index by publish time or ROS header stamp to the `mcap` CLI

I think we'll also want to slot the integrations into whatever existing info & summarization subcommands would touch it (probably at least "mcap info", "mcap list chunks", and "mcap cat"). That would provide a good demonstration mechanism.

> Background
>
> 4. Window start and/or end time: Some messages contain information about a window of time (point cloud scans, for example).

For this, would you be including two secondary indexes under the proposed scheme? You would get sort by start/end for free but additional work/first-class treatment would probably be required if supporting search by overlap (which seems more niche for this case but I'm not sure what else people may need).



> * **index arbitrary key types, not just timestamps**: My initial stab at this allowed any type to be an index key, not just timestamps. However, the index structure we use here and for `log_time` assumes that chunk key ranges are almost-disjoint. We do a top-level search through chunk indexes to find the few chunks that overlap with our time range, then a bottom-level search through message indices to find specific messages. If the indexed keys had a random distribution across the file, all chunk ranges would overlap completely, and the top-level search would lose effectiveness. Message timestamps of all kinds tend to be roughly in increasing order across a file, which provides the needed almost-disjoint chunk ranges. I may still add advice to the specification regarding index effectiveness for random timestamps.
>   * we could try to support keys with arbitrary types and distributions, but we'd need to build a more sophisticated search index, and I don't see an obvious way to do that without requiring that the writer store the entire index in memory until the summary section is written.

I think even constraining to a uint64 rather than a "Timestamp" would probably be a win. If the user records a massively disordered file based on their desired ordering, it seems ok to me to let tooling fall over or handle as desired, rather than prohibiting via spec. That's the same approach we have taken with the existing timestamps. That would knock out the sequence ID feature as well.

Operations like "get 50 messages ordered on $secondary starting at t" would still be quicker than if the file had no index.

> * **merging Secondary Chunk Index information into the existing Chunk Index record**: Right now we don't have a great story for extending the specification of existing records, so I chose to add all new data into new record types.

+1

> * **include secondary key start/end in (or close to) the Chunk record**: It might be nice for completeness to include information about secondary key start/end in the chunk record, or in a separate record just before the chunk. This would allow linear readers to skip parsing chunks they don't need. This is still a path we could go down, but I consider it a niche use-case that would be more complex to implement than it's worth.

+1. I think the "record before chunk" idea is interesting but easily splittable from this.


## Secondary index keys

The Secondary Index Key `name` field may contain the following options:
Contributor:

Does "may" indicate I can put my own stuff in there and expect some tooling support? I think this would be good to shoot for. Rather than having studio or whatever hard code "header.stamp", "publish_time", etc, would it be viable to dynamically show a list of sort options based on the file's index section?

And likewise with the info command, CLI, reader support etc.

Collaborator (Author):

Yeah, I feel like you should expect some tooling support for any key, but tooling can make extra assumptions about well-known keys.

| Bytes | Name | Type | Description |
| ----- | ------------------ | --------- | --------------------------------------------------------------------------------- |
| 2 | secondary_index_id | uint16 | Secondary Index ID. |
| 8 | chunk_start_offset | uint64 | Offset to the chunk record from the start of the file. |
| 8 | earliest_key | Timestamp | Earliest key in the chunk. Zero if the chunk contains no messages with this key. |
| 8 | latest_key | Timestamp | Latest key in the chunk. Zero if the chunk contains no messages with this key. |
Contributor:

using a zero timestamp as a sentinel is something we do elsewhere, but it's not strictly correct, since 0 is a valid timestamp.

Contributor:

what about saying chunk indexes should be omitted when they would apply to no messages?

@wkalt (Contributor) commented Aug 28, 2023

Another thing to think about (I don't remember if it's part of this patch) is indexes that are chunk-level only. Since remote readers need to download the full chunk to make use of an index on it, and can probably decompress and scan the chunk faster than they can download it (not guaranteed, but common), the benefits of the message-level index can be pretty marginal. However, the chunk-level index at the end of the file may still be very useful, similar to Postgres BRIN indexes: https://www.postgresql.org/docs/15/brin-intro.html
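To make the chunk-level-only tradeoff concrete, here is a rough Go sketch (hypothetical structs, not the wire format) showing how a remote reader could use just chunk-level key ranges, BRIN-style, to bound how many bytes it must download before scanning linearly within each fetched chunk:

```go
package main

import "fmt"

// chunkRange is a hypothetical chunk-level index entry: the byte extent of a
// compressed chunk in the file plus its secondary-key range.
type chunkRange struct {
	offset, length         uint64 // compressed chunk extent in the file
	earliestKey, latestKey uint64
}

// bytesToFetch returns how many bytes a remote reader must download to serve
// a query over [start, end], given only chunk-level index data. Within each
// fetched chunk the reader decompresses and scans linearly; no message-level
// index is consulted.
func bytesToFetch(chunks []chunkRange, start, end uint64) uint64 {
	var total uint64
	for _, c := range chunks {
		if c.latestKey >= start && c.earliestKey <= end {
			total += c.length
		}
	}
	return total
}

func main() {
	chunks := []chunkRange{
		{offset: 0, length: 1 << 20, earliestKey: 0, latestKey: 99},
		{offset: 1 << 20, length: 1 << 20, earliestKey: 100, latestKey: 199},
		{offset: 2 << 20, length: 1 << 20, earliestKey: 200, latestKey: 299},
	}
	// Only the middle chunk overlaps [120, 180]: 1 MiB of 3 MiB total.
	fmt.Println(bytesToFetch(chunks, 120, 180)) // prints: 1048576
}
```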
