Implement concat command based on merge command #1010

Open · wants to merge 3 commits into main
Conversation

@kevswims commented Nov 7, 2023

Public-Facing Changes

Adds a concat command to the mcap CLI that combines multiple files into one, rewriting their timestamps sequentially starting from 0.

Description

We are using this along with the filter command to take small chunks from a lot of files and combine them into one file that we can run simulations on.
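For readers skimming the thread, here is a minimal sketch of the timestamp arithmetic being described (illustrative Go only, not the code in this PR; the type and function names are made up for the example): each file is shifted so the combined output starts at 0, and each subsequent file begins a small gap after the previous one ends.

```go
package main

import "fmt"

// file summarizes the log-time range of one input MCAP, as read from its
// summary section (message start/end times).
type file struct {
	name       string
	start, end uint64 // nanoseconds
}

// concatOffsets returns, for each input file, the amount to subtract from its
// message log times so the combined output starts at 0 and each file begins
// `gap` nanoseconds after the previous one ends.
func concatOffsets(files []file, gap uint64) []int64 {
	offsets := make([]int64, len(files))
	var nextStart uint64 // where the next file should begin in the output
	for i, f := range files {
		// shifted time = original time - offsets[i]
		offsets[i] = int64(f.start) - int64(nextStart)
		duration := f.end - f.start
		nextStart += duration + gap
	}
	return offsets
}

func main() {
	files := []file{
		{"a.mcap", 1_000_000_000, 5_000_000_000},
		{"b.mcap", 9_000_000_000, 12_000_000_000},
	}
	for i, off := range concatOffsets(files, 100_000_000) { // 100 ms gap
		fmt.Printf("%s: subtract %d ns from every log time\n", files[i].name, off)
	}
}
```

With the example inputs above, the first file is shifted to cover 0–4 s of output time and the second to start 100 ms after that.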

@james-rms (Collaborator)

This proposal seems specific to your use-case.

MCAP files don't define a clear start and end of recording time; they only record the time of the first message and the time of the last message. Therefore, when joining start-to-end like this, you have to introduce some "gap" (you chose 100ms here) so as not to have a discontinuity in the message rate. That gap is always going to be an approximation, and the correct value depends entirely on how the MCAPs were recorded.

I could imagine decomposing this concept into mcap timeshift then mcap merge, forcing the user to make a considered decision about how much time to shift the second file by.

@kevswims (Author) commented Nov 7, 2023

@james-rms
I don't think this is that specific to our use-case. I only used the 100ms gap to prevent the last message of one file and the first message of the next from having the same timestamp; the actual difference between those messages does not matter at all for what we are doing. I could add a flag to set the gap if that would help other use-cases.

Our use-case is to combine multiple captures recorded at different times so that we can easily run them through our playback infrastructure and get a single output covering many different scenarios. Your suggestion of a timeshift command would work for this, but it feels less elegant: the program driving the combination would have to query the timestamps with mcap info and then parse the start and end times out of that output. Implementing something like that would require mcap info to offer some machine-readable output format such as JSON for the tool to parse, since I don't like relying on the format of a command's human-readable output staying the same.
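For illustration only, machine-readable output along these lines (hypothetical field names; not something mcap info currently emits per this thread) would be enough for a wrapper script to compute the shift:

```json
{
  "messageStartTime": 1696150000000000000,
  "messageEndTime": 1696150030000000000
}
```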

@james-rms (Collaborator)

@kevswims are you able to go into more detail on what other use-cases you think this would be useful for?

@kevswims (Author) commented Nov 9, 2023

The use cases we have identified are:

  • Combining data collected at different times into one file for running through simulation and automated test applications
  • Combining data to show for presentations where we don't want to be flipping between multiple Foxglove instances to show different scenarios
  • Combining data into tagged datasets for running through AI training. For example, we could pick out every scene where the system sees a car and combine them into one file for training, so we don't have to manage lots of small files.

To build such a file with this code, we have already written scripts that take a CSV file listing all of the source files and the start and end timestamps we want; the scripts cut the files up and combine them (a sketch of such a manifest is below).
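A manifest of the kind described might look something like this (hypothetical column names and values):

```csv
source_file,start_time_ns,end_time_ns
drive_a.mcap,1696150000000000000,1696150030000000000
drive_b.mcap,1696500012000000000,1696500045000000000
```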

Think of this like making a video: we go out and record lots of data, most of which is duplicated or useless, but we want to edit it down and combine multiple clips into one thing that tells the story. I could also see a tool like this integrating with the Foxglove Data Platform to let anyone easily trim and combine MCAP files to create these datasets.

@kevswims (Author)

@james-rms did the use-cases I provided make sense for this?

@james-rms (Collaborator)

@kevswims They make sense, but we haven't seen anyone else ask for this or work out their own solution to this problem. Until we see more interest from the community, I think it makes the most sense for this patch to remain in your fork.

@wkalt (Contributor) commented Apr 22, 2024

I think concatenation has good use-cases, particularly if you can concatenate without decompressing chunks, which will require a few more tricks than this but should be totally feasible. The use-case I have in mind is more about compacting small files into larger ones for better and more consistent performance in cloud storage.

For me it would suffice to have a smarter "merge" command that looks at the indexes and determines a plan of "concat" and "merge" operations that minimizes the number of file handles/chunks open at one time. If you supply it non-overlapping data, it will perform a pure concatenation; otherwise it will concatenate where it can. A rough sketch of that planning step is below.
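One possible interpretation of that planner, sketched in Go (illustrative only; the struct and function names are invented, and the time ranges would come from each file's summary section): sort files by start time, group any overlapping ranges into a merge operation, and emit everything else as plain concatenation, so only one group is open at a time.

```go
package main

import (
	"fmt"
	"sort"
)

// span is one input file's message log-time range, as reported by its summary.
type span struct {
	name       string
	start, end uint64
}

// op is either a pure concatenation of one file or a merge of files whose
// time ranges overlap.
type op struct {
	kind  string // "concat" or "merge"
	files []string
}

// plan sorts files by start time and greedily groups overlapping ones into a
// single merge; non-overlapping files are emitted as concat steps.
func plan(files []span) []op {
	sort.Slice(files, func(i, j int) bool { return files[i].start < files[j].start })
	var ops []op
	for i := 0; i < len(files); {
		group := []string{files[i].name}
		end := files[i].end
		j := i + 1
		for j < len(files) && files[j].start <= end { // overlaps the current group
			group = append(group, files[j].name)
			if files[j].end > end {
				end = files[j].end
			}
			j++
		}
		if len(group) == 1 {
			ops = append(ops, op{"concat", group})
		} else {
			ops = append(ops, op{"merge", group})
		}
		i = j
	}
	return ops
}

func main() {
	// a and b overlap and must be merged; c can be appended as-is.
	fmt.Println(plan([]span{
		{"a.mcap", 0, 10}, {"b.mcap", 5, 12}, {"c.mcap", 20, 30},
	}))
}
```

The output of the plan would then be executed in order, with the concat steps ideally copying chunks without decompressing them.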
