Support stream Chunk reading in Python API #974

Open
WatweA opened this issue Sep 18, 2023 · 1 comment
Comments


WatweA commented Sep 18, 2023

In the Python mcap reader API, each chunk is read into memory in full. This often causes out-of-memory (OOM) errors on smaller systems.

Main reader loop:

if isinstance(next_item, ChunkIndex):
    self._stream.seek(next_item.chunk_start_offset + 1 + 8, io.SEEK_SET)
    chunk = Chunk.read(ReadDataStream(self._stream))
    for index, record in enumerate(
        breakup_chunk(chunk, validate_crc=self._validate_crcs)
    ):
        if isinstance(record, Message):
            channel = summary.channels[record.channel_id]
            if topics is not None and channel.topic not in topics:
                continue
            if start_time is not None and record.log_time < start_time:
                continue
            if end_time is not None and record.log_time >= end_time:
                continue
            if channel.schema_id == 0:
                schema = None
            else:
                schema = summary.schemas[channel.schema_id]
            message_queue.push(
                (
                    (schema, channel, record),
                    next_item.chunk_start_offset,
                    index,
                )
            )

Chunk data read line:
data = stream.read(data_length)
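For contrast, a streaming alternative could consume the compressed payload in fixed-size pieces rather than in a single `stream.read(data_length)` call. A minimal sketch of that shape, using stdlib `zlib` as a stand-in codec (real mcap chunks are typically zstd or lz4, and the `stream_decompress` helper and `CHUNK_READ_SIZE` constant here are hypothetical names, not part of the mcap API):

```python
import io
import zlib

CHUNK_READ_SIZE = 64 * 1024  # read the compressed payload in 64 KiB pieces

def stream_decompress(stream, data_length):
    """Yield decompressed bytes without buffering the whole chunk.

    Hypothetical sketch: zlib stands in for the chunk's real codec.
    Only CHUNK_READ_SIZE bytes of compressed input are held at a time.
    """
    decompressor = zlib.decompressobj()
    remaining = data_length
    while remaining > 0:
        piece = stream.read(min(CHUNK_READ_SIZE, remaining))
        if not piece:
            raise EOFError("unexpected end of chunk data")
        remaining -= len(piece)
        yield decompressor.decompress(piece)
    yield decompressor.flush()

# usage: round-trip a payload larger than the read size
payload = zlib.compress(b"x" * 1_000_000)
out = b"".join(stream_decompress(io.BytesIO(payload), len(payload)))
```

The peak memory for the compressed input drops from the whole chunk to one read-sized piece; the decompressed records would still need to be parsed incrementally (see below) to get the full benefit.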

wkalt (Contributor) commented Sep 18, 2023

I think this idea is generally good. The main constraint is that to make use of the message index, you need to decompress the whole chunk first. But if you don't care about the message index and are just doing a linear read of the file, streaming decompression of chunks is exactly what you want. We support this in the Go library, but in none of the others as far as I know.
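The linear case described above works because mcap records are self-framing: each record is a 1-byte opcode followed by an 8-byte little-endian payload length, so a reader can parse records as soon as enough decompressed bytes arrive. A hedged sketch of index-free record iteration (the `iter_records` helper is hypothetical; in practice `stream` would be a streaming decompressor over the chunk payload, not a `BytesIO`):

```python
import io
import struct

def iter_records(stream):
    """Yield (opcode, payload) pairs from a decompressed record stream.

    Records are consumed one at a time as bytes become available,
    without buffering the whole chunk. Message-index-based seeking is
    deliberately not supported here -- that requires the full chunk.
    """
    while True:
        header = stream.read(9)
        if len(header) < 9:
            return  # end of stream
        opcode, length = struct.unpack("<BQ", header)
        yield opcode, stream.read(length)

# usage: two fake records framed the same way (opcode, length, payload)
buf = struct.pack("<BQ", 5, 3) + b"abc" + struct.pack("<BQ", 2, 1) + b"z"
records = list(iter_records(io.BytesIO(buf)))
```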

I think the main place you benefit from this is when reading many mcap files at once (for instance, merging them together). OOMs while reading a single MCAP file may also be addressed, but not when they are caused by oversized individual messages, which in my experience is the most common reason for oversized chunks. Absent massive messages, if the writer is configured with a sane chunk size then OOMs should not generally occur when reading a single file: the default decompressed chunk sizes are in the single-digit megabytes, and you generally only need one chunk open at a time.
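One cheap sanity check along these lines is to flag any chunk whose decompressed size is far above the single-digit-megabyte default. A sketch (the function name and the 8 MiB threshold are my own choices, not an official limit; the sizes would come from the uncompressed-size field of each ChunkIndex in the file's summary):

```python
def flag_oversized_chunks(uncompressed_sizes, limit=8 * 1024 * 1024):
    """Return indexes of chunks whose decompressed size exceeds `limit`.

    `uncompressed_sizes` is an iterable of per-chunk decompressed sizes
    in bytes, e.g. collected from the ChunkIndex records in an MCAP
    file's summary section. The 8 MiB default is illustrative only.
    """
    return [i for i, size in enumerate(uncompressed_sizes) if size > limit]

# usage: a 4 MB chunk is fine, a 2 GB chunk is suspicious
suspicious = flag_oversized_chunks([4_000_000, 2_000_000_000])
```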

So to summarize, I think the feature is good, but if the context here is reading single files it would be worth checking whether there's anything odd going on with your chunking. You can do this with

mcap info file.mcap

and

mcap list chunks file.mcap

which will show the individual chunk sizes. There was an old bug in the Rust writer that resulted in single-chunk files (#777). I'm curious whether you may be on an old version of the Rust writer that still has this behavior, or are creating huge chunks in some other way.
