Support stream Chunk reading in Python API #974

Open
WatweA opened this issue Sep 18, 2023 · 1 comment
Comments


WatweA commented Sep 18, 2023

In the Python mcap reader API, each chunk is read into memory in full. This often causes out-of-memory (OOM) errors on smaller systems.

Main reader loop:

if isinstance(next_item, ChunkIndex):
    self._stream.seek(next_item.chunk_start_offset + 1 + 8, io.SEEK_SET)
    chunk = Chunk.read(ReadDataStream(self._stream))
    for index, record in enumerate(
        breakup_chunk(chunk, validate_crc=self._validate_crcs)
    ):
        if isinstance(record, Message):
            channel = summary.channels[record.channel_id]
            if topics is not None and channel.topic not in topics:
                continue
            if start_time is not None and record.log_time < start_time:
                continue
            if end_time is not None and record.log_time >= end_time:
                continue
            if channel.schema_id == 0:
                schema = None
            else:
                schema = summary.schemas[channel.schema_id]
            message_queue.push(
                (
                    (schema, channel, record),
                    next_item.chunk_start_offset,
                    index,
                )
            )

Chunk data read line:
data = stream.read(data_length)
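For contrast, a streaming alternative could consume the compressed payload in fixed-size pieces rather than in a single `stream.read(data_length)` call. A minimal sketch of that shape, using stdlib `zlib` as a stand-in codec (real mcap chunks are typically zstd or lz4, and the `stream_decompress` helper and `CHUNK_READ_SIZE` constant here are hypothetical names, not part of the mcap API):

```python
import io
import zlib

CHUNK_READ_SIZE = 64 * 1024  # read the compressed payload in 64 KiB pieces

def stream_decompress(stream, data_length):
    """Yield decompressed bytes without buffering the whole chunk.

    Hypothetical sketch: zlib stands in for the chunk's real codec.
    Only CHUNK_READ_SIZE bytes of compressed input are held at a time.
    """
    decompressor = zlib.decompressobj()
    remaining = data_length
    while remaining > 0:
        piece = stream.read(min(CHUNK_READ_SIZE, remaining))
        if not piece:
            raise EOFError("unexpected end of chunk data")
        remaining -= len(piece)
        yield decompressor.decompress(piece)
    yield decompressor.flush()

# usage: round-trip a payload larger than the read size
payload = zlib.compress(b"x" * 1_000_000)
out = b"".join(stream_decompress(io.BytesIO(payload), len(payload)))
```

The peak memory for the compressed input drops from the whole chunk to one read-sized piece; the decompressed records would still need to be parsed incrementally (see below) to get the full benefit.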

wkalt (Contributor) commented Sep 18, 2023

I think this idea is generally good. The main constraint is that to make use of the message index, you need to decompress the whole chunk first. But if you don't care about the message index and are just doing a linear read of the file, streaming decompression of chunks is exactly what you want. We support this in the Go library, but in none of the others as far as I know.
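The linear case described above works because mcap records are self-framing: each record is a 1-byte opcode followed by an 8-byte little-endian payload length, so a reader can parse records as soon as enough decompressed bytes arrive. A hedged sketch of index-free record iteration (the `iter_records` helper is hypothetical; in practice `stream` would be a streaming decompressor over the chunk payload, not a `BytesIO`):

```python
import io
import struct

def iter_records(stream):
    """Yield (opcode, payload) pairs from a decompressed record stream.

    Records are consumed one at a time as bytes become available,
    without buffering the whole chunk. Message-index-based seeking is
    deliberately not supported here -- that requires the full chunk.
    """
    while True:
        header = stream.read(9)
        if len(header) < 9:
            return  # end of stream
        opcode, length = struct.unpack("<BQ", header)
        yield opcode, stream.read(length)

# usage: two fake records framed the same way (opcode, length, payload)
buf = struct.pack("<BQ", 5, 3) + b"abc" + struct.pack("<BQ", 2, 1) + b"z"
records = list(iter_records(io.BytesIO(buf)))
```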

I think the main place you benefit from this is when reading many mcap files at once (for instance, merging them together). OOMs while reading a single MCAP file may also be addressed, but not when they are caused by oversized individual messages, which in my experience is the most common reason for oversized chunks. Absent massive messages, if the writer is configured with a sane chunk size then OOMs should not generally occur when reading a single file: the default decompressed chunk sizes are in the single-digit megabytes, and you generally only need one chunk open at a time.
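One cheap sanity check along these lines is to flag any chunk whose decompressed size is far above the single-digit-megabyte default. A sketch (the function name and the 8 MiB threshold are my own choices, not an official limit; the sizes would come from the uncompressed-size field of each ChunkIndex in the file's summary):

```python
def flag_oversized_chunks(uncompressed_sizes, limit=8 * 1024 * 1024):
    """Return indexes of chunks whose decompressed size exceeds `limit`.

    `uncompressed_sizes` is an iterable of per-chunk decompressed sizes
    in bytes, e.g. collected from the ChunkIndex records in an MCAP
    file's summary section. The 8 MiB default is illustrative only.
    """
    return [i for i, size in enumerate(uncompressed_sizes) if size > limit]

# usage: a 4 MB chunk is fine, a 2 GB chunk is suspicious
suspicious = flag_oversized_chunks([4_000_000, 2_000_000_000])
```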

So to summarize, I think the feature is good, but if the context here is reading single files it would be worth checking whether there's anything odd going on with your chunking. You can do this with

mcap info file.mcap

and

mcap list chunks file.mcap

which will show the individual chunk sizes. There was an old bug in the Rust writer that resulted in single-chunk files (#777). I'm curious whether you may be on an old version of the Rust writer that still has this behavior, or are creating huge chunks in some other way.
