
Add mcap du command #1021

Open · defunctzombie wants to merge 1 commit into main
Conversation

defunctzombie (Contributor) commented Nov 19, 2023

Problem: When looking at mcap files I want to understand how much space particular topics take up. I might want to know if I have some heavyweight topics I need to trim or if some debug topic is taking a large amount of space relative to its worth. Having a command that shows me how much "space" is used by each topic can help me answer such questions.

This change introduces a new mcap du command. The command reads an mcap file and outputs "disk usage" statistics about it.

Below I show invocations of this command on some sample mcap files. The output shows the space taken by each kind of record and then the space per topic (relative to all topics). For me, the primary use case for this command is the "topic" information, but I wanted to offer both sets of "usage" output for discussion. I would probably hide the "record" breakdown behind a flag or remove it entirely.

$ mcap du ~/Downloads/demo_2023-11-18_08-59-16.mcap
RECORD KIND    | SUM BYTES | % OF TOTAL FILE BYTES  
-----------------+-----------+------------------------
header         |        21 |              0.000035  
data end       |         4 |              0.000007  
schema         |       872 |              0.001466  
channel        |        95 |              0.000160  
statistics     |        76 |              0.000128  
chunk          |  59343216 |             99.748138  
message index  |     50860 |              0.085489  
chunk index    |     97828 |              0.164436  
summary offset |        68 |              0.000114  
footer         |        20 |              0.000034  

TOPIC       | SUM BYTES | % OF TOTAL MESSAGE BYTES  
--------------+-----------+---------------------------
camera_h264 |   3723297 |                 6.288593  
mouse       |     29340 |                 0.049555  
camera_jpeg |  55454516 |                93.661850
$ mcap du ~/Downloads/NuScenes-v1.0-mini-scene-0061-f4fbf7b.mcap
RECORD KIND    | SUM BYTES | % OF TOTAL FILE BYTES  
-----------------+-----------+------------------------
statistics     |       456 |              0.000089  
unknown        |        30 |              0.000006  
summary offset |       102 |              0.000020  
header         |        21 |              0.000004  
chunk          | 511377621 |             99.861298  
message index  |    582526 |              0.113755  
schema         |     15152 |              0.002959  
channel        |      1786 |              0.000349  
metadata       |       211 |              0.000041  
data end       |         4 |              0.000001  
chunk index    |    109996 |              0.021480  
footer         |        20 |              0.000004  

TOPIC                                  | SUM BYTES | % OF TOTAL MESSAGE BYTES  
-----------------------------------------+-----------+---------------------------
/CAM_FRONT_RIGHT/image_rect_compressed |  31133352 |                 4.014049  
/CAM_BACK_LEFT/image_rect_compressed   |  36039727 |                 4.646631  
/CAM_BACK_LEFT/camera_info             |     62359 |                 0.008040  
/RADAR_FRONT_LEFT                      |    643465 |                 0.082962  
/CAM_FRONT/camera_info                 |     62870 |                 0.008106  
/CAM_FRONT/lidar                       |  38053088 |                 4.906216  
/CAM_FRONT_RIGHT/annotations           |    977461 |                 0.126025  
/RADAR_FRONT                           |   1408531 |                 0.181603  
/RADAR_BACK_LEFT                       |   1443803 |                 0.186151  
/CAM_BACK/image_rect_compressed        |  29820298 |                 3.844755  
/CAM_BACK_LEFT/lidar                   |  51820479 |                 6.681257  
/CAM_FRONT_LEFT/image_rect_compressed  |  36577185 |                 4.715926  
/CAM_FRONT_LEFT/camera_info            |     63989 |                 0.008250  
/gps                                   |       702 |                 0.000091  
/drivable_area                         |   3999277 |                 0.515630  
/CAM_FRONT_RIGHT/camera_info           |     62225 |                 0.008023  
/CAM_BACK_RIGHT/annotations            |    820174 |                 0.105746  
/CAM_BACK_LEFT/annotations             |    157148 |                 0.020261  
/CAM_FRONT_LEFT/annotations            |    288426 |                 0.037187  
/markers/annotations                   |    846641 |                 0.109158  
/markers/car                           |      5794 |                 0.000747  
/diagnostics                           |   3512567 |                 0.452878  
/RADAR_BACK_RIGHT                      |   1386978 |                 0.178824  
/CAM_FRONT/annotations                 |   1284639 |                 0.165630  
/CAM_FRONT/image_rect_compressed       |  32778271 |                 4.226129  
/CAM_BACK_RIGHT/image_rect_compressed  |  33640191 |                 4.337257  
/CAM_BACK_RIGHT/camera_info            |     62294 |                 0.008032  
/CAM_BACK/lidar                        |  55684474 |                 7.179444  
/CAM_BACK/annotations                  |   1606465 |                 0.207123  
/odom                                  |    371586 |                 0.047909  
/map                                   |  15768687 |                 2.033070  
/tf                                    |    268080 |                 0.034564  
/CAM_BACK_RIGHT/lidar                  |  44188116 |                 5.697210  
/CAM_FRONT_LEFT/lidar                  |  43207431 |                 5.570769  
/pose                                  |      1465 |                 0.000189  
/imu                                   |    583740 |                 0.075262  
/semantic_map                          |     59500 |                 0.007671  
/CAM_FRONT_RIGHT/lidar                 |  40571903 |                 5.230968  
/CAM_BACK/camera_info                  |     60430 |                 0.007791  
/RADAR_FRONT_RIGHT                     |    986438 |                 0.127182  
/LIDAR_TOP                             | 265299545 |                34.205288

Implementation notes:

  • This implementation reads the entire file. There is an opportunity for "optimization" by using the MessageIndex records to figure out the size of records without decompressing chunks. I view this as out of scope for v1. A rough sketch of the read-everything approach is shown after this list.
  • This is introduced as a separate command. I could see it fitting naturally under the info command as a flag.
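
To make the read-everything approach concrete, here is a minimal sketch of what such a pass could look like in Go. This is not the code from this PR: it assumes the Go MCAP lexer API (mcap.NewLexer, Lexer.Next, mcap.ParseChannel, mcap.ParseMessage from github.com/foxglove/mcap/go/mcap), counts record payload bytes as returned by the lexer (which excludes the opcode/length prefix and chunk framing), and prints raw counts rather than the tables shown above.

package main

import (
    "errors"
    "fmt"
    "io"
    "os"

    "github.com/foxglove/mcap/go/mcap"
)

func main() {
    f, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // With default options the lexer decompresses chunks and yields the
    // records inside them one at a time.
    lexer, err := mcap.NewLexer(f, &mcap.LexerOptions{})
    if err != nil {
        panic(err)
    }

    recordKindSize := map[mcap.TokenType]uint64{} // payload bytes per record kind
    topicMessageSize := map[string]uint64{}       // message payload bytes per topic
    channelTopic := map[uint16]string{}           // channel ID -> topic name

    for {
        token, data, err := lexer.Next(nil)
        if errors.Is(err, io.EOF) {
            break
        }
        if err != nil {
            panic(err)
        }
        size := uint64(len(data)) // record payload only; opcode/length prefix not included
        recordKindSize[token] += size

        switch token {
        case mcap.TokenChannel:
            if ch, err := mcap.ParseChannel(data); err == nil {
                channelTopic[ch.ID] = ch.Topic
            }
        case mcap.TokenMessage:
            if msg, err := mcap.ParseMessage(data); err == nil {
                topicMessageSize[channelTopic[msg.ChannelID]] += size
            }
        }
    }

    var totalMessageBytes uint64
    for _, n := range topicMessageSize {
        totalMessageBytes += n
    }
    for kind, n := range recordKindSize {
        fmt.Printf("record kind %v: %d bytes\n", kind, n)
    }
    for topic, n := range topicMessageSize {
        fmt.Printf("%-40s %12d bytes (%.6f%% of message bytes)\n",
            topic, n, 100*float64(n)/float64(totalMessageBytes))
    }
}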

@@ -0,0 +1,19 @@
{
Collaborator commented: what does this do?

// total message size by topic name
topicMessageSize map[string]uint64

totalSize uint64
james-rms (Collaborator) commented Nov 19, 2023

totally pedantic, but if we're going to come up with totalSize by summing up all the bytes that come out of the lexer, we'll need to add 8 bytes each for the header and trailer magic.
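
For illustration only, the adjustment described above is small; a sketch, with hypothetical names not taken from this PR:

// fileBytesFromRecords illustrates the point above: if totalSize is
// accumulated from record bytes seen by the lexer, the 8-byte magic at the
// start of the file and the 8-byte magic at the end are not part of any
// record and must be added separately for the total to match the on-disk size.
func fileBytesFromRecords(sumOfRecordBytes uint64) uint64 {
    const magicLen = 8 // "\x89MCAP0\r\n"
    return sumOfRecordBytes + 2*magicLen
}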

}

printTable(os.Stdout, rows, []string{
"record kind", "sum bytes", "% of total file bytes",
Collaborator suggested change:
"record kind", "sum bytes", "% of total file bytes",
"record kind", "sum bytes", "% of total file bytes (after chunk decompression)",

return nil
}

func printTable(w io.Writer, rows [][]string, header []string) {
Collaborator commented:

use utils.FormatTable here for consistency with other commands?

james-rms (Collaborator) left a comment

The help text for this subcommand should remind the user that this reads the entire file.

Regarding showing per-record-type usage

We've had questions around this before; sometimes non-message records can take up a surprising amount of room. For example, when @amacneil was coming up with plot test data, his example file with millions of tiny messages had almost half its space taken up by MessageIndex records. We've also previously had writers who re-wrote schema records for every new channel, which can also add up. I think this is valuable to know.

Adjusting for compression

There's some nuance missing here around compression.

If I'm looking for the total % contribution of some record type or topic to the file's size on disk, I need to see the effect that compression is having on that size. Right now you display percentages where the denominator is a sum of all record bytes after decompression. This will de-emphasise the impact of uncompressed records, particularly MessageIndex records.

Given:

  1. sum of uncompressed record bytes ru
  2. sum of uncompressed record bytes in chunk n rc[n]
  3. the compressed data size of chunk n tc[n]
  4. the uncompressed data size of chunk n tu[n]
  5. the total size of the file f

I could imagine synthesizing a record impact estimate as:

impact % = (ru + sum(rc[n] * (tc[n] / tu[n]) for all n)) / f

which assumes all bytes in a compressed chunk contribute equally to the compressed output size, which isn't true, but there's no easy way to know the right answer.
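
A small Go sketch of that estimate, using the reviewer's notation (the slices are assumed to be indexed by chunk and gathered during the read pass; this is an illustration, not code from the PR):

// impactPercent implements the estimate above:
//   impact % = (ru + sum(rc[n] * (tc[n] / tu[n]) for all n)) / f
// ru: uncompressed bytes of the record type (or topic) outside chunks
// rc: uncompressed bytes of the record type inside chunk n
// tc: compressed data size of chunk n
// tu: uncompressed data size of chunk n
// f:  total file size in bytes
func impactPercent(ru uint64, rc, tc, tu []uint64, f uint64) float64 {
    estimate := float64(ru)
    for n := range rc {
        if tu[n] == 0 {
            continue // skip empty chunks to avoid dividing by zero
        }
        // Scale this chunk's contribution by the chunk's compression ratio,
        // assuming all bytes in the chunk compress equally well.
        estimate += float64(rc[n]) * float64(tc[n]) / float64(tu[n])
    }
    return 100 * estimate / float64(f)
}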

If this feels too complicated to present, you could consider presenting some of these values separately and allowing the user to do their own mental arithmetic.

Schemas

IMO it's worth parsing schema records and including the schema names and encodings when presenting per-topic statistics.

Human-readable bytes

We have a humanBytes function used in info.go; you could re-use that to make the byte counts more presentable (perhaps under a flag).
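
The exact shape of that helper isn't shown in this thread; as an illustration only, a typical IEC-style formatter looks like the following (assuming fmt is imported; this is not the humanBytes from info.go):

// humanReadableBytes is an illustrative stand-in for the kind of helper the
// reviewer mentions; the actual humanBytes in info.go may differ.
func humanReadableBytes(n uint64) string {
    const unit = 1024
    if n < unit {
        return fmt.Sprintf("%d B", n)
    }
    div, exp := uint64(unit), 0
    for v := n / unit; v >= unit; v /= unit {
        div *= unit
        exp++
    }
    return fmt.Sprintf("%.1f %ciB", float64(n)/float64(div), "KMGTPE"[exp])
}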
