Add support for zstd content compression #423

rcombs · 2020-09-09T04:06:39Z

The other compression schemes supported in Matroska suffer from having to be applied on a per-packet level, with no global state. This means that they can't exploit any of the redundancy between packets, and any Huffman tables or other configuration have to be duplicated in every packet. This severely limits the attainable compression efficiency to approximately the level attained by zlib, which has led to the other algorithms receiving very little adoption by content authors, and little support from tool vendors.

The zstd library, on the other hand, supports generating a dictionary from a large number of small inputs (e.g. packets), and using that dictionary to compress similar inputs more efficiently. This improves performance on subtitle inputs dramatically.

Muxing tools can either scan the input and pass its packets into zstd before muxing an individual file, or provide an auxiliary tool that generates a dictionary from the packets of one or more input files, then saves the result to a file that can be reused when muxing tracks with similar content.

mcr · 2020-09-09T15:07:26Z

It seems like the dictionary from zstd needs to either be included somewhere at the beginning of the Matroska file, or be provided along with it. Is there a citable open specification for zstd, other than source code?

rcombs · 2020-09-09T17:15:15Z

The dictionary goes in ContentCompSettings, as with the removed bytes in header-stripping. It's optional (i.e. if no ContentCompSettings element is present, the compression was done without an explicit dictionary).

Zstandard is defined by RFC8478.

mcr · 2020-09-10T20:23:08Z

> The dictionary goes in ContentCompSettings, as with the removed bytes in header-stripping. It's optional (i.e. if no ContentCompSettings element is present, the compression was done without an explicit dictionary).
>
> Zstandard is defined by RFC8478.

Cool, I didn't know that had been published.

robUx4 · 2020-09-13T08:02:02Z

This seems like a nice addition. For now, though, we are concentrating on adding existing features (and removing unused ones), so this seems like something that should go in the next version of Matroska. In particular, parsers up to v4 do not expect that value. When new elements are added, parsers know they can skip them. A new value in an enum like this one affects how the block data is parsed, which could make new files unreadable by existing parsers.

To avoid this, the muxer should mark the file as only readable from Matroska version 5.

Technically there is currently no way to define a minver/maxver value for an enum value, so we need to add that (something to add to the EBML Schema format). It's not defined yet, but you could just add minver="5" to the new enum value.

As for the compression algorithm itself, the fact that it's defined by an RFC is a big bonus (free to use). I wonder how practical it would be. It seems that you can only get a proper dictionary if you scan all your sources ahead of muxing. Otherwise you use a lowest common denominator for a particular codec, but it's less efficient. And in that case muxers will need to create their own dictionary. It's feasible, but I wonder if there's much gain to expect over the other compression mechanisms. Compression already doesn't work well on compressed codecs (unlike header stripping), so that limits the compression to raw formats (audio, video, bitmap, text).

rcombs · 2020-09-13T08:15:50Z

I'd expect this to mainly be useful for subtitle formats. Some quick testing on real files indicated theoretical improvements in the range of 2.5~3x over zlib in typesetting-heavy cases.

Zlib can actually use user-supplied dictionaries as well, and in a very similar way to how zstd does. The problem is that there doesn't appear to be any decent tooling available to generate dictionaries for zlib, whereas the zstd library includes functions to generate a dictionary from a passed-in data set.
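To illustrate the preset-dictionary mechanism mentioned above, here is a minimal sketch using Python's standard `zlib` module, where each packet is compressed independently against a shared dictionary. The sample packets and the hand-written dictionary are hypothetical stand-ins (zlib has no trainer, which is exactly the tooling gap described above), and the helper names are made up for this example.

```python
import zlib

# Hypothetical packets standing in for subtitle events (not from a real file).
packets = [
    b"Dialogue: 0,0:00:01.00,0:00:03.00,Default,,0,0,0,,Hello there",
    b"Dialogue: 0,0:00:03.50,0:00:05.00,Default,,0,0,0,,How are you?",
    b"Dialogue: 0,0:00:05.20,0:00:07.00,Default,,0,0,0,,Fine, thanks",
]

# Hand-written dictionary of byte strings expected to recur in every packet.
# With zstd this would instead come from the library's dictionary trainer.
zdict = b"Dialogue: 0,0:00:0,Default,,0,0,0,,"


def compress_packet(data, zdict=None):
    """Compress one packet on its own, optionally against a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_packet(data, zdict=None):
    """Inverse of compress_packet; the same dictionary must be supplied."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(data) + decomp.flush()


plain_total = sum(len(compress_packet(p)) for p in packets)
dict_total = sum(len(compress_packet(p, zdict)) for p in packets)
print("without dictionary:", plain_total, "bytes")
print("with dictionary:   ", dict_total, "bytes")
```

Each packet still decodes on its own, so seeking is unaffected; the only shared state is the dictionary, which in this proposal would be carried in ContentCompSettings.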

robUx4 · 2020-09-13T08:36:46Z

I think lzo1x can handle dictionaries too, but I couldn't find the code/link about that.

robUx4 · 2021-11-28T08:17:47Z

Now that zstd is RFC8478 that's an extra incentive to support it.

Since we already discourage the use of some compression values:

> Decoding implementations MAY support methods "1" and "2" as possible

we might as well add a new value that is also known not to work on all implementations. In v5 we could make it mandatory.

robUx4 · 2022-01-23T07:37:53Z

@rcombs could you rebase and adapt this so it can be merged?

robUx4 · 2022-02-13T10:05:59Z

Looking at RFC8478, I wonder if we should add constraints on how zstd would be used. There is a Magic Number at the start that could be stripped, although that could be done by combining it with a separate ContentEncoding. There can be more than one frame, each with a magic number. It would be nice to be able to strip all the magic numbers as well. Not all zstd libraries may be ready to parse such a stream directly, so it might be necessary to reconstruct the zstd frames. That may lose some of the speed advantage because of the extra memory copies.
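As a rough sketch of what stripping the leading magic number would involve, assuming only the first frame's magic is removed: the Zstandard frame magic is 0xFD2FB528, stored little-endian. The function names are hypothetical, and the payload in the usage lines is a placeholder rather than a valid compressed frame.

```python
# Zstandard frame magic number 0xFD2FB528, as it appears on the wire (little-endian).
ZSTD_MAGIC = bytes.fromhex("28b52ffd")


def strip_magic(frame: bytes) -> bytes:
    """Muxer side: drop the 4-byte magic from the start of a zstd frame."""
    if not frame.startswith(ZSTD_MAGIC):
        raise ValueError("data does not start with a Zstandard frame magic")
    return frame[len(ZSTD_MAGIC):]


def restore_magic(stripped: bytes) -> bytes:
    """Demuxer side: rebuild the frame before handing it to a zstd decoder.

    The concatenation allocates a new buffer, which is the extra memory copy
    mentioned above.
    """
    return ZSTD_MAGIC + stripped


# Placeholder bytes after the magic; not a decodable zstd frame.
fake_frame = ZSTD_MAGIC + b"\x00placeholder"
stored = strip_magic(fake_frame)
rebuilt = restore_magic(stored)
```

This saves 4 bytes per Block when each Block holds a single frame; handling several frames per Block would require parsing frame boundaries, which this sketch does not attempt.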

There are also metadata frames, which don't really make sense in the context of Block compression, as the container compression(s) is supposed to be transparent to the Block reader. We may mention that they SHOULD NOT be used. If metadata is needed, there is BlockAdditions for that.
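A muxer or validator could check for such frames before writing a Block. The sketch below assumes the "metadata frames" are what the RFC calls skippable frames, whose magic number is any value from 0x184D2A50 to 0x184D2A5F, followed by a 4-byte little-endian size of the user data. The function names are hypothetical.

```python
import struct

SKIPPABLE_MAGIC_BASE = 0x184D2A50  # low 4 bits of the magic may take any value
SKIPPABLE_MAGIC_MASK = 0xFFFFFFF0


def is_skippable_frame(data: bytes) -> bool:
    """Return True if data starts with a Zstandard skippable frame magic."""
    if len(data) < 4:
        return False
    (magic,) = struct.unpack_from("<I", data, 0)
    return (magic & SKIPPABLE_MAGIC_MASK) == SKIPPABLE_MAGIC_BASE


def skippable_frame_length(data: bytes) -> int:
    """Total length of the leading skippable frame: 8-byte header plus user data."""
    if not is_skippable_frame(data) or len(data) < 8:
        raise ValueError("no complete skippable frame header at start of data")
    (user_data_size,) = struct.unpack_from("<I", data, 4)
    return 8 + user_data_size


# Hand-built skippable frame carrying 3 bytes of user data.
example = struct.pack("<II", 0x184D2A53, 3) + b"abc"
```

A muxer following the SHOULD NOT above could reject or drop any leading frame for which `is_skippable_frame` returns true.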

robUx4 · 2022-08-28T05:39:50Z

Looking at the latest RFC it also seems dictionaries are not a thing yet.

robUx4 · 2023-10-08T11:39:27Z

Marking this as Matroska v5, since v4 publication is close.

robUx4 added the clarifications, format addition, and spec_main (Main Matroska spec document target) labels Sep 13, 2020

robUx4 mentioned this pull request Sep 22, 2020

there is no 'minver' attribute for enum values ietf-wg-cellar/ebml-specification#390

Open

mcr added the matroska-v5 label Dec 1, 2020

robUx4 mentioned this pull request Mar 14, 2021

Leave elements added in v5 out of the Matroska RFC #463

Closed

robUx4 added the needs_to_be_rebased label Mar 14, 2021

This was referenced Apr 4, 2021

add minver/maxver attributes to enum values ietf-wg-cellar/ebml-specification#406

Open

Remove encryption elements that are not used #45

Closed

robUx4 mentioned this pull request Aug 8, 2021

Add some encryption/compression links #530

Merged

robUx4 removed the matroska-v5 label Nov 28, 2021

robUx4 mentioned this pull request Feb 13, 2022

Describe how to use consecutive ContentEncoding #580

Closed

robUx4 force-pushed the master branch from 5b4360d to 7c1fdea Compare December 29, 2022 16:11

robUx4 added the matroska-v5 label Oct 8, 2023
