Add support for zstd content compression #423

Open
wants to merge 1 commit into
base: master

Contributor

rcombs commented Sep 9, 2020

The other compression schemes supported in Matroska suffer from having to be applied on a per-packet level, with no global state. This means that they can't exploit any of the redundancy between packets, and any Huffman tables or other configuration have to be duplicated in every packet. This severely limits the attainable compression efficiency to approximately the level attained by zlib, which has led to the other algorithms receiving very little adoption by content authors, and little support from tool vendors.
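To illustrate the per-packet limitation described above, here is a minimal sketch using Python's standard `zlib` module. The packet contents are hypothetical ASS-style subtitle events invented for this example, not data from the PR. It compares compressing each packet as an independent stream with compressing all packets as one stream.

```python
import zlib

# Hypothetical ASS-style subtitle events standing in for Matroska packets
# (invented sample data, chosen to be typesetting-heavy and repetitive).
packets = [
    (
        f"{i},0,Default,,0,0,0,,"
        f"{{\\an8\\pos(640,{100 + i})\\fs48\\bord3\\c&HFFFFFF&}}Sign text {i}"
    ).encode("utf-8")
    for i in range(200)
]

raw_size = sum(len(p) for p in packets)

# Per-packet compression: every packet is an independent zlib stream,
# so no packet can reference data seen in an earlier packet.
per_packet_size = sum(len(zlib.compress(p, 9)) for p in packets)

# One stream over all packets: later packets can reference earlier ones.
whole_stream_size = len(zlib.compress(b"".join(packets), 9))

print(f"raw: {raw_size} bytes")
print(f"per-packet: {per_packet_size} bytes")
print(f"one stream: {whole_stream_size} bytes")
```

The single-stream figure is not achievable inside Matroska, since each Block has to be decodable on its own. It only shows how much inter-packet redundancy per-packet compression leaves unused.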

The zstd library, on the other hand, supports generating a dictionary from a large number of small inputs (e.g. packets), and using that dictionary to compress similar inputs more efficiently. This improves performance on subtitle inputs dramatically.

Muxing tools can either scan the input and pass its packets into zstd before muxing an individual file, or provide an auxiliary tool that generates a dictionary from the packets of one or more input files, then saves the result to a file that can be reused when muxing tracks with similar content.

Contributor

mcr commented Sep 9, 2020

It seems like the dictionary from zstd needs to either be included somewhere at the beginning of the Matroska file, or has to be provided along with it. Is there a citable open specification for zstd, other than source code?

Contributor Author

rcombs commented Sep 9, 2020

The dictionary goes in ContentCompSettings, as with the removed bytes in header-stripping. It's optional (i.e. if no ContentCompSettings element is present, the compression was done without an explicit dictionary).

Zstandard is defined by RFC8478.

Contributor

mcr commented Sep 10, 2020

> The dictionary goes in ContentCompSettings, as with the removed bytes in header-stripping. It's optional (i.e. if no ContentCompSettings element is present, the compression was done without an explicit dictionary).
>
> Zstandard is defined by RFC8478.

Cool, I didn't know that had been published.

robUx4 added the labels clarifications, format addition, spec_main (Main Matroska spec document target) Sep 13, 2020
Contributor

robUx4 commented Sep 13, 2020

This seems like a nice addition, although for now we are concentrating on adding existing features (and removing unused ones). This seems like something that should go in the next version of Matroska. In particular, parsers up to v4 do not expect that value. When a file contains new elements, parsers know they can skip them. A new value in an enum like this one affects how the block data is parsed, which could make new files unreadable by existing parsers.

To avoid this, the muxer should mark the file as only readable from Matroska version 5.

Technically there is currently no way to define a minver/maxver value for an enum value. So we need to add that (something to add to the EBML Schema format). It's not defined but you could just add minver="5" to the new enum value.

As for the compression algorithm itself, the fact that it's defined by an RFC is a big bonus (free to use). I wonder how practical it would be. It seems that you can only get a proper dictionary if you scan all your sources ahead of muxing. Otherwise you use a lowest-common-denominator dictionary for a particular codec, which is less efficient. And in that case muxers will need to create their own dictionary. It's feasible, but I wonder if there's much gain to expect over the other compression mechanisms. Compression already doesn't work well on compressed codecs (unlike header stripping), so that limits the compression to raw formats (audio, video, bitmap, text).

Contributor Author

rcombs commented Sep 13, 2020

I'd expect this to mainly be useful for subtitle formats. Some quick testing on real files indicated theoretical improvements in the range of 2.5~3x over zlib in typesetting-heavy cases.

Zlib can actually use user-supplied dictionaries as well, and in a very similar way to how zstd does. The problem is that there doesn't appear to be any decent tooling available to generate dictionaries for zlib, whereas the zstd library includes functions to generate a dictionary from a passed-in data set.
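As a rough illustration of the user-supplied dictionary mechanism in zlib, the sketch below uses the `zdict` parameter of Python's standard `zlib.compressobj` and `zlib.decompressobj`. The dictionary is hand-written and the packet is invented sample data; both are assumptions made for this example. It does not show dictionary generation, which is the tooling gap noted above.

```python
import zlib

# Hypothetical shared dictionary: byte strings expected to recur in packets.
# Written by hand for this example; a real tool would derive it from samples.
dictionary = (
    b"0,0,Default,,0,0,0,,{\\an8\\pos(640,\\fs48\\bord3\\c&HFFFFFF&}Sign text "
)


def compress_packet(packet: bytes, zdict: bytes) -> bytes:
    """Compress one packet as an independent zlib stream primed with zdict."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(packet) + comp.flush()


def decompress_packet(data: bytes, zdict: bytes) -> bytes:
    """Decompress one packet; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(data) + decomp.flush()


# Invented sample packet that shares most of its bytes with the dictionary.
packet = (
    b"7,0,Default,,0,0,0,,{\\an8\\pos(640,107)\\fs48\\bord3\\c&HFFFFFF&}Sign text 7"
)

plain = zlib.compress(packet, 9)
with_dict = compress_packet(packet, dictionary)
restored = decompress_packet(with_dict, dictionary)

print(f"raw: {len(packet)}, zlib: {len(plain)}, zlib + dictionary: {len(with_dict)}")
```

Each packet is still an independent stream, so the per-packet constraint holds, but the decoder needs the exact same dictionary bytes to decode any packet.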

Contributor

robUx4 commented Sep 13, 2020

I think lzo1x can handle dictionaries too, but I couldn't find the code/link about that.

Contributor

robUx4 commented Nov 28, 2021

Now that zstd is RFC8478 that's an extra incentive to support it.

Since we discourage the use of some compression values:

> Decoding implementations MAY support methods "1" and "2" as possible

We might as well add a new value that is also known not to work on all implementations. In v5 we could make it mandatory.

Contributor

robUx4 commented Jan 23, 2022

@rcombs could you rebase and adapt so it can be merged?

Contributor

robUx4 commented Feb 13, 2022

Looking at RFC8478, I wonder if we should add constraints on how zstd would be used. There is a Magic Number at the start that could be stripped, although it could be combined into a separate ContentEncodings. There can be more than one frame, each with a magic number. It would be nice to be able to strip all the magic numbers as well. Not all zstd libraries may be ready to parse such a stream directly, so it might be necessary to reconstruct the zstd frames. That may lose some of the speed advantage because of the extra memory copies.
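A minimal sketch of the magic-number idea, assuming a payload that holds exactly one Zstandard frame. The function names are hypothetical; only the magic value (0xFD2FB528, stored little-endian) comes from RFC 8478. Handling several frames would require parsing each frame to find its boundary, which is not shown here.

```python
import struct

# Zstandard frame magic number from RFC 8478, stored little-endian,
# i.e. the bytes 28 B5 2F FD.
ZSTD_MAGIC = struct.pack("<I", 0xFD2FB528)


def strip_magic(frame: bytes) -> bytes:
    """Muxer side: drop the leading magic number of a single frame."""
    if not frame.startswith(ZSTD_MAGIC):
        raise ValueError("payload does not start with a Zstandard magic number")
    return frame[len(ZSTD_MAGIC):]


def restore_magic(stripped: bytes) -> bytes:
    """Demuxer side: rebuild the frame before handing it to a zstd decoder.

    The concatenation allocates a new buffer, which is the extra memory
    copy mentioned in the discussion.
    """
    return ZSTD_MAGIC + stripped


# Stand-in bytes for the rest of a frame; not a decodable zstd frame.
fake_frame = ZSTD_MAGIC + b"\x00\x01\x02\x03"
assert restore_magic(strip_magic(fake_frame)) == fake_frame
```

This saves four bytes per frame at the cost of one buffer reconstruction per Block on decode.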

There are also metadata frames, which don't really make sense in the context of Block compression, as the container compression(s) are supposed to be transparent to the Block reader. We could mention that they SHOULD NOT be used. If metadata is needed, there is BlockAdditions for that.
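If such a constraint were adopted, a muxer or validator could check for these frames. The sketch below assumes the metadata frames are the skippable frames of RFC 8478: a little-endian magic number in the range 0x184D2A50 to 0x184D2A5F, a 4-byte little-endian size, then user data. The function name is hypothetical, and only a frame at the very start of the payload is detected.

```python
import struct

# Skippable-frame magic number range from RFC 8478 (0x184D2A5?).
SKIPPABLE_MAGIC_MIN = 0x184D2A50
SKIPPABLE_MAGIC_MAX = 0x184D2A5F
SKIPPABLE_HEADER_SIZE = 8  # 4-byte magic + 4-byte Frame_Size


def leading_skippable_frame_length(data: bytes) -> int:
    """Return the total length of a skippable frame at the start of data.

    Returns 0 when data does not begin with a skippable frame. The declared
    size is not checked against len(data); a caller should do that before
    slicing.
    """
    if len(data) < SKIPPABLE_HEADER_SIZE:
        return 0
    magic, frame_size = struct.unpack_from("<II", data, 0)
    if not SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        return 0
    return SKIPPABLE_HEADER_SIZE + frame_size


# A skippable frame carrying three bytes of user data.
skippable = struct.pack("<II", 0x184D2A50, 3) + b"abc"
print(leading_skippable_frame_length(skippable))
```

A validator could reject or strip a payload when this returns a non-zero length. Finding skippable frames that follow a regular frame would need a full frame parser.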

Contributor

robUx4 commented Aug 28, 2022

Looking at the latest RFC it also seems dictionaries are not a thing yet.

Contributor

robUx4 commented Oct 8, 2023

Marking as Matroska v5 as v4 publication is close.

3 participants