WIP feat(patterns): pattern-based compression take2 #1584

erights · 2023-05-10T06:37:07Z

Staged on #2248

closes: #2112
refs: #1564 Agoric/agoric-sdk#6432

Description

Adds two new exports to @endo/patterns:

mustCompress(
  specimen: Passable, 
  pattern: Pattern, 
  label?: string|number
) => Passable

and its "inverse"

mustDecompress(
  compressed: Passable,
  pattern: Pattern,
  label?: string|number
) => Passable

(From Agoric/agoric-sdk#6432 (comment) ):

For example, without compression, the Zoe proposal

    {
      want: {
        Winnings: {
          brand: moolaBrand,
          value: makeCopyBagFromElements([
            { foo: 'a' },
            { foo: 'b' },
            { foo: 'c' },
          ]),
        },
      },
      give: { Bid: { brand, value: 37n } },
      exit: { afterDeadline: { deadline: 11n, timer } },
    },

is stored with a smallcaps body of

'#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}'

But it compresses with the proposalShape

    harden({
      want: {
        Winnings: {
          brand: moolaBrand,
          value: M.bagOf(harden({ foo: M.string() }), 1n),
        },
      },
      give: { Bid: { brand, value: M.nat() } },
      exit: { afterDeadline: { deadline: M.gte(10n), timer } },
    })

to

[[['c'], ['b'], ['a']], 37n, 11n]

whose smallcaps body is

'#[[["c"],["b"],["a"]],"+37","+11"]'

which is 12% as long.
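
The 12% figure can be checked from the two smallcaps bodies quoted above. A minimal sketch that compares string lengths only (it says nothing about storage overhead or timing):

```javascript
// The two smallcaps bodies, copied verbatim from the example above.
const uncompressedBody =
  '#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}';
const compressedBody = '#[[["c"],["b"],["a"]],"+37","+11"]';

// Compressed length as a whole percentage of the uncompressed length.
const percent = Math.round(
  (100 * compressedBody.length) / uncompressedBody.length,
);
console.log(
  `${compressedBody.length} of ${uncompressedBody.length} chars: ${percent}%`,
);
```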

It would take much more work, but if we were able to use matching interface guards on the sending and receiving sides, we'd get similar savings for messages. Agoric/agoric-sdk#6355 may help get there. But note the difficulties explained in "Upgrade Considerations" below.

mustCompress is analogous to mustMatch, which as a reminder is

mustMatch(
  specimen: Passable,
  pattern: Pattern,
  label?: string|number
) => void

The following equivalences must hold

For all s,p,l1,l2 mustMatch(s,p,l1?) must succeed iff mustCompress(s,p,l2?) succeeds. When they succeed, the label does not matter.
Both signal failure by throwing an error, whose diagnostic may use the label to be more informative. Thus, one throws iff the other throws. The diagnostics are not necessarily the same.
mustMatch(s,p,l1?) and therefore mustCompress(s,p,l2?) succeeds iff compress(s,p) === true.
For all s,p,l,c,s2 mustCompress(s,p,l?) === c iff mustDecompress(c,p,l?) === s2, where s and s2 have the same distributed object semantics: compareRank(s, s2) === 0, isKey(s) === isKey(s2), and isKey(s) => keyEQ(s,s2).
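
The equivalences above lend themselves to a property-style check. The following is only a sketch: `checkEquivalences`, `toyImpl`, and `sameSemantics` are hypothetical names, and the stand-in implementation (where a "pattern" is just a `typeof` string) has nothing to do with the real @endo/patterns code. It shows the shape of the check, not the PR's tests.

```javascript
// Returns true if calling thunk throws, false otherwise.
const throws = thunk => {
  try {
    thunk();
    return false;
  } catch (_err) {
    return true;
  }
};

// Checks the stated equivalences for one specimen/pattern pair.
// `impl` supplies the three functions plus `sameSemantics`, a caller-chosen
// equality standing in for "same distributed object semantics".
function checkEquivalences(impl, specimen, pattern) {
  const { mustMatch, mustCompress, mustDecompress, sameSemantics } = impl;
  const matchThrows = throws(() => mustMatch(specimen, pattern));
  const compressThrows = throws(() => mustCompress(specimen, pattern));
  // One must throw iff the other throws.
  if (matchThrows !== compressThrows) return false;
  // Nothing to round-trip when the specimen does not match.
  if (compressThrows) return true;
  const compressed = mustCompress(specimen, pattern);
  return sameSemantics(specimen, mustDecompress(compressed, pattern));
}

// Trivial stand-in implementation (an assumption for illustration only).
const toyImpl = {
  mustMatch: (s, p) => {
    if (typeof s !== p) throw new Error(`expected ${p}`);
  },
  mustCompress: (s, p) => {
    if (typeof s !== p) throw new Error(`cannot compress: expected ${p}`);
    return [s];
  },
  mustDecompress: (c, _p) => c[0],
  sameSemantics: Object.is,
};

const okOnMatch = checkEquivalences(toyImpl, 'hello', 'string');
const okOnMismatch = checkEquivalences(toyImpl, 42, 'string'); // both throw
console.log(okOnMatch, okOnMismatch);
```

A fuzzing harness could call such a checker on generated specimen/pattern pairs.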

The point is that c is typically smaller than s, though in some cases it may be larger. The space savings should typically be similar to those from schema-based encodings like protobuf or Cap'n Proto. The pattern is analogous to the schema. Anything that must be present in every specimen that matches a given pattern can be omitted from the compressed form, since those parts can be recovered from the pattern on decompression. Unlike schema-based compression, this can include dynamic elements like brand identity, potentially resulting in greater savings and tighter error checking.
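
To make the idea concrete, here is a toy sketch in plain JavaScript. It is not the algorithm in this PR: the matchers `str` and `nat` are hypothetical stand-ins for M.string() and M.nat(), brands are plain strings, only records and primitives are handled, and the compressed form is a flat array rather than the nested arrays shown earlier. It only illustrates that literal parts of the pattern are dropped on compression and restored from the pattern on decompression.

```javascript
// Toy matchers: hypothetical stand-ins, not the @endo/patterns M namespace.
const hole = check => ({ isHole: true, check });
const str = hole(x => typeof x === 'string');
const nat = hole(x => typeof x === 'bigint' && x >= 0n);

const isHole = p => typeof p === 'object' && p !== null && p.isHole === true;
const isRecord = x => typeof x === 'object' && x !== null && !isHole(x);
const sortedKeys = r => Object.keys(r).sort();

// Collect the values found at matcher positions, depth-first in sorted key
// order. Throws if the specimen does not match the pattern.
function toyCompress(specimen, pattern, out = []) {
  if (isHole(pattern)) {
    if (!pattern.check(specimen)) throw new Error('matcher rejected specimen');
    out.push(specimen);
  } else if (isRecord(pattern)) {
    if (!isRecord(specimen)) throw new Error('expected a record');
    const keys = sortedKeys(pattern);
    if (JSON.stringify(sortedKeys(specimen)) !== JSON.stringify(keys)) {
      throw new Error('property names differ');
    }
    for (const key of keys) toyCompress(specimen[key], pattern[key], out);
  } else if (specimen !== pattern) {
    throw new Error('literal part differs from pattern');
  }
  return out;
}

// Rebuild the specimen: literal parts come from the pattern, the rest from
// the compressed array, consumed in the same order compression produced it.
function toyDecompress(compressed, pattern) {
  let next = 0;
  const rebuild = p => {
    if (isHole(p)) return compressed[next++];
    if (isRecord(p)) {
      return Object.fromEntries(sortedKeys(p).map(k => [k, rebuild(p[k])]));
    }
    return p;
  };
  return rebuild(pattern);
}

const pattern = {
  give: { Bid: { brand: 'simoleans', value: nat } },
  memo: str,
};
const specimen = {
  give: { Bid: { brand: 'simoleans', value: 37n } },
  memo: 'first bid',
};
const compressed = toyCompress(specimen, pattern); // holds 37n and 'first bid'
const restored = toyDecompress(compressed, pattern);
console.log(compressed, restored);
```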

Unlike schema-based compression schemes like protobuf or Cap'n Proto, the layering here makes compression mostly independent of encoding/serialization, as shown by the above example: The compression is independent of whether the result will be encoded with smallcaps, and the smallcaps encoding is independent of whether its input was a compressed or uncompressed specimen. Or rather, mostly independent. We chose a nested-array compression because of its compact JSON representation, preserved by smallcaps.

Security Considerations

If sender and receiver can be led into compressing and decompressing with different patterns, or with different compression/decompression algorithms associated with that pattern's matchers, then compressed data might be decompressed into something arbitrarily different from what the sender meant to send. See "Upgrade Considerations" below.

Aside from that, none.

Scaling Considerations

The whole point. Compression could result in tremendously less data stored, sent, and received. Unfortunately, so far, informal measurements of the time taken to compress are not encouraging. This needs to be measured carefully, and probably needs to be improved tremendously, before this PR is ready for production use. Ideally:

encode(mustCompress(data, pattern)) typically takes both less time and less space than
mustMatch(data, pattern) && encode(data).
mustDecompress(decode(encodedCompressedData)) typically takes less time than
decode(encodedUncompressedData).

This will of course depend on which encoding scheme is used.

Documentation Considerations

Most of this PR note is worth capturing in documentation in the PR itself.

Testing Considerations

Already includes good manual tests.

Should additionally include fuzz tests, probably using fast-check.

Compatibility Considerations

A big advantage of the smallcaps encoding of an uncompressed specimen is that the result is still mostly human readable, and processable using JSON-oriented tooling like jq. The compressed form loses both of these benefits, which also calls into question whether there's any point in smallcaps encoding the compressed form rather than using an unreadable binary encoding like compactOrdered, Syrup, or CBOR.

compactOrdered is both rank equality preserving and rank order preserving. Holding the pattern constant, compactOrdered of the compressed form would still be rank equality preserving, but not rank order preserving. Thus, stores will probably continue to encode their keys using compactOrdered on the uncompressed form, forfeiting the opportunity to use keyShape for compression.

Upgrade Considerations

When the compressed form is communicated instead of the uncompressed form, the sender and receiver must agree precisely on the pattern. If a different pattern is used to decompress than was used to compress, the compressed data might silently decompress into data arbitrarily different from the original specimen. The best way to ensure agreement is to somehow send the pattern along as well, from the sender to the receiver. For small data, this may cost more space than it saves.
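
The hazard can be shown with a toy decompressor. Everything here is an assumption for illustration: `HOLE`, `toyDecompress`, `sealEnvelope`, and `openEnvelope` are hypothetical names, patterns are flat records, and brands are plain strings. The envelope is one naive reading of "send the pattern as well"; it does not address how a real pattern would be serialized or what that costs.

```javascript
// Marks the positions in a toy pattern that vary between specimens.
const HOLE = Symbol('hole');

// Fill each HOLE in the flat pattern, in sorted key order, from `compressed`.
function toyDecompress(compressed, pattern) {
  let next = 0;
  const entries = Object.keys(pattern)
    .sort()
    .map(key => [key, pattern[key] === HOLE ? compressed[next++] : pattern[key]]);
  return Object.fromEntries(entries);
}

const senderPattern = { brand: 'moola', value: HOLE };
const receiverPattern = { brand: 'simoleans', value: HOLE };
const compressed = [37];

// No error is raised, yet the receiver reconstructs a different brand.
const misread = toyDecompress(compressed, receiverPattern);
console.log(misread.brand); // 'simoleans', not the sender's 'moola'

// Naive guard: ship the pattern with the payload and decompress with it.
const sealEnvelope = (payload, pattern) => ({ pattern, payload });
const openEnvelope = envelope =>
  toyDecompress(envelope.payload, envelope.pattern);

const received = openEnvelope(sealEnvelope(compressed, senderPattern));
console.log(received.brand); // 'moola'
```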

SwingSet already stores optional patterns with some large data stores, with an error check to ensure that the data matches the pattern: keyShape, valueShape, and stateShape. Agoric/agoric-sdk#6432 modifies SwingSet to also use the valueShape and stateShape for compression.

A pattern is a tree of copy-data to be matched literally (the key-like parts), and Matchers, typically expressed in code like M.bagOf(keyShape, countShape) in the example above. The overall compression/decompression algorithms are composed from compression/decompression algorithms for each matcher kind. Not only must the sender and receiver agree exactly on the pattern, they must agree exactly on the algorithms associated with each matcher in the pattern. But we'd also like to improve these algorithms over time. Thus, this PR includes in each matcher kind definition an optional version number of the compression algorithm it uses. If omitted, that matcher does not compress. Version numbers are assigned in increasing sequence starting with 1. The algorithm associated with a given version number must never change. If a given version of Endo supports version number N of matcher M, then it should also support all version numbers prior to N, unless there is a compelling reason to retire an old one.
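
One way to picture the versioning rule is a registry keyed by matcher kind and version number. This is a sketch under assumptions, not the data structure used in this PR: the `algorithms` table, the 'toy:prefixed' matcher kind, and the helper names are all hypothetical.

```javascript
// Hypothetical registry: matcher kind -> version number -> algorithm pair.
// Once published, the entry for a given (kind, version) must never change.
const algorithms = {
  'toy:prefixed': {
    1: {
      compress: (s, prefix) => s.slice(prefix.length),
      decompress: (c, prefix) => prefix + c,
    },
    2: {
      // A later, incompatible format; version 1 stays available alongside it.
      compress: (s, prefix) => [s.slice(prefix.length)],
      decompress: (c, prefix) => prefix + c[0],
    },
  },
};

// Matcher makers pick the latest locally supported version.
function makeMatcher(kind, payload) {
  const versions = Object.keys(algorithms[kind]).map(Number);
  return { kind, payload, version: Math.max(...versions) };
}

// Look up the algorithm recorded in the matcher, so data written under
// version 1 stays readable after version 2 is added.
const algorithmFor = ({ kind, version }) => {
  const algorithm = algorithms[kind] && algorithms[kind][version];
  if (!algorithm) throw new Error(`unsupported ${kind} version ${version}`);
  return algorithm;
};

// A matcher with no version number does not compress.
const compressWith = (m, s) =>
  m.version === undefined ? s : algorithmFor(m).compress(s, m.payload);
const decompressWith = (m, c) =>
  m.version === undefined ? c : algorithmFor(m).decompress(c, m.payload);

const oldMatcher = { kind: 'toy:prefixed', payload: 'Alleged: ', version: 1 };
const newMatcher = makeMatcher('toy:prefixed', 'Alleged: ');
console.log(compressWith(oldMatcher, 'Alleged: moola')); // 'moola'
console.log(newMatcher.version); // 2
```

In this picture, a newer receiver can read data from an older sender because the older version's entry is still present. The reverse direction fails with an "unsupported" error.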

The M.something(...) matcher makers should generally produce a matcher with the latest locally supported version number. Thus, this system supports older senders sending to newer receivers. This works fine for intra-vat storage, as in Agoric/agoric-sdk#6432, since intra-vat storage communicates data only forward in time/versions. However, inter-vat communications must tolerate some version slippage in both directions, which will require the design of some kind of pattern negotiation.

~~[ ] Includes *BREAKING*: in the commit message with migration instructions for any breaking change.~~

This PR itself does not introduce any breaking changes. But PRs based on it will have more hazards of breaking changes as explained above.

Updates NEWS.md for user-facing changes.

Many of the points made in this PR note should be summarized in a NEWS.md entry.

closes: #XXXX
refs: #2248 #1584 Agoric/agoric-sdk#6432

## Description

Pure refactor. Changes only static info. Mostly more consistent and more readable use of `@import`.

One case made less readable: Remove newlines within a large `@import` directive. The reason is that `yarn lerna run build:types` chokes on those newlines. TODO minimal repro + report issue.

Extracted from other PRs #1584 #2248 which are now staged on this one. But this should be a reviewable and mergeable improvement regardless of whether we move forward on the others.

### Security Considerations

none

### Scaling Considerations

none

### Documentation Considerations

none

### Testing Considerations

none

### Compatibility Considerations

none

### Upgrade Considerations

none

- ~[ ] Includes `*BREAKING*:` in the commit message with migration instructions for any breaking change.~
- ~[ ] Updates `NEWS.md` for user-facing changes.~

erights self-assigned this May 10, 2023

erights changed the base branch from master to markm-tag-guards May 10, 2023 06:38

erights force-pushed the markm-pattern-based-compression-2 branch from 241b2d3 to f57ac4b Compare May 10, 2023 06:44

erights mentioned this pull request May 12, 2023

feat(patterns): pattern-based compression #1564

Closed

erights force-pushed the markm-tag-guards branch from 358d9fa to 9c24c19 Compare May 20, 2023 21:43

erights force-pushed the markm-pattern-based-compression-2 branch from f57ac4b to 533d62a Compare May 20, 2023 21:45

erights force-pushed the markm-tag-guards branch from 9c24c19 to 91d36e7 Compare June 6, 2023 03:20

erights force-pushed the markm-pattern-based-compression-2 branch from 533d62a to 7ce2d16 Compare June 6, 2023 03:22

erights force-pushed the markm-pattern-based-compression-2 branch from 7ce2d16 to 1025466 Compare August 8, 2023 02:23

erights changed the base branch from markm-tag-guards to markm-tag-guards-2 August 8, 2023 02:24

erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 18db466 to accc77c Compare August 8, 2023 02:36

erights mentioned this pull request Aug 8, 2023

WIP feat: use pattern-based compression Agoric/agoric-sdk#6432

Draft

erights force-pushed the markm-tag-guards-2 branch 3 times, most recently from b05871a to 2a13b3d Compare August 9, 2023 02:27

erights force-pushed the markm-pattern-based-compression-2 branch from accc77c to 2e6810f Compare August 9, 2023 02:34

erights changed the base branch from markm-tag-guards-2 to markm-type-guards August 9, 2023 02:35

erights force-pushed the markm-type-guards branch from a0170df to 505f81f Compare August 15, 2023 22:53

erights force-pushed the markm-pattern-based-compression-2 branch from 2e6810f to 99b58d6 Compare August 15, 2023 23:02

erights force-pushed the markm-type-guards branch from 505f81f to c2cd034 Compare August 21, 2023 22:48

Base automatically changed from markm-type-guards to master August 21, 2023 22:58

erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 282fd46 to b77b6f7 Compare August 28, 2023 05:22

erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from be5d3aa to 3a169ed Compare August 30, 2023 01:23

erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 7125ac7 to 061c7e6 Compare September 16, 2023 02:45

erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 5497b03 to ce825a7 Compare September 26, 2023 03:13

erights force-pushed the markm-pattern-based-compression-2 branch 6 times, most recently from 833067b to 65f26cc Compare April 29, 2024 03:01

erights changed the base branch from master to markm-prepare-for-extended-matchers April 29, 2024 03:02

erights force-pushed the markm-prepare-for-extended-matchers branch from 1943903 to 896cae3 Compare April 29, 2024 19:19

erights force-pushed the markm-pattern-based-compression-2 branch from 65f26cc to 46946b7 Compare April 29, 2024 19:25

erights mentioned this pull request Apr 29, 2024

refactor(patterns): improve pattern types. Flatten @imports #2256

Merged

erights force-pushed the markm-prepare-for-extended-matchers branch from 896cae3 to 2eafdc2 Compare April 29, 2024 20:20

erights force-pushed the markm-pattern-based-compression-2 branch from 46946b7 to 8e7925b Compare April 29, 2024 20:22

erights force-pushed the markm-prepare-for-extended-matchers branch from 2eafdc2 to e3cfbad Compare April 30, 2024 19:40

erights force-pushed the markm-pattern-based-compression-2 branch from 8e7925b to 1f6703c Compare April 30, 2024 19:41

erights force-pushed the markm-prepare-for-extended-matchers branch from e3cfbad to d25cfad Compare May 2, 2024 23:18

erights force-pushed the markm-pattern-based-compression-2 branch from 1f6703c to c50817b Compare May 2, 2024 23:19

erights force-pushed the markm-prepare-for-extended-matchers branch from d25cfad to 5a51499 Compare May 6, 2024 22:17

erights force-pushed the markm-pattern-based-compression-2 branch from c50817b to 5e470dc Compare May 6, 2024 22:18

erights force-pushed the markm-prepare-for-extended-matchers branch from 5a51499 to f5b2d72 Compare May 6, 2024 22:22

erights force-pushed the markm-pattern-based-compression-2 branch from 5e470dc to bc39e81 Compare May 6, 2024 22:23

erights force-pushed the markm-prepare-for-extended-matchers branch from f5b2d72 to dd3b3ad Compare May 7, 2024 18:50

erights force-pushed the markm-pattern-based-compression-2 branch from bc39e81 to e104e22 Compare May 7, 2024 18:51

erights force-pushed the markm-prepare-for-extended-matchers branch from dd3b3ad to aa85135 Compare May 7, 2024 21:09

erights force-pushed the markm-pattern-based-compression-2 branch from e104e22 to 35ea462 Compare May 7, 2024 21:09

erights force-pushed the markm-prepare-for-extended-matchers branch from aa85135 to b619239 Compare May 9, 2024 00:09

erights force-pushed the markm-pattern-based-compression-2 branch from 35ea462 to 737d43c Compare May 9, 2024 00:10

erights force-pushed the markm-prepare-for-extended-matchers branch from 964d1ac to 4c7ac33 Compare May 24, 2024 03:41

feat(patterns): pattern-based compression

61d0621

erights force-pushed the markm-pattern-based-compression-2 branch from 6fae12d to 61d0621 Compare May 24, 2024 03:43
