WIP feat(patterns): pattern-based compression take2 #1584

Draft
wants to merge 1 commit into
base: markm-prepare-for-extended-matchers

@erights erights commented May 10, 2023

Staged on #2248

closes: #2112
refs: #1564 Agoric/agoric-sdk#6432

Description

Adds two new exports to @endo/patterns

mustCompress(
  specimen: Passable, 
  pattern: Pattern, 
  label?: string|number
) => Passable

and its "inverse"

mustDecompress(
  compressed: Passable,
  pattern: Pattern,
  label?: string|number
) => Passable

(From Agoric/agoric-sdk#6432 (comment)):

For example without compression, the Zoe proposal

    {
      want: {
        Winnings: {
          brand: moolaBrand,
          value: makeCopyBagFromElements([
            { foo: 'a' },
            { foo: 'b' },
            { foo: 'c' },
          ]),
        },
      },
      give: { Bid: { brand, value: 37n } },
      exit: { afterDeadline: { deadline: 11n, timer } },
    },

is stored with a smallcaps body of

'#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}'

But it compresses with the proposalShape

    harden({
      want: {
        Winnings: {
          brand: moolaBrand,
          value: M.bagOf(harden({ foo: M.string() }), 1n),
        },
      },
      give: { Bid: { brand, value: M.nat() } },
      exit: { afterDeadline: { deadline: M.gte(10n), timer } },
    })

to

[[['c'], ['b'], ['a']], 37n, 11n]

whose smallcaps body is

'#[[["c"],["b"],["a"]],"+37","+11"]'

which is 12% as long.
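As a quick check of that figure, the two smallcaps bodies quoted above can be compared directly. This is only arithmetic on the strings shown in this note; it does not call any @endo API.

```javascript
// Smallcaps bodies copied verbatim from the example above.
const uncompressedBody =
  '#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}';
const compressedBody = '#[[["c"],["b"],["a"]],"+37","+11"]';

// Ratio of compressed length to uncompressed length, in characters.
const ratio = compressedBody.length / uncompressedBody.length;
console.log(`${Math.round(ratio * 100)}%`); // prints "12%"
```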


It would take much more work, but if we were able to use matching interface guards on the sending and receiving sides, we'd get similar savings for messages. Agoric/agoric-sdk#6355 may help get there. But note the difficulties explained in "Upgrade Considerations" below.

mustCompress is analogous to mustMatch, which as a reminder is

mustMatch(
  specimen: Passable,
  pattern: Pattern,
  label?: string|number
) => void

The following equivalences must hold

  • For all s,p,l1,l2 mustMatch(s,p,l1?) succeeds iff mustCompress(s,p,l2?) succeeds. When they succeed, the label does not matter.
  • Both signal failure by throwing an error whose diagnostic might use label to be more informative. Thus, one throws iff the other throws. The diagnostics are not necessarily the same.
  • mustMatch(s,p,l1?) and therefore mustCompress(s,p,l2?) succeeds iff matches(s,p) === true.
  • For all s,p,l,c,s2 mustCompress(s,p,l?) === c iff mustDecompress(c,p,l?) === s2 where s and s2 have the same distributed object semantics: compareRank(s, s2) === 0, isKey(s) === isKey(s2), isKey(s) => keyEQ(s,s2).

The point is that c is typically smaller than s, though in some cases it may be larger. The space savings should typically be similar to those from schema-based encodings like protobuf or Cap'n Proto. The pattern is analogous to the schema. Anything that must be in all specimens that match a given pattern can be omitted from the compressed form, since those parts can be recovered from the pattern on decompression. Unlike schema-based compression, this can include dynamic elements like brand identity, potentially resulting in greater savings and tighter error checking.
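The idea of omitting whatever the pattern already determines can be illustrated with a toy, assuming a much simpler pattern language than the real one. Everything here (hole, toyCompress, toyDecompress) is hypothetical and is not the @endo/patterns API. The toy handles only records of primitives and emits a flat array in sorted-property order, whereas the PR uses a nested-array form whose ordering may differ.

```javascript
// Toy stand-ins, NOT the @endo/patterns API. A "hole" marks a part of the
// pattern that varies between specimens; everything else matches literally.
const hole = type => ({ isHole: true, type });
const isHole = p => typeof p === 'object' && p !== null && p.isHole === true;

// Collect, in sorted-key order, only the parts of the specimen that the
// pattern does not already determine. Throws if the specimen does not match.
const toyCompress = (specimen, pattern, out = []) => {
  if (isHole(pattern)) {
    if (typeof specimen !== pattern.type) {
      throw Error(`expected ${pattern.type}, got ${typeof specimen}`);
    }
    out.push(specimen);
  } else if (typeof pattern === 'object' && pattern !== null) {
    if (typeof specimen !== 'object' || specimen === null) {
      throw Error('expected a record');
    }
    const keys = Object.keys(pattern).sort();
    if (keys.join() !== Object.keys(specimen).sort().join()) {
      throw Error('property names differ from the pattern');
    }
    for (const key of keys) {
      toyCompress(specimen[key], pattern[key], out);
    }
  } else if (specimen !== pattern) {
    throw Error(`literal mismatch: ${String(specimen)}`);
  }
  return out;
};

// Rebuild the specimen by walking the pattern in the same order and
// filling each hole with the next compressed value.
const toyDecompress = (compressed, pattern) => {
  let next = 0;
  const fill = p => {
    if (isHole(p)) {
      const value = compressed[next];
      next += 1;
      return value;
    }
    if (typeof p === 'object' && p !== null) {
      const result = {};
      for (const key of Object.keys(p).sort()) {
        result[key] = fill(p[key]);
      }
      return result;
    }
    return p; // literal: recovered from the pattern, never transmitted
  };
  return fill(pattern);
};

const pattern = {
  give: { Bid: { brand: 'simoleans', value: hole('bigint') } },
  exit: { deadline: hole('bigint') },
};
const specimen = {
  give: { Bid: { brand: 'simoleans', value: 37n } },
  exit: { deadline: 11n },
};
const compressed = toyCompress(specimen, pattern); // [11n, 37n]
const restored = toyDecompress(compressed, pattern);
console.log(compressed, restored.give.Bid.brand); // brand comes from the pattern
```

The same walk both validates and compresses, which is why in this toy a specimen compresses exactly when it matches.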

Unlike schema-based compression schemes like protobuf or Cap'n Proto, the layering here makes compression mostly independent of encoding/serialization, as shown by the above example: the compression is independent of whether the result will be encoded with smallcaps, and the smallcaps encoding is independent of whether its input was a compressed or uncompressed specimen. Or rather, mostly independent. We chose a nested-array compression because of its compact JSON representation, preserved by smallcaps.

Security Considerations

If sender and receiver can be led into compressing and decompressing with different patterns, or with different compression/decompression algorithms associated with that pattern's matchers, then compressed data might be decompressed into something arbitrarily different from what the sender meant to send. See "Upgrade Considerations" below.

Aside from that, none.

Scaling Considerations

The whole point. Compression could result in tremendously less data stored, sent, and received. Unfortunately, so far, informal measurements of the time taken to compress are not encouraging. This needs to be measured carefully, and probably needs to be improved tremendously, before this PR is ready for production use. Ideally:

  • encode(mustCompress(data, pattern)) typically takes both less time and less space than
    mustMatch(data, pattern) && encode(data).
  • mustDecompress(decode(encodedCompressedData)) typically takes less time than
    decode(encodedUncompressedData).

This will depend, of course, on which encoding scheme is used.
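A careful measurement of the two paths compared above might start from a harness like the following sketch. The names toyMustMatch and toyMustCompress are hypothetical stand-ins for mustMatch and mustCompress, JSON.stringify stands in for the encoder, and the numbers it prints say nothing about the real implementation.

```javascript
// Average wall-clock milliseconds per call of fn over many iterations.
const timeIt = (fn, iterations = 10000) => {
  const start = performance.now();
  for (let i = 0; i < iterations; i += 1) {
    fn();
  }
  return (performance.now() - start) / iterations;
};

// Hypothetical stand-ins for mustMatch, mustCompress, and encode.
const data = { kind: 'bid', brand: 'moola', value: 37 };
const toyMustMatch = d => {
  if (d.kind !== 'bid' || d.brand !== 'moola' || typeof d.value !== 'number') {
    throw Error('no match');
  }
};
const toyMustCompress = d => {
  toyMustMatch(d);
  return [d.value]; // only the part the pattern leaves open
};
const encode = JSON.stringify;

// The two paths to compare: validate then encode, versus compress then encode.
const plain = () => {
  toyMustMatch(data);
  return encode(data);
};
const packed = () => encode(toyMustCompress(data));

console.log('chars:', plain().length, 'vs', packed().length);
console.log('ms/op:', timeIt(plain), 'vs', timeIt(packed));
```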

Documentation Considerations

  • Most of this PR note is worth capturing in documentation in the PR itself

Testing Considerations

Already includes good manual tests.

  • should additionally do fuzzing tests, probably using fast-check.

Compatibility Considerations

A big advantage of the smallcaps encoding of an uncompressed specimen is that the result is still mostly human readable, and processable using JSON-oriented tooling like jq. The compressed form loses both of these benefits, which also calls into question whether there's any point in smallcaps encoding the compressed form rather than using an unreadable binary encoding like compactOrdered, Syrup, or CBOR.

compactOrdered is both rank equality preserving and rank order preserving. Holding the pattern constant, compactOrdered of the compressed form would still be rank equality preserving, but not rank order preserving. Thus, stores will probably continue to encode their keys using compactOrdered on the uncompressed form, forfeiting the opportunity to use keyShape for compression.

Upgrade Considerations

When the compressed form is communicated instead of the uncompressed form, the sender and receiver must agree precisely on the pattern. If a different pattern is used to decompress than was used to compress, the compressed data might silently decompress into data arbitrarily different from the original specimen. The best way to ensure agreement is to somehow send the pattern as well from the sender to the receiver. For small data, this may cost more space than it saves.
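The silent-corruption hazard can be seen even in a toy, assuming flat record patterns in which a hypothetical HOLE marker stands for a matcher. These functions are illustrations only, not the PR's algorithm, and the toy compressor skips validation for brevity.

```javascript
// Hypothetical marker for "this field varies"; not part of @endo/patterns.
const HOLE = Symbol('hole');

// Names of the fields the pattern leaves open, in sorted order.
const holesOf = pattern =>
  Object.keys(pattern)
    .sort()
    .filter(key => pattern[key] === HOLE);

// Keep only the values at the holes (no validation in this sketch).
const toyCompress = (specimen, pattern) =>
  holesOf(pattern).map(key => specimen[key]);

// Start from the pattern's literals and fill the holes positionally.
const toyDecompress = (compressed, pattern) => {
  const result = { ...pattern };
  holesOf(pattern).forEach((key, i) => {
    result[key] = compressed[i];
  });
  return result;
};

const senderPattern = { brand: 'moola', price: HOLE, quantity: HOLE };
const receiverPattern = { brand: 'simoleans', quantity: HOLE, total: HOLE };

const sent = toyCompress(
  { brand: 'moola', price: 5, quantity: 100 },
  senderPattern,
);
const received = toyDecompress(sent, receiverPattern);

// No error is thrown, yet the brand changed and the price now reads as a quantity.
console.log(received);
```

Nothing in the compressed array itself records which pattern produced it, so the receiver has no local way to notice the disagreement.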

SwingSet already stores optional patterns with some large data stores, with an error check to ensure that the data matches the pattern: keyShape, valueShape, and stateShape. Agoric/agoric-sdk#6432 modifies SwingSet to also use the valueShape and stateShape for compression.

A pattern is a tree of copy-data to be matched literally (the key-like parts) and Matchers, typically expressed in code like M.bagOf(keyShape, countShape) in the example above. The overall compression/decompression algorithms are composed from compression/decompression algorithms for each matcher kind. Not only must the sender and receiver agree exactly on the pattern, they must agree exactly on the algorithms associated with each matcher in the pattern. But we'd also like to improve these over time. Thus, this PR includes in each matcher kind definition an optional version number of the compression algorithm it uses. If omitted, that matcher does not compress. Version numbers are assigned in increasing sequence starting with 1. The algorithm associated with a given sequence number must never change. If a given version of endo supports matcher M at sequence number N, then it should also support all sequence numbers prior to N, unless there is a compelling reason to retire an old one.
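One way to picture these versioning rules is a registry keyed by matcher kind and sequence number. The sketch below is hypothetical throughout (registerAlgorithm, latestVersion, getAlgorithm, and the 'toy:boolean' kind are invented for illustration) and does not reflect how the PR stores versions. It does not model retiring old versions.

```javascript
// Hypothetical registry: matcher kind -> (sequence number -> algorithm).
const algorithms = new Map();

const registerAlgorithm = (kind, version, algorithm) => {
  if (!algorithms.has(kind)) {
    algorithms.set(kind, new Map());
  }
  const versions = algorithms.get(kind);
  if (versions.has(version)) {
    // The algorithm for a given sequence number must never change.
    throw Error(`${kind} v${version} is already defined`);
  }
  if (version !== versions.size + 1) {
    // Sequence numbers increase from 1 with no gaps.
    throw Error(`${kind} expected v${versions.size + 1}, got v${version}`);
  }
  versions.set(version, Object.freeze(algorithm));
};

// Undefined means the matcher kind has no algorithm and does not compress.
const latestVersion = kind =>
  algorithms.has(kind) ? algorithms.get(kind).size : undefined;

const getAlgorithm = (kind, version) => {
  const algorithm = algorithms.get(kind)?.get(version);
  if (!algorithm) {
    throw Error(`unsupported compression ${kind} v${version}`);
  }
  return algorithm;
};

// v1 stores the boolean as is; v2 stores it as 0 or 1.
registerAlgorithm('toy:boolean', 1, {
  compress: b => b,
  decompress: c => c,
});
registerAlgorithm('toy:boolean', 2, {
  compress: b => (b ? 1 : 0),
  decompress: c => c === 1,
});

// A newer receiver still decodes data written by an older sender at v1.
console.log(getAlgorithm('toy:boolean', 1).decompress(true)); // true
console.log(latestVersion('toy:boolean')); // 2
```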

The M.something(...) matcher makers should generally produce a matcher with the latest locally supported sequence number. Thus, this system supports older senders sending to newer receivers. This works fine for intra-vat storage, as in Agoric/agoric-sdk#6432, since intra-vat storage communicates data only forward in time/versions. However, inter-vat communications must tolerate some version slippage in both directions, which will require the design of some kind of pattern negotiation.

  • [ ] Includes *BREAKING*: in the commit message with migration instructions for any breaking change.

This PR itself does not introduce any breaking changes. But PRs based on it will have more hazards of breaking changes as explained above.

  • Updates NEWS.md for user-facing changes.

Many of the points made in this PR note should be summarized in a NEWS.md entry.

@erights erights self-assigned this May 10, 2023
@erights erights changed the base branch from master to markm-tag-guards May 10, 2023 06:38
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 241b2d3 to f57ac4b Compare May 10, 2023 06:44
@erights erights force-pushed the markm-pattern-based-compression-2 branch from f57ac4b to 533d62a Compare May 20, 2023 21:45
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 533d62a to 7ce2d16 Compare June 6, 2023 03:22
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 7ce2d16 to 1025466 Compare August 8, 2023 02:23
@erights erights changed the base branch from markm-tag-guards to markm-tag-guards-2 August 8, 2023 02:24
@erights erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 18db466 to accc77c Compare August 8, 2023 02:36
@erights erights force-pushed the markm-tag-guards-2 branch 3 times, most recently from b05871a to 2a13b3d Compare August 9, 2023 02:27
@erights erights force-pushed the markm-pattern-based-compression-2 branch from accc77c to 2e6810f Compare August 9, 2023 02:34
@erights erights changed the base branch from markm-tag-guards-2 to markm-type-guards August 9, 2023 02:35
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 2e6810f to 99b58d6 Compare August 15, 2023 23:02
Base automatically changed from markm-type-guards to master August 21, 2023 22:58
@erights erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 282fd46 to b77b6f7 Compare August 28, 2023 05:22
@erights erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from be5d3aa to 3a169ed Compare August 30, 2023 01:23
@erights erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 7125ac7 to 061c7e6 Compare September 16, 2023 02:45
@erights erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 5497b03 to ce825a7 Compare September 26, 2023 03:13
@erights erights force-pushed the markm-pattern-based-compression-2 branch 6 times, most recently from 833067b to 65f26cc Compare April 29, 2024 03:01
@erights erights changed the base branch from master to markm-prepare-for-extended-matchers April 29, 2024 03:02
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 1943903 to 896cae3 Compare April 29, 2024 19:19
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 65f26cc to 46946b7 Compare April 29, 2024 19:25
erights added a commit that referenced this pull request Apr 29, 2024
closes: #XXXX
refs: #2248 #1584 Agoric/agoric-sdk#6432

## Description

Pure refactor. Changes only static info. Mostly more consistent and more
readable use of `@import`.

One case made less readable: Remove newlines within a large `@import`
directive. The reason is that
`yarn lerna run build:types` chokes on those newlines. TODO minimal
repro + report issue.

Extracted from other PRs #1584 #2248 which are now staged on this one.
But this should be a reviewable and mergeable improvement regardless of
whether we move forward on the others.

### Security Considerations

none
### Scaling Considerations

none
### Documentation Considerations

none
### Testing Considerations

none
### Compatibility Considerations

none
### Upgrade Considerations
none

- ~[ ] Includes `*BREAKING*:` in the commit message with migration
instructions for any breaking change.~
- ~[ ] Updates `NEWS.md` for user-facing changes.~
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 896cae3 to 2eafdc2 Compare April 29, 2024 20:20
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 46946b7 to 8e7925b Compare April 29, 2024 20:22
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 2eafdc2 to e3cfbad Compare April 30, 2024 19:40
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 8e7925b to 1f6703c Compare April 30, 2024 19:41
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from e3cfbad to d25cfad Compare May 2, 2024 23:18
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 1f6703c to c50817b Compare May 2, 2024 23:19
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from d25cfad to 5a51499 Compare May 6, 2024 22:17
@erights erights force-pushed the markm-pattern-based-compression-2 branch from c50817b to 5e470dc Compare May 6, 2024 22:18
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 5a51499 to f5b2d72 Compare May 6, 2024 22:22
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 5e470dc to bc39e81 Compare May 6, 2024 22:23
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from f5b2d72 to dd3b3ad Compare May 7, 2024 18:50
@erights erights force-pushed the markm-pattern-based-compression-2 branch from bc39e81 to e104e22 Compare May 7, 2024 18:51
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from dd3b3ad to aa85135 Compare May 7, 2024 21:09
@erights erights force-pushed the markm-pattern-based-compression-2 branch from e104e22 to 35ea462 Compare May 7, 2024 21:09
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from aa85135 to b619239 Compare May 9, 2024 00:09
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 35ea462 to 737d43c Compare May 9, 2024 00:10
@erights erights force-pushed the markm-prepare-for-extended-matchers branch from 964d1ac to 4c7ac33 Compare May 24, 2024 03:41
@erights erights force-pushed the markm-pattern-based-compression-2 branch from 6fae12d to 61d0621 Compare May 24, 2024 03:43
Development

Successfully merging this pull request may close these issues.

Need schema-like compression to avoid storing and transmitting redundant data.