
Custom hash for intoto payload #1018

Closed
laurentsimon opened this issue May 14, 2024 · 14 comments
Labels
enhancement New feature or request

Comments

@laurentsimon
Contributor

laurentsimon commented May 14, 2024

The current implementation hardcodes a list of hashes: https://github.com/sigstore/sigstore-python/blob/main/sigstore/dsse.py#L35. in-toto allows any hash, so I think we need to support use cases where users want to use other hashes.

Would removing the list of hardcoded values be enough?

I'm happy to work on this.

/cc @mihaimaruseac

laurentsimon added the "enhancement" (New feature or request) label on May 14, 2024
@woodruffw
Member

Would removing the list of hardcoded values be enough?

My read of the DigestSet spec is that it contains a lot of things that don't belong in modern cryptography, like GOST, SHA1, and MD5. I understand their rationale for doing this in the context of a generic file-matching tool, but exposing anything (roughly) weaker than SHA-2/256 in the context of digital signatures is IMO risky.

So I don't want to remove the list outright, but I would be fine with adding to it, so long as the amendments are strong general-purpose cryptographic digests 🙂. Do you have specific additions that you want to make?

@laurentsimon
Contributor Author

laurentsimon commented May 14, 2024

Would removing the list of hardcoded values be enough?

My read of the DigestSet spec is that it contains a lot of things that don't belong in modern cryptography, like GOST, SHA1, and MD5. I understand their rationale for doing this in the context of a generic file-matching tool, but exposing anything (roughly) weaker than SHA-2/256 in the context of digital signatures is IMO risky.

I definitely understand that. In fact they also allow non-cryptographic hashes, so long as the "digest" is "immutable".

So I don't want to remove the list outright, but I would be fine with adding to it, so long as the amendments are strong general-purpose cryptographic digests 🙂. Do you have specific additions that you want to make?

Our use case is as follows: for model signing, we speed up hash calculation by splitting large files into multiple "shards" of X bytes each. We then compute the hash of each shard in parallel and aggregate them via sha256(h1 || h2 || ...). So our hash is parameterized by the shard size (and the version of the construction). We're currently calling it sha256p$<version>$<shard-size>.
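
For readers following along, here's a rough sketch of that construction; the function name sha256p and the aggregation details are illustrative, not the actual model-signing code:

import hashlib
from concurrent.futures import ThreadPoolExecutor

def sha256p(data: bytes, shard_size: int) -> str:
    # Split the input into fixed-size shards.
    shards = [data[i:i + shard_size] for i in range(0, len(data), shard_size)]
    # Hash each shard in parallel (hashlib releases the GIL for large buffers,
    # so threads actually run concurrently here).
    with ThreadPoolExecutor() as pool:
        digests = list(pool.map(lambda s: hashlib.sha256(s).digest(), shards))
    # Aggregate as sha256(h1 || h2 || ...).
    return hashlib.sha256(b"".join(digests)).hexdigest()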

What would be a good way forward to support this "family" of hashes? I could see a "subtle" API like in Tink, i.e. an API for people who know what they are doing (this way you need not be involved if someone wants to use a custom hash, e.g. blake, a serialization construction using blake, etc.). Or we simply add the sha256p$ prefix to the allow list. Wdyt?

@woodruffw
Member

What would be a good way forward to support this "family" of hashes?

I think this unfortunately ends up being somewhat complex 😅 -- sigstore-python is trying to closely align with other Sigstore clients in terms of the signature schemes/algorithmic suites we support, so I think we'd need sha256p$* to go through a standards process (not anything heavyweight like IETF, but inclusion in the protobuf-specs at a minimum) before we could begin adding it here.

But to take a step back: one of the "advantages" to DSSE is that the payload is flexible. So rather than encoding a new hash at the in-toto layer, maybe you could communicate this at the payload layer?

In other words: take a final image by doing dgst := sha256(sha256p$<version>$<shard-size>(input)), and commit to dgst in the in-toto subject's digest. Then, add a new field to the subject encoding the version and shard-size. With that, you'd end up with something like this:

{
  "_type": "https://in-toto.io/Statement/v1",
  "subject": [
    {
      "name": "blahblahblahmodel",
      "digest": {"sha256": "dgst"},
      "x-sha256p-params": {
        "version": 1,
        "shard-size": 4096
      }
    }
  ],
  "predicateType": "blahblahblah",
  "predicate": { ... }
}
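
For what it's worth, a small sketch of how the outer digest and subject could be built (this reuses the hypothetical sha256p from the earlier sketch, and the encoding of the inner digest before the outer sha256 is an assumption, not something the spec above pins down):

import hashlib
from pathlib import Path

def outer_digest(data: bytes, version: int, shard_size: int) -> str:
    # Inner digest from the hypothetical sha256p sketched earlier; the version
    # parameter would select the construction variant (unused in this sketch).
    inner = sha256p(data, shard_size)
    # Assumption: the outer sha256 is taken over the raw inner digest bytes.
    return hashlib.sha256(bytes.fromhex(inner)).hexdigest()

model_bytes = Path("model.bin").read_bytes()  # hypothetical input file

subject = {
    "name": "blahblahblahmodel",
    "digest": {"sha256": outer_digest(model_bytes, version=1, shard_size=4096)},
    "x-sha256p-params": {"version": 1, "shard-size": 4096},
}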

From there, your verification procedure would be:

  1. Perform DSSE verification as normal via the sigstore-python APIs, getting back the raw in-toto JSON
  2. Decode the JSON, and, for each subject:
    1. Check for the presence of x-sha256p-params
    2. Use those parameters to compute expected := sha256(sha256p$<version>$<shard-size>(input))
    3. Compare dgst == expected
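
Roughly, steps 2.1-2.3 could look like this (a sketch only; it assumes step 1 has already returned the raw statement bytes, and reuses the hypothetical sha256p from above):

import hashlib
import json

def check_sharded_subjects(statement_bytes: bytes, artifact: bytes) -> bool:
    # statement_bytes is the verified in-toto payload from step 1.
    statement = json.loads(statement_bytes)
    for subject in statement["subject"]:
        params = subject.get("x-sha256p-params")
        if params is None:
            continue  # no sharded digest on this subject; policy decides what to do
        # Recompute sha256(sha256p(input)) with the committed parameters
        # (version handling omitted for brevity).
        inner = sha256p(artifact, params["shard-size"])
        expected = hashlib.sha256(bytes.fromhex(inner)).hexdigest()
        if subject["digest"]["sha256"] != expected:
            return False
    return True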

I believe this would accommodate your use case, without requiring a new algorithmic selection at the Sigstore layer. The only "gotcha" would be that this won't work with the current sigstore-python CLI, just the APIs, and it'll be pretty ecosystem-specific.

(Longer term however, we could possibly make it compatible with the sigstore-python CLI by allowing sigstore verify ... sha256:dgst as an alternative to a file input.)

@laurentsimon
Contributor Author

Yes, this solution would also work. I don't have a strong preference about what it ends up looking like, so long as the parameters and hash function are available. What you describe works, but feels like a hack to disguise the hash as sha256 while actually being something else. Also, it does not make it any safer, ie if the construction using sha256 is wrong, you end up with something weaker than sha256.

I'll let @haydentherapper comment on the Sigstore side and the protobuf inclusion. I don't think this is needed, but I'll let the maintainers decide.

@woodruffw
Member

What you describe works, but feels like a hack to disguise the hash as sha256 while actually being something else. Also, it does not make it any safer, ie if the construction using sha256 is wrong, you end up with something weaker than sha256.

Yep, it's certainly a hack. I also don't know if it's a good idea 🙂 -- it's just what came to mind as a solution that doesn't require Sigstore to be aware of the underlying construction.

(Unless I'm mistaken, the only way this can be weaker than SHA256 itself is if SHA256p's construction has a weakness, or an attacker could select weak parameters. But this would be true even if it wasn't wrapped in SHA256 at the end, I think?)

@laurentsimon
Contributor Author

What you describe works, but feels like a hack to disguise the hash as sha256 while actually being something else. Also, it does not make it any safer, ie if the construction using sha256 is wrong, you end up with something weaker than sha256.

Yep, it's certainly a hack. I also don't know if it's a good idea 🙂 -- it's just what came to mind as a solution that doesn't require Sigstore to be aware of the underlying construction.

Agreed with all this.

(Unless I'm mistaken, the only way this can be weaker than SHA256 itself is if SHA256p's construction has a weakness, or an attacker could select weak parameters. But this would be true even if it wasn't wrapped in SHA256 at the end, I think?)

Sorry, I just meant that in general, one way it could be weaker is if the construction is wrong. I think sha256p works as a construction, but I'm sure someone could come up with one that is not.

@woodruffw
Member

Sorry, I just meant that in general, one way it could be weaker is if the construction is wrong. I think sha256p works as a construction, but I'm sure someone could come up with one that is not.

Ah, gotcha! Yeah, that's a great point -- this is sort of opening the gates/setting the precedent for people building hidden constructions of unknown quality. That's not great...

As another random thought: there are strong, modern hash constructions that are embarrassingly parallel, like BLAKE3. Could you use one of these? I think it'd be much easier to make a strong argument for BLAKE3's inclusion as a supported digest, especially since it has a wealth of public implementations.
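
For a quick sense of what that looks like in Python (assuming the third-party blake3 package, whose API mirrors hashlib; the stdlib's hashlib.blake2b is shown for comparison):

import hashlib

from blake3 import blake3  # third-party: pip install blake3

data = b"model bytes ..."  # stand-in for the real input

# BLAKE3 is a tree hash, so implementations can parallelize internally
# without a custom sharding scheme layered on top.
print(blake3(data).hexdigest())

# BLAKE2b ships in the Python standard library and is also very fast.
print(hashlib.blake2b(data).hexdigest())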

@laurentsimon
Contributor Author

BLAKE3: great idea. We've had this tracking issue for a while: sigstore/model-transparency#13. I didn't realize BLAKE3 does sort of what we're doing (tree-based hashing), which is also similar to dm-crypt.

@laurentsimon
Contributor Author

One thing to be aware of is that blake2 is not a NIST hash, and blake3 seems based on a similar idea (reduced number of rounds in the compression function). It also fixes the shard size to 1KB, which we would likely want to be parameterizable in practice. If those were possible, I think it'd do exactly what we want.

@woodruffw
Member

One thing to be aware of is that blake2 is not a NIST hash, and blake3 seems based on a similar idea (reduced number of rounds in the compression function).

Yeah, my recollection is that Blake (from which Blake2 was derived) was part of the SHA-3 competition but wasn't selected, while Blake3 postdates the competition by a few years. But Blake2 and Blake3 are both widely adopted and considered strong constructions.

It also fixes the shard size to 1KB, which we would likely want to be parameterizable in practice. If those were possible, I think it'd do exactly what we want.

Just for my understanding: do you need to parametrize the shard size for performance reasons, or something else?

(I don't know your performance constraints, but I'm wondering if the perfect might be the enemy of the good here 🙂 -- even Blake2b is really fast and much easier to add here + get integrated into the specs versus a new custom parameter set.)

@laurentsimon
Contributor Author

laurentsimon commented May 15, 2024

Just for my understanding: do you need to parametrize the shard size for performance reasons, or something else?

Yes. We suspect some teams will want to optimize it for their own machines. We need to benchmark blake3: their evaluation seems to be on a small 16KB stream and does not read from disk, so they are only CPU-bound in their tests. Maybe blake3 is good enough for a first release, but we may still need a way to allow customizing parameters in the future, so having a clear path for it would be useful. Thanks again for the tips / suggestions!

@haydentherapper
Contributor

We should specify the hash in the protobuf-specs repo, because this is effectively the client spec-as-code. Ideally, each client would support every set of hashes specified in the protobuf-specs repo, or at least we would have a way to determine what features each client supports (sigstore/sigstore-conformance#144). Specifying this in the repo is also our way to drive consensus across clients, again because I'd like to see eventual consistency across clients for the algorithms (signature and hash) they support.

Personally, I would prefer to see more hashes supported rather than the digest wrapping.

Also this wasn't asked, but note that Rekor should be fine with any intoto payload.

As an aside, there is an argument for hashedrekord being the only structure across Sigstore clients, and each client should provide some sort of plugin system that allows users to specify how to canonicalize their input and how to verify the artifact, with clients providing defaults for binaries and DSSEs (or jars, helm charts, etc). But for now, hashed artifacts + DSSEs is a good state to be in.

@laurentsimon
Contributor Author

laurentsimon commented May 15, 2024

We should specify the hash in the protobuf-specs repo, because this is effectively the client spec-as-code. Ideally, each client would support every set of hashes specified in the protobuf-specs repo, or at least we would have a way to determine what features each client supports (sigstore/sigstore-conformance#144). Specifying this in the repo is also our way to drive consensus across clients, again because I'd like to see eventual consistency across clients for the algorithms (signature and hash) they support.

Another consideration is that the model signing library won't support only Sigstore. I think we want to be able to support other digests that companies might want to use internally (even if we don't know about them), while allowing them to re-use all the rest of the model-signing library. That probably just means that Sigstore will only support a pre-defined set of hashes, but that other signers (PKISigner, etc.) should be able to use their own (advanced-user mode).

Also this wasn't asked, but note that Rekor should be fine with any intoto payload.

That's great!

@woodruffw
Member

Closing in favor of #982 🙂
