feat: add BigDecimal support #91
Conversation
I agree with @matklad regarding using bytes instead of a string representation
This moves in the right direction, but I think it still doesn't enforce canonicity/consistency/1-to-1.

In other words, we care about two properties:

1. If a == b, then serialized(a) == serialized(b). We care super strongly about this one.
2. If deserialize(a) == deserialize(b), then a == b. We care strongly about this one, but there are a couple of open bugs about it.

The current implementation seems to violate both. In particular, if we look at the Eq impl for BigDecimal, we see that it canonicalizes the representation on comparison: so it seems that we need to do some extra legwork on the outside to achieve canonicity. I think what we want to achieve is to use to_le_bytes serialization, and enforce that both the first and the last serialized bytes are non-zero.
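A minimal sketch of the invariant being discussed, in plain Rust (hypothetical helper names, assuming a little-endian digit-byte representation). Note that, as pointed out later in the thread, only the last (most-significant) byte can be required non-zero, since in a little-endian encoding the first byte of a valid value may legitimately be zero:

```rust
// Hypothetical sketch: enforce a canonical little-endian byte encoding
// by rejecting representations with trailing (most-significant) zero bytes.
fn is_canonical_le(bytes: &[u8]) -> bool {
    // Zero is canonically the empty byte string; otherwise the last
    // (most significant) byte must be non-zero.
    bytes.last().map_or(true, |&b| b != 0)
}

fn canonicalize_le(mut bytes: Vec<u8>) -> Vec<u8> {
    // Strip trailing zero bytes so every value has exactly one encoding.
    while bytes.last() == Some(&0) {
        bytes.pop();
    }
    bytes
}

fn main() {
    assert!(is_canonical_le(&[]));      // zero: empty is canonical
    assert!(is_canonical_le(&[0, 1])); // 256: the low byte may be zero
    assert!(!is_canonical_le(&[1, 0])); // padded 1: non-canonical
    assert_eq!(canonicalize_le(vec![1, 0, 0]), vec![1]);
}
```

With such a check, a deserializer can reject any padded input, giving the deserialize(a) == deserialize(b) ⇒ a == b property for the magnitude bytes.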
Am I right that
You mean
Sorry, I can't catch this idea. Do we need the first and the last bytes to be non-zero because we need to know the borders? What if we have a zero byte in the middle of the vector?
borsh/src/de/mod.rs (Outdated)

```rust
#[cfg(feature = "bigdecimal")]
impl BorshDeserialize for bigdecimal::num_bigint::BigInt {
```
I just wonder if this implementation should come from a separate feature and be implemented directly for that crate. Just thinking of an edge case where someone might want to have num_bigint impls without having to compile bigdecimal.
Even if we wanted to do this, or if someone has an issue, we should be able to do it later without it being a breaking change, so I think it's fine as is.
oh oops, forgot to note the canonical form of these. Let me look more in depth now
The issue is just if there are zeros padding the right side of the serialized bytes. You can't assert the first byte is non-zero because the encoding is little-endian, so the least-significant byte of a perfectly valid value can be zero (e.g. 256 encodes as [0, 1]).
Nullifying approval to look more into canonicalization. Also, I think the format can be more compact if not using u32 to serialize/deserialize, because there will be redundant zero bytes serialized/deserialized. u32 is optimized for in-memory representation, not for serialization.
borsh/src/ser/mod.rs (Outdated)

```rust
#[cfg(feature = "bigdecimal")]
impl BorshSerialize for bigdecimal::num_bigint::BigInt {
```
Noting that if we provide BigInt, then BigUint is also a common type used for various reasons, so it's possibly worth adding that. Can be done in a separate PR though (and we can just open an issue if not done here).
Yup, that would be logically correct but (I think) would be slower than the current approach (level of confidence: 0.6).
Please have a look again.

I added the check you wanted.
borsh/src/de/mod.rs (Outdated)

```rust
sign == bigdecimal::num_bigint::Sign::NoSign
    || (!digits.is_empty() && digits.last().unwrap() > &0)
```
This check still allows for non-canonical deserialization if the sign is NoSign, since in that case it doesn't check for the padded bytes.
Test case I used to verify:

```rust
assert_eq!(BigInt::try_from_slice(&[1, 0, 0, 0, 0]).unwrap(), BigInt::default());
assert_eq!(BigInt::try_from_slice(&[1, 1, 0, 0, 0, 0]).unwrap(), BigInt::default());
assert_eq!(BigInt::try_from_slice(&[1, 2, 0, 0, 0, 0, 0]).unwrap(), BigInt::default());
```
Also, side note: wondering why unsigned varints aren't used for the byte length? It seems weird that we are encoding a u32 (4 bytes) for the length even when it's small. Given that most size indicators are small, and at most a varint would be 5 bytes, it seems like it would be more compact in most cases. Was this done to minimize complexity and code size rather than serialization size? cc @matklad @frol
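For reference, the unsigned varint scheme the comment refers to is LEB128-style base-128 encoding (as used by protobuf and wasm): 7 value bits per byte, with the high bit set on every byte except the last. A self-contained sketch (not what borsh currently does):

```rust
// LEB128-style unsigned varint: 7 value bits per byte, little-endian groups,
// continuation bit (0x80) set on all but the final byte.
fn encode_varint(mut n: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}

/// Returns the decoded value and how many input bytes were consumed.
fn decode_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut n: u64 = 0;
    for (i, &b) in bytes.iter().enumerate() {
        if i >= 10 {
            return None; // a u64 needs at most 10 varint bytes
        }
        n |= u64::from(b & 0x7f) << (7 * i as u32);
        if b & 0x80 == 0 {
            return Some((n, i + 1));
        }
    }
    None // input ended in the middle of a varint
}

fn main() {
    assert_eq!(encode_varint(5), vec![0x05]); // 1 byte instead of 4 for u32
    assert_eq!(encode_varint(300), vec![0xac, 0x02]); // 2 bytes
    assert_eq!(decode_varint(&[0xac, 0x02]), Some((300, 2)));
}
```

This is where the "at max 5 bytes" figure comes from: a u32 length needs ceil(32/7) = 5 varint bytes in the worst case, but only one byte for lengths under 128.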
Only the last one failed; added this test.
But only one should pass, though. We need a canonical representation of 0; I think it should be [] and not [0], for compactness.
Missed the Q about varints, sorry. I don't know the motivation for using u32 rather than varints for lengths. Abstractly, I think I agree that varints would have been a better choice, as they are more compact, and would also get rid of an error condition when serializing vectors and such.

However, we already have a precedent of using u32 to encode container lengths, and I think it's important that we use the same repr for bigints. That is, using varints is more efficient for bigints, but this isn't an important optimization, because there are far more array and string lengths than bigint lengths. OTOH, doing something different for bigints adds a complexity cost.
Current state doesn't ensure canonicity for bigints -- there are two ways to encode 0. I've added a fuzz test to uncover this case automatically, but it hit an overflow in the underlying bigdecimal library. See the repros in
I just want to sum it up for the future: we decided that we don't need BigDecimal support in the indexer, and thus this PR got stuck with some of the reviewers' concerns unaddressed. If someone else needs it in the future, feel free to take it from here.
I think it's worth doing; it seems like a common use case that someone would want to use big integers for contracts. My plan if I do (make a comment if you disagree with anything):
Okay, so the changes for switching to varints are in the commits. It's not clear to me the complexity is worth it, and we could compact the number of bytes even further if we wanted (using a full byte for the sign, which should require at most 2 bits, and maybe condensing the actual value encoding). However, it seems like a clear win for optimizing encoding size. I don't hold strong opinions on this, so please share any reasons why we shouldn't do it.
@austinabell could you write a public doc comment somewhere which specifies how exactly we serialize things? This is to kill two birds with one stone:
I do think that we want to support this capability in the format.
My thinking is that it would only be worth it for BigInt and not BigDecimal; would you agree? BigDecimal is just an abstraction on top of existing types, so maybe it makes sense for it to live outside of the borsh implementations.

Also, because it seems there are bugs in the bigdecimal impl (at least akubera/bigdecimal-rs#94 and akubera/bigdecimal-rs#93, but I see others) and it's an opinionated implementation, I'd be a proponent of just removing this from the PR. Technically the same "decimal" number can have multiple serialization formats (since the integer can just be multiplied/divided by 10 for every exponent change), so it feels like we should just avoid this can of worms altogether. One could just add the borsh impls to that library or create their own type for this.

As for the BigUint implementation, it is very similar to bytes deserialization, except that the LE bytes length is a varint rather than a u32.
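A rough sketch of the BigUint layout described above (hypothetical function names; a toy varint encoder is inlined to keep it self-contained, and the input is taken as raw little-endian magnitude bytes rather than an actual num_bigint type):

```rust
// Toy LEB128-style unsigned varint encoder (assumption: this is the
// "varint length" scheme referred to in the discussion).
fn encode_varint(mut n: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}

// Hypothetical BigUint serialization: varint-encoded byte length, followed
// by canonical (trailing zeros stripped) little-endian magnitude bytes.
fn serialize_biguint(le_bytes: &[u8]) -> Vec<u8> {
    // Trim trailing zero bytes so every value has exactly one encoding;
    // zero becomes the empty digit string, matching the [] convention above.
    let end = le_bytes.iter().rposition(|&b| b != 0).map_or(0, |i| i + 1);
    let digits = &le_bytes[..end];
    let mut out = encode_varint(digits.len() as u64);
    out.extend_from_slice(digits);
    out
}

fn main() {
    assert_eq!(serialize_biguint(&[0, 0]), vec![0]); // zero: empty digits
    assert_eq!(serialize_biguint(&[1, 0]), vec![1, 1]); // 1
    assert_eq!(serialize_biguint(&[0, 1]), vec![2, 0, 1]); // 256
}
```

The signed BigInt variant would additionally carry a sign, which is where the "2 bits max" observation in the thread comes from.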
👍 on removing bigdecimal from this PR.
BigDecimal is opinionated and not defined as a borsh type yet, so I will close this PR for now until we absolutely need it (same as #109).
I need it for our new version of Indexer https://github.com/telezhnaya/near-lake-flows-into-sql-base