
feat: add BigDecimal support #91

Closed
wants to merge 14 commits into from

Conversation

telezhnaya

I need it for our new version of the Indexer: https://github.com/telezhnaya/near-lake-flows-into-sql-base

Resolved review threads: borsh/Cargo.toml, borsh/src/de/mod.rs (two threads), borsh/src/ser/mod.rs
Collaborator

@frol frol left a comment

I agree with @matklad regarding using bytes instead of a string representation

Resolved review thread: borsh/tests/test_big_decimal.rs
@telezhnaya
Author

@matklad @frol thanks a lot for the reviews!
Please have a look again

@matklad
Contributor

matklad commented Mar 23, 2022

This moves in the right direction, but I think it still doesn't enforce canonicity/consistency/1-to-1.

Consistency means there is a bijective mapping between objects and their binary representations: no two binary representations deserialize into the same object. This is extremely useful for applications that use the binary representation to compute a hash.

In other words, we care about two properties:

if a == b, then serialized(a) == serialized(b). This one we care about super strongly.

if deserialize(a) == deserialize(b), then a == b. This one we care about strongly, but there are a couple of open bugs about that.

The current implementation seems to violate both. In particular, if we look at the impl of Eq for big decimal, we see that it canonicalizes the representation on comparison:

https://github.com/akubera/bigdecimal-rs/blob/272123eddc02aad5d1a331ded8eb6605ed2bea96/src/lib.rs#L797

so it seems that we need to do some extra legwork on the outside to achieve canonicity. I think what we want is to use to_le_bytes serialization and enforce that both the first and the last serialized bytes are non-zero.
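The two properties above can be illustrated with a toy canonical codec over u64 (illustrative only, not borsh's or this PR's actual encoding): serialization strips trailing zero bytes so equal values share exactly one encoding, and deserialization rejects padded inputs.

```rust
// Toy canonical codec: serialize strips trailing zero bytes from the LE
// representation, so equal values share exactly one encoding, and
// deserialize rejects any padded (non-canonical) input.
fn serialize(v: u64) -> Vec<u8> {
    let mut bytes = v.to_le_bytes().to_vec();
    while bytes.last() == Some(&0) {
        bytes.pop();
    }
    bytes
}

fn deserialize(bytes: &[u8]) -> Option<u64> {
    if bytes.len() > 8 || bytes.last() == Some(&0) {
        return None; // too long, or non-canonical trailing zero
    }
    let mut buf = [0u8; 8];
    buf[..bytes.len()].copy_from_slice(bytes);
    Some(u64::from_le_bytes(buf))
}

fn main() {
    // Property 1: a == b implies serialized(a) == serialized(b).
    assert_eq!(serialize(300), vec![44, 1]); // 300 = 44 + 256
    // Property 2: only the canonical form deserializes.
    assert_eq!(deserialize(&[44, 1]), Some(300));
    assert_eq!(deserialize(&[44, 1, 0]), None);
    // Zero's canonical encoding is the empty byte string.
    assert_eq!(serialize(0), Vec::<u8>::new());
}
```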

Resolved review thread: borsh/src/de/mod.rs
@telezhnaya
Author

@matklad

In particular, if we look at the impl for Eq for big decimal, we see that it canonicalizes the representation on comparison: ...

Am I right that eq could be simplified by invoking normalized on both parameters and then simply comparing the fields?

to use to_le_bytes serialization

You mean to_bytes_le, right?

and enforce that both the first and the last serializes bytes are non-zero.

Sorry, I can't quite catch this idea. Do we need the first and last bytes to be non-zero because we need to know the borders? What if we have a zero byte in the middle of the vector?

Comment on lines 310 to 311
#[cfg(feature = "bigdecimal")]
impl BorshDeserialize for bigdecimal::num_bigint::BigInt {
Contributor

I just wonder if this implementation should come from a separate feature and be implemented directly for that crate. Just thinking of an edge case where someone might want the num_bigint impls without having to compile bigdecimal.

Even if we wanted to do this, or if someone has an issue, we should be able to do it later without it being a breaking change, so I think it's fine as is.

Resolved review thread: borsh/tests/test_big_decimal.rs
@austinabell
Contributor

Oh oops, forgot to note the canonical form of these. Let me look more in depth now.

Sorry, I can't quite catch this idea. Do we need the first and last bytes to be non-zero because we need to know the borders? What if we have a zero byte in the middle of the vector?

The issue is just if there are zeros padding the right side of the LE bytes: [1, 0] is the same as [1] and [1, 0, 0].

You can't assert that the first byte is non-zero, because 256 in LE bytes is [0, 1].
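The padding rule above can be captured in a one-line check (a sketch, not code from this PR): only the last byte of the little-endian magnitude matters.

```rust
// Sketch of the canonicality rule described above: a little-endian magnitude
// is canonical iff it has no trailing zero byte. Leading zeros are fine,
// since 256 legitimately encodes as [0, 1].
fn is_canonical_le(magnitude: &[u8]) -> bool {
    // an empty slice (the integer zero) is canonical
    magnitude.last().map_or(true, |b| *b != 0)
}

fn main() {
    assert!(is_canonical_le(&[]));      // zero
    assert!(is_canonical_le(&[1]));     // 1
    assert!(is_canonical_le(&[0, 1])); // 256: leading zero allowed
    assert!(!is_canonical_le(&[1, 0])); // padded form of 1: rejected
}
```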

Contributor

@austinabell austinabell left a comment

Nullifying approval to look more into canonicalization. Also, I think the format can be more compact if we don't use u32 to serialize/deserialize, because redundant zero bytes will be serialized/deserialized; u32 is optimized for the in-memory representation, not for serialization.

Resolved review thread: borsh/src/ser/mod.rs
}

#[cfg(feature = "bigdecimal")]
impl BorshSerialize for bigdecimal::num_bigint::BigInt {
Contributor

Noting that if we provide BigInt, BigUInt is also a common type used for various reasons, so it's possibly worth adding as well. Can be done in a separate PR though (and we can just open an issue if it's not done here).

Resolved review thread: borsh/src/de/mod.rs
@matklad
Contributor

matklad commented Mar 23, 2022

Am I right that eq could be simplified by invoking normalized on both parameters and then simply comparing the fields?

Yup, that would be logically correct, but (I think) it would be slower than the current approach (level of confidence: 0.6).

@telezhnaya
Author

Please have a look again

Resolved review thread: borsh/src/ser/mod.rs
@telezhnaya
Author

I added the check you wanted

Resolved review thread: borsh/src/de/mod.rs
Comment on lines 314 to 315
sign == bigdecimal::num_bigint::Sign::NoSign
|| (!digits.is_empty() && digits.last().unwrap() > &0)
Contributor

This check still allows non-canonical deserialization when the sign is NoSign, since in that case it doesn't check for padded bytes.

Test cases I used to verify:

assert_eq!(BigInt::try_from_slice(&[1,0,0,0,0]).unwrap(), BigInt::default());
assert_eq!(BigInt::try_from_slice(&[1,1,0,0,0,0]).unwrap(), BigInt::default());
assert_eq!(BigInt::try_from_slice(&[1,2,0,0,0,0,0]).unwrap(), BigInt::default());
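A version of the check that also handles the NoSign case might look like this (a sketch; Sign here is a local stand-in for bigdecimal::num_bigint::Sign, not the PR's actual code):

```rust
// Stand-in for bigdecimal::num_bigint::Sign, to keep the sketch self-contained.
#[allow(dead_code)]
enum Sign {
    Minus,
    NoSign,
    Plus,
}

// Canonicality check that also rejects padded bytes in the NoSign (zero) case.
fn is_canonical(sign: &Sign, digits: &[u8]) -> bool {
    match sign {
        // zero must have exactly one encoding: an empty digit buffer
        Sign::NoSign => digits.is_empty(),
        // nonzero magnitudes must not end in a zero (padding) byte
        _ => !digits.is_empty() && *digits.last().unwrap() > 0,
    }
}

fn main() {
    assert!(is_canonical(&Sign::NoSign, &[]));
    assert!(!is_canonical(&Sign::NoSign, &[0, 0, 0])); // padded zero: rejected
    assert!(is_canonical(&Sign::Plus, &[0, 1]));       // 256
    assert!(!is_canonical(&Sign::Plus, &[1, 0]));      // padded 1: rejected
}
```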

Also, side note: I'm wondering why unsigned varints aren't used for the byte length? It seems weird that we encode a u32 (4 bytes) for the length even when it's small. Given that most size indicators are small, and at most a varint would be 5 bytes, it seems like it would be more compact in most cases. Was this done to minimize complexity and code size rather than serialization size? cc @matklad @frol

Author

Only the last one failed; added this test.

Contributor

Only one should pass, though. We need a canonical representation of 0; I think it should be [] and not [0], for compactness.

Contributor

Missed the Q about varints, sorry. I don't know the motivation for using u32 rather than varints for lengths. Abstractly, I think I agree that varint would've been a better choice, as it would be more compact, and would also get rid of an error condition when serializing vectors and such.

However, we already have a precedent of using u32 to encode container lengths, and I think it's important that we use the same repr for bigints.

That is, using varints is more efficient for bigints, but this isn't an important optimization, because there are far more array & string lengths than bigint lengths.

OTOH, doing something different for bigints adds complexity cost.
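For reference, the unsigned varint scheme under discussion (LEB128, the same encoding protobuf and WebAssembly use for lengths) can be sketched in a few lines: 7 payload bits per byte, with the high bit set on every byte except the last.

```rust
// Minimal unsigned LEB128 ("unsigned varint") encoding, to make the size
// tradeoff concrete: small lengths take one byte instead of u32's four.
fn encode_uvarint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit: more bytes follow
    }
}

fn main() {
    let mut buf = Vec::new();
    encode_uvarint(5, &mut buf);
    assert_eq!(buf, [5]); // one byte, versus [5, 0, 0, 0] as a u32
    buf.clear();
    encode_uvarint(300, &mut buf);
    assert_eq!(buf, [0xac, 0x02]); // 300 = 0b10_0101100
}
```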

Resolved review thread: borsh/src/ser/mod.rs
@matklad
Contributor

matklad commented Mar 29, 2022

The current state doesn't ensure canonicity for bigints -- there are two ways to encode 0.

I've added a fuzz test to uncover this case automatically, but it hit an overflow in the underlying bigdecimal library.

See the repros in

matklad@a6e50f5

@frol
Collaborator

frol commented Aug 4, 2022

I just want to sum it up for the future: we decided that we don't need BigDecimal support in the indexer, and thus this PR got stuck with some of the reviewers' concerns unaddressed. If someone else needs it in the future, feel free to take it from here.

@frol frol marked this pull request as draft August 4, 2022 11:16
@austinabell
Contributor

I just want to sum it up for the future: we decided that we don't need BigDecimal support in the indexer, and thus this PR got stuck with some of the reviewers' concerns unaddressed. If someone else needs it in the future, feel free to take it from here.

I think it's worth doing; it seems like a common use case that someone would want to use big integers in contracts.

My plan if I do (leave a comment if you disagree with anything):

  • Make the canonical BigUInt 0 be serialized as [] ([0,0,0,0] in borsh because of the unfortunate u32 length serialization), and reject padding
  • Split the feature into num_bigint and bigdecimal, and add support for BigUInt
  • Explore the maximums that can be serialized with borsh; consider making it error gracefully if it hits the cap
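The size consequence in the first bullet can be sketched as follows (serialize_magnitude is a hypothetical helper, not borsh-rs API): an empty magnitude under borsh's u32 length prefix still costs four zero bytes on the wire.

```rust
// Hypothetical helper showing borsh's u32-length-prefixed byte encoding:
// 4 LE length bytes, then the little-endian magnitude bytes.
fn serialize_magnitude(le_bytes: &[u8]) -> Vec<u8> {
    let mut out = (le_bytes.len() as u32).to_le_bytes().to_vec();
    out.extend_from_slice(le_bytes);
    out
}

fn main() {
    // canonical zero: empty magnitude, encoded as just the length prefix
    assert_eq!(serialize_magnitude(&[]), vec![0, 0, 0, 0]);
    // 256: length 2, then LE bytes [0, 1]
    assert_eq!(serialize_magnitude(&[0, 1]), vec![2, 0, 0, 0, 0, 1]);
}
```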

@austinabell austinabell self-requested a review October 13, 2022 02:56
@austinabell
Contributor

Okay, so the changes for switching to varints are commits 21d4857 (#91) and bb625ee (#91).

It's not clear to me whether the complexity is worth it, and we could compact the encoding even further if we wanted (a full byte is used for the sign, which should require at most 2 bits, and the value encoding itself could maybe be condensed). However, it seems like a clear win for encoding size. I don't hold strong opinions on this, so please share any reasons why we shouldn't do it.

@austinabell austinabell marked this pull request as ready for review October 13, 2022 03:02
@matklad
Contributor

matklad commented Oct 13, 2022

@austinabell could you write a public doc comment somewhere which specifies how exactly we serialize things? This is to kill two birds with one stone:

  • make it easier to review code, by spelling out what it intends to do
  • borsh is a language-independent format, so what we are doing here is not adding support for a bigint library to the borsh-rs crate, but rather adding support for arbitrarily large integers to the borsh format. For the latter, I think it's useful to have docs so that, e.g., the js version of borsh can add support for bigints.

I do think that we want to support this capability in the format.

@austinabell
Contributor

austinabell commented Oct 13, 2022

@austinabell could you write a public doc comment somewhere which specifies how exactly we serialize things? This is to kill two birds with one stone:

  • make it easier to review code, by spelling out what it intends to do
  • borsh is a language-independent format, so what we are doing here is not adding support for a bigint library to the borsh-rs crate, but rather adding support for arbitrarily large integers to the borsh format. For the latter, I think it's useful to have docs so that, e.g., the js version of borsh can add support for bigints.

I do think that we want to support this capability in the format.

My thinking is that it would only be worth it for big int and not big decimal, would you agree? BigDecimal is just an abstraction on top of existing types, so maybe it makes sense for it to live outside of borsh implementations?

Also, because there seem to be bugs in the bigdecimal impl (at least akubera/bigdecimal-rs#94 and akubera/bigdecimal-rs#93, but I see others) and it's an opinionated implementation, I'd be a proponent of just removing it from this PR. Technically the same "decimal" number can have multiple serialization formats (since the integer can just be multiplied/divided by 10 for every exponent change), so it feels like we should avoid this can of worms altogether. One could just add the borsh impls to that library, or create their own type for this.

As for the BigUint implementation, it is very similar to bytes deserialization except that the LE bytes length is a varint rather than a u32 (which bytes probably should have been to begin with). I can add that specification to https://github.com/near/borsh once it's decided whether we want to keep/include bigdecimal.
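That shape can be sketched as follows (read_biguint_le is a hypothetical helper, not the PR's actual code): read an unsigned-varint length, then that many little-endian magnitude bytes, rejecting trailing zero padding so every integer has exactly one encoding.

```rust
// Sketch of varint-length BigUint deserialization with a canonicality check.
fn read_biguint_le(input: &[u8]) -> Result<(Vec<u8>, &[u8]), &'static str> {
    // decode an unsigned LEB128 length
    let mut len: u64 = 0;
    let mut shift = 0u32;
    let mut i = 0usize;
    loop {
        let b = *input.get(i).ok_or("unexpected EOF in length")?;
        len |= u64::from(b & 0x7f) << shift;
        i += 1;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // read exactly `len` magnitude bytes
    let end = i + len as usize;
    let magnitude = input.get(i..end).ok_or("unexpected EOF in magnitude")?;
    // reject padded (non-canonical) encodings
    if magnitude.last() == Some(&0) {
        return Err("non-canonical: trailing zero byte");
    }
    Ok((magnitude.to_vec(), &input[end..]))
}

fn main() {
    let (mag, rest) = read_biguint_le(&[0]).unwrap(); // zero: empty magnitude
    assert!(mag.is_empty() && rest.is_empty());
    let (mag, _) = read_biguint_le(&[2, 0, 1]).unwrap(); // 256
    assert_eq!(mag, vec![0, 1]);
    assert!(read_biguint_le(&[2, 1, 0]).is_err()); // padded 1: rejected
}
```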

@matklad
Contributor

matklad commented Oct 14, 2022

👍on removing bigdecimal from this PR.

@frol
Collaborator

frol commented May 25, 2023

BigDecimal is opinionated and not defined as a borsh type yet, so I will close this PR for now, until we absolutely need it (same as #109).

@frol frol closed this May 25, 2023