
feat: add BigDecimal support #91

Closed
wants to merge 14 commits into from

Conversation

telezhnaya

I need it for our new version of the Indexer: https://github.com/telezhnaya/near-lake-flows-into-sql-base

Resolved review threads: borsh/Cargo.toml, borsh/src/de/mod.rs (two threads), borsh/src/ser/mod.rs
Collaborator

@frol frol left a comment

I agree with @matklad regarding using bytes instead of a string representation

Resolved review thread: borsh/tests/test_big_decimal.rs
@telezhnaya
Author

@matklad @frol thanks a lot for the reviews!
Please have a look again

@matklad
Contributor

matklad commented Mar 23, 2022

This moves in the right direction, but I think it still doesn't enforce canonicity/consistency/1-to-1.

Consistency means there is a bijective mapping between objects and their binary representations: no two binary representations deserialize into the same object. This is extremely useful for applications that use the binary representation to compute a hash.

In other words, we care about two properties:

if a == b, then serialized(a) == serialized(b). This one we care about super strongly.

if deserialize(a) == deserialize(b), then a == b. This one we care about strongly, but there are a couple of open bugs about that.

The current implementation seems to violate both. In particular, if we look at the impl of Eq for big decimal, we see that it canonicalizes the representation on comparison:

https://github.com/akubera/bigdecimal-rs/blob/272123eddc02aad5d1a331ded8eb6605ed2bea96/src/lib.rs#L797

so it seems that we need to do some extra legwork on the outside to achieve canonicity. I think what we want is to use to_le_bytes serialization and enforce that both the first and the last serialized bytes are non-zero.
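The two properties above can be illustrated with a toy canonical codec over u64 (illustrative only, not borsh's or this PR's actual encoding): serialization strips trailing zero bytes so equal values share exactly one encoding, and deserialization rejects padded inputs.

```rust
// Toy canonical codec: serialize strips trailing zero bytes from the LE
// representation, so equal values share exactly one encoding, and
// deserialize rejects any padded (non-canonical) input.
fn serialize(v: u64) -> Vec<u8> {
    let mut bytes = v.to_le_bytes().to_vec();
    while bytes.last() == Some(&0) {
        bytes.pop();
    }
    bytes
}

fn deserialize(bytes: &[u8]) -> Option<u64> {
    if bytes.len() > 8 || bytes.last() == Some(&0) {
        return None; // too long, or non-canonical trailing zero
    }
    let mut buf = [0u8; 8];
    buf[..bytes.len()].copy_from_slice(bytes);
    Some(u64::from_le_bytes(buf))
}

fn main() {
    // Property 1: a == b implies serialized(a) == serialized(b).
    assert_eq!(serialize(300), vec![44, 1]); // 300 = 44 + 256
    // Property 2: only the canonical form deserializes.
    assert_eq!(deserialize(&[44, 1]), Some(300));
    assert_eq!(deserialize(&[44, 1, 0]), None);
    // Zero's canonical encoding is the empty byte string.
    assert_eq!(serialize(0), Vec::<u8>::new());
}
```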

Resolved review thread: borsh/src/de/mod.rs
@telezhnaya
Author

@matklad

In particular, if we look at the impl for Eq for big decimal, we see that it canonicalizes the representation on comparison: ...

Am I right that eq could be simplified by invoking normalized on both parameters and then simply comparing the fields?

to use to_le_bytes serialization

You mean to_bytes_le, right?

and enforce that both the first and the last serializes bytes are non-zero.

Sorry, I can't quite catch this idea. Do we need the first and last bytes to be non-zero because we need to know the borders? What if we have a zero byte in the middle of the vector?

Comment on lines 310 to 311
#[cfg(feature = "bigdecimal")]
impl BorshDeserialize for bigdecimal::num_bigint::BigInt {
Contributor

I just wonder if this implementation should come from a separate feature and be implemented directly for that crate. Just thinking of an edge case where someone might want the num_bigint impls without having to compile bigdecimal.

Even if we wanted to do this, or if someone has an issue, we should be able to do it later without it being a breaking change, so I think it's fine as is.

Resolved review thread: borsh/tests/test_big_decimal.rs
@austinabell
Contributor

Oh oops, forgot to note the canonical form of these. Let me look more in depth now.

Sorry, I can't quite catch this idea. Do we need the first and last bytes to be non-zero because we need to know the borders? What if we have a zero byte in the middle of the vector?

The issue is just if there are zeros padding the right side of the LE bytes: [1, 0] is the same as [1] and [1, 0, 0].

You can't assert that the first byte is non-zero, because 256 in LE bytes is [0, 1].
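The padding rule above can be captured in a one-line check (a sketch, not code from this PR): only the last byte of the little-endian magnitude matters.

```rust
// Sketch of the canonicality rule described above: a little-endian magnitude
// is canonical iff it has no trailing zero byte. Leading zeros are fine,
// since 256 legitimately encodes as [0, 1].
fn is_canonical_le(magnitude: &[u8]) -> bool {
    // an empty slice (the integer zero) is canonical
    magnitude.last().map_or(true, |b| *b != 0)
}

fn main() {
    assert!(is_canonical_le(&[]));      // zero
    assert!(is_canonical_le(&[1]));     // 1
    assert!(is_canonical_le(&[0, 1])); // 256: leading zero allowed
    assert!(!is_canonical_le(&[1, 0])); // padded form of 1: rejected
}
```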

Contributor

@austinabell austinabell left a comment

Nullifying approval to look more into canonicalization. Also, I think the format can be more compact if we don't use u32 to serialize/deserialize, because redundant zero bytes will be serialized/deserialized; u32 is optimized for the in-memory representation, not for serialization.

Resolved review thread: borsh/src/ser/mod.rs
}

#[cfg(feature = "bigdecimal")]
impl BorshSerialize for bigdecimal::num_bigint::BigInt {
Contributor

Noting that if we provide BigInt, BigUInt is also a common type used for various reasons, so it's possibly worth adding as well. Can be done in a separate PR though (and we can just open an issue if it's not done here).

Resolved review thread: borsh/src/de/mod.rs
@matklad
Contributor

matklad commented Mar 23, 2022

Am I right that eq could be simplified by invoking normalized on both parameters and then simply comparing the fields?

Yup, that would be logically correct, but (I think) it would be slower than the current approach (level of confidence: 0.6).

@telezhnaya
Author

Please have a look again

Resolved review thread: borsh/src/ser/mod.rs
@telezhnaya
Author

I added the check you wanted

Resolved review thread: borsh/src/de/mod.rs
Comment on lines 314 to 315
sign == bigdecimal::num_bigint::Sign::NoSign
|| (!digits.is_empty() && digits.last().unwrap() > &0)
Contributor

This check still allows non-canonical deserialization when the sign is NoSign, since in that case it doesn't check for padded bytes.

Test cases I used to verify:

assert_eq!(BigInt::try_from_slice(&[1,0,0,0,0]).unwrap(), BigInt::default());
assert_eq!(BigInt::try_from_slice(&[1,1,0,0,0,0]).unwrap(), BigInt::default());
assert_eq!(BigInt::try_from_slice(&[1,2,0,0,0,0,0]).unwrap(), BigInt::default());
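A version of the check that also handles the NoSign case might look like this (a sketch; Sign here is a local stand-in for bigdecimal::num_bigint::Sign, not the PR's actual code):

```rust
// Stand-in for bigdecimal::num_bigint::Sign, to keep the sketch self-contained.
#[allow(dead_code)]
enum Sign {
    Minus,
    NoSign,
    Plus,
}

// Canonicality check that also rejects padded bytes in the NoSign (zero) case.
fn is_canonical(sign: &Sign, digits: &[u8]) -> bool {
    match sign {
        // zero must have exactly one encoding: an empty digit buffer
        Sign::NoSign => digits.is_empty(),
        // nonzero magnitudes must not end in a zero (padding) byte
        _ => !digits.is_empty() && *digits.last().unwrap() > 0,
    }
}

fn main() {
    assert!(is_canonical(&Sign::NoSign, &[]));
    assert!(!is_canonical(&Sign::NoSign, &[0, 0, 0])); // padded zero: rejected
    assert!(is_canonical(&Sign::Plus, &[0, 1]));       // 256
    assert!(!is_canonical(&Sign::Plus, &[1, 0]));      // padded 1: rejected
}
```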

Also, side note: I'm wondering why unsigned varints aren't used for the byte length? It seems weird that we encode a u32 (4 bytes) for the length even when it's small. Given that most size indicators are small, and at most a varint would be 5 bytes, it seems like it would be more compact in most cases. Was this done to minimize complexity and code size rather than serialization size? cc @matklad @frol

Author

Only the last one failed; added this test.

Contributor

Only one should pass, though. We need a canonical representation of 0; I think it should be [] and not [0], for compactness.

Contributor

Missed the Q about varints, sorry. I don't know the motivation for using u32 rather than varints for lengths. Abstractly, I think I agree that varint would've been a better choice, as it would be more compact, and would also get rid of an error condition when serializing vectors and such.

However, we already have a precedent of using u32 to encode container lengths, and I think it's important that we use the same repr for bigints.

That is, using varints is more efficient for bigints, but this isn't an important optimization, because there are far more array & string lengths than bigint lengths.

OTOH, doing something different for bigints adds complexity cost.
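For reference, the unsigned varint scheme under discussion (LEB128, the same encoding protobuf and WebAssembly use for lengths) can be sketched in a few lines: 7 payload bits per byte, with the high bit set on every byte except the last.

```rust
// Minimal unsigned LEB128 ("unsigned varint") encoding, to make the size
// tradeoff concrete: small lengths take one byte instead of u32's four.
fn encode_uvarint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit: more bytes follow
    }
}

fn main() {
    let mut buf = Vec::new();
    encode_uvarint(5, &mut buf);
    assert_eq!(buf, [5]); // one byte, versus [5, 0, 0, 0] as a u32
    buf.clear();
    encode_uvarint(300, &mut buf);
    assert_eq!(buf, [0xac, 0x02]); // 300 = 0b10_0101100
}
```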

Resolved review thread: borsh/src/ser/mod.rs
@matklad
Contributor

matklad commented Mar 29, 2022

The current state doesn't ensure canonicity for bigints -- there are two ways to encode 0.

I've added a fuzz test to uncover this case automatically, but it hit an overflow in the underlying bigdecimal library.

See the repros in

matklad@a6e50f5

@frol
Collaborator

frol commented Aug 4, 2022

I just want to sum it up for the future: we decided that we don't need BigDecimal support in the indexer, and thus this PR got stuck with some of the reviewers' concerns unaddressed. If someone else needs it in the future, feel free to take it from here.

@frol frol marked this pull request as draft August 4, 2022 11:16
@austinabell
Contributor

I just want to sum it up for the future: we decided that we don't need BigDecimal support in the indexer, and thus this PR got stuck with some of the reviewers' concerns unaddressed. If someone else needs it in the future, feel free to take it from here.

I think it's worth doing; it seems like a common use case that someone would want to use big integers in contracts.

My plan if I do (leave a comment if you disagree with anything):

  • Make the canonical BigUInt 0 be serialized as [] ([0,0,0,0] in borsh because of the unfortunate u32 length serialization), and reject padding
  • Split the feature into num_bigint and bigdecimal, and add support for BigUInt
  • Explore the maximums that can be serialized with borsh; consider making it error gracefully if it hits the cap
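The size consequence in the first bullet can be sketched as follows (serialize_magnitude is a hypothetical helper, not borsh-rs API): an empty magnitude under borsh's u32 length prefix still costs four zero bytes on the wire.

```rust
// Hypothetical helper showing borsh's u32-length-prefixed byte encoding:
// 4 LE length bytes, then the little-endian magnitude bytes.
fn serialize_magnitude(le_bytes: &[u8]) -> Vec<u8> {
    let mut out = (le_bytes.len() as u32).to_le_bytes().to_vec();
    out.extend_from_slice(le_bytes);
    out
}

fn main() {
    // canonical zero: empty magnitude, encoded as just the length prefix
    assert_eq!(serialize_magnitude(&[]), vec![0, 0, 0, 0]);
    // 256: length 2, then LE bytes [0, 1]
    assert_eq!(serialize_magnitude(&[0, 1]), vec![2, 0, 0, 0, 0, 1]);
}
```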

@austinabell austinabell self-requested a review October 13, 2022 02:56
@austinabell
Contributor

Okay, so the changes for switching to varints are commits 21d4857 (#91) and bb625ee (#91).

It's not clear to me whether the complexity is worth it, and we could compact the encoding even further if we wanted (a full byte is used for the sign, which should require at most 2 bits, and the value encoding itself could maybe be condensed). However, it seems like a clear win for encoding size. I don't hold strong opinions on this, so please share any reasons why we shouldn't do it.

@austinabell austinabell marked this pull request as ready for review October 13, 2022 03:02
@matklad
Contributor

matklad commented Oct 13, 2022

@austinabell could you write a public doc comment somewhere which specifies how exactly we serialize things? This is to kill two birds with one stone:

  • make it easier to review code, by spelling out what it intends to do
  • borsh is a language-independent format, so what we are doing here is not adding support for a bigint library to the borsh-rs crate, but rather adding support for arbitrarily large integers to the borsh format. For the latter, I think it's useful to have docs so that, e.g., the js version of borsh can add support for bigints.

I do think that we want to support this capability in the format.

@austinabell
Contributor

austinabell commented Oct 13, 2022

@austinabell could you write a public doc comment somewhere which specifies how exactly we serialize things? This is to kill two birds with one stone:

  • make it easier to review code, by spelling out what it intends to do
  • borsh is a language-independent format, so what we are doing here is not adding support for a bigint library to the borsh-rs crate, but rather adding support for arbitrarily large integers to the borsh format. For the latter, I think it's useful to have docs so that, e.g., the js version of borsh can add support for bigints.

I do think that we want to support this capability in the format.

My thinking is that it would only be worth it for big int and not big decimal, would you agree? BigDecimal is just an abstraction on top of existing types, so maybe it makes sense for it to live outside of borsh implementations?

Also, because there seem to be bugs in the bigdecimal impl (at least akubera/bigdecimal-rs#94 and akubera/bigdecimal-rs#93, but I see others) and it's an opinionated implementation, I'd be a proponent of just removing it from this PR. Technically the same "decimal" number can have multiple serialization formats (since the integer can just be multiplied/divided by 10 for every exponent change), so it feels like we should avoid this can of worms altogether. One could just add the borsh impls to that library, or create their own type for this.

As for the BigUint implementation, it is very similar to bytes deserialization except that the LE bytes length is a varint rather than a u32 (which bytes probably should have been to begin with). I can add that specification to https://github.com/near/borsh once it's decided whether we want to keep/include bigdecimal.
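That shape can be sketched as follows (read_biguint_le is a hypothetical helper, not the PR's actual code): read an unsigned-varint length, then that many little-endian magnitude bytes, rejecting trailing zero padding so every integer has exactly one encoding.

```rust
// Sketch of varint-length BigUint deserialization with a canonicality check.
fn read_biguint_le(input: &[u8]) -> Result<(Vec<u8>, &[u8]), &'static str> {
    // decode an unsigned LEB128 length
    let mut len: u64 = 0;
    let mut shift = 0u32;
    let mut i = 0usize;
    loop {
        let b = *input.get(i).ok_or("unexpected EOF in length")?;
        len |= u64::from(b & 0x7f) << shift;
        i += 1;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // read exactly `len` magnitude bytes
    let end = i + len as usize;
    let magnitude = input.get(i..end).ok_or("unexpected EOF in magnitude")?;
    // reject padded (non-canonical) encodings
    if magnitude.last() == Some(&0) {
        return Err("non-canonical: trailing zero byte");
    }
    Ok((magnitude.to_vec(), &input[end..]))
}

fn main() {
    let (mag, rest) = read_biguint_le(&[0]).unwrap(); // zero: empty magnitude
    assert!(mag.is_empty() && rest.is_empty());
    let (mag, _) = read_biguint_le(&[2, 0, 1]).unwrap(); // 256
    assert_eq!(mag, vec![0, 1]);
    assert!(read_biguint_le(&[2, 1, 0]).is_err()); // padded 1: rejected
}
```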

@matklad
Contributor

matklad commented Oct 14, 2022

👍on removing bigdecimal from this PR.

@frol
Collaborator

frol commented May 25, 2023

BigDecimal is opinionated and not defined as a borsh type yet, so I will close this PR for now, until we absolutely need it (same as #109).

@frol frol closed this May 25, 2023