[WIP] Use orjson in Python zygote #9715
Conversation
@@ -147,7 +151,9 @@ def worker_loop() -> None:
    # wait for a single line of input
    json_inp = sys.stdin.readline()
    # unpack the input line as JSON
    inp = json.loads(json_inp, parse_int=zu.safe_parse_int)
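For context on the hook being removed here: the stdlib `json.loads` accepts a `parse_int` callable that is invoked with the raw digit string of every integer literal. The real `zu.safe_parse_int` isn't shown in this diff, so the following is only a sketch of what such a hook plausibly does, assuming a JS-safe-integer range check:

```python
import json

# JS can only represent integers losslessly up to Number.MAX_SAFE_INTEGER.
MAX_SAFE_INTEGER = 2**53 - 1

def safe_parse_int(raw: str):
    # Hypothetical stand-in for zu.safe_parse_int: stdlib json calls this
    # with the literal digit string of each integer it encounters.
    value = int(raw)
    if -MAX_SAFE_INTEGER <= value <= MAX_SAFE_INTEGER:
        return value
    return float(value)  # too big to round-trip through JS; degrade to float

inp = json.loads('{"big": 9007199254740993}', parse_int=safe_parse_int)
print(type(inp["big"]))  # <class 'float'>, since 2**53 + 1 is out of range
```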
`orjson` doesn't have customizable parsers that would allow us to retain our `safe_parse_int` behavior. Is this a dealbreaker?

If this is absolutely necessary, we could instead do this in a postprocessing step where we consider every value and, if it's an integer that fails our `is_int_json_serializable` check, replace it with `float(value)`. However, a similar operation (`assert_all_integers_within_limits`) was actually a pretty large bottleneck, and it was great to be able to remove it.
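For illustration, a minimal sketch of that postprocessing idea, assuming the check `is_int_json_serializable` performs is the JS-safe-integer one (the real helper isn't shown here). As noted, a full traversal like this has the same cost profile that made `assert_all_integers_within_limits` a bottleneck:

```python
MAX_SAFE_INTEGER = 2**53 - 1  # assumed limit behind is_int_json_serializable

def coerce_unsafe_ints(value):
    """Recursively replace out-of-range ints in a decoded JSON structure."""
    if isinstance(value, bool):
        return value  # bool is a subclass of int; leave it untouched
    if isinstance(value, int):
        if -MAX_SAFE_INTEGER <= value <= MAX_SAFE_INTEGER:
            return value
        return float(value)
    if isinstance(value, list):
        return [coerce_unsafe_ints(v) for v in value]
    if isinstance(value, dict):
        return {k: coerce_unsafe_ints(v) for k, v in value.items()}
    return value
```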
It seems like `orjson` might throw an exception if it runs into an oversized int: https://github.com/ijl/orjson?tab=readme-ov-file#int

EDIT: So actually, it seems like this replicates `assert_all_integers_within_limits` for free! You should even be able to throw it into the old test cases and see if it works.

EDIT2: Jk, I think I misread this. But doing this as pre- or post-processing might not be that bad. Is there a way to handle this during serialization on the JavaScript side?
Yeah, I'm already taking advantage of the "strict int" behavior for serialization, but it doesn't help us on deserialization, where we want to coerce "too big" ints into floats.

I started thinking about doing it on the JS side, but that wouldn't help us, because JS can't losslessly represent those ints, which is the root of the problem! This raises the question, though: how would we ever find ourselves in a situation where we're trying to parse one of these too-large integers? The only way they could get into params and such is by coming out of the Python worker, but we already forbid that, right?

I suppose these JSON objects can also be modified via CSV uploads, but those also flow through JS. Maybe we just need some extra validation there?

I should look back at our original notes and PR to see why we felt we needed this special parsing behavior in the first place. Surely we thought about all this before?
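For reference, the "strict int" serialization behavior mentioned above is `orjson`'s `OPT_STRICT_INTEGER` flag, which makes `dumps()` raise for integers outside the 53-bit range. A quick demonstration, with values chosen to sit clearly inside and outside that range:

```python
import orjson

# Within the 53-bit range: serializes normally.
orjson.dumps({"n": 2**50}, option=orjson.OPT_STRICT_INTEGER)

# Outside the 53-bit range: raises instead of emitting an int that
# JavaScript would silently round.
try:
    orjson.dumps({"n": 2**60}, option=orjson.OPT_STRICT_INTEGER)
except TypeError:
    pass  # plain dumps() without the flag would accept this (64-bit limit)
```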
> I should look back at our original notes and PR to see why we felt we needed this special parsing behavior in the first place. Surely we thought about all this before?
Yeah, IIRC there's detailed discussion about these issues, and my recall is definitely flawed.
As discussed in the dev meeting, we do need this behavior, and here's why: #8462 (comment)
I opened an issue to see if they'd consider making this opt-in native behavior: https://redirect.github.com/ijl/orjson/issues/473
    try:
        zu.assert_all_integers_within_limits(obj)
        return json.dumps(obj, sort_keys=sort_keys, allow_nan=allow_nan)
Change in behavior: `orjson` doesn't support throwing if `NaN` is encountered; instead, it just serializes it to `null`. This is consistent with how `JSON.stringify(NaN)` behaves in JavaScript, so I think this is a reasonable change. Just calling this out, though.
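A small demonstration of the difference:

```python
import json
import orjson

# The stdlib raises when allow_nan=False...
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError:
    pass  # "Out of range float values are not JSON compliant"

# ...while orjson serializes NaN (and +/-Infinity) as null, matching
# JSON.stringify(NaN) in JavaScript.
assert orjson.dumps(float("nan")) == b"null"
```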
    try:
        zu.assert_all_integers_within_limits(obj)
        return json.dumps(obj, sort_keys=sort_keys, allow_nan=allow_nan)
    dump_start = time.time()
Suggested change:
- dump_start = time.time()
+ dump_start = time.perf_counter()
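The reason for this suggestion: `time.time()` is wall-clock time and can jump when the system clock is adjusted, while `time.perf_counter()` is monotonic with the highest available resolution, so it's the right tool for measuring elapsed durations. A minimal sketch:

```python
import time

def do_work() -> None:
    sum(range(1_000_000))  # stand-in for the real dump/parse workload

start = time.perf_counter()  # monotonic; unaffected by system clock changes
do_work()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"took {elapsed_ms:.1f}ms")
```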
I never got a response from the orjson maintainer on that issue.
Recent benchmarking showed that Python's native `json` module, specifically `json.loads`, was a pretty significant bottleneck for questions that handle megabytes of data in `prepare`/`parse`/`grade`. `orjson` appears to be about an order of magnitude faster in these cases, parsing in 10ms instead of 100ms. We also benefit from built-in handling of "too large" integers, which saves us another 100ms or so before parsing.

I temporarily added logs with timing information and modified the calling code to print them out instead of producing errors, which has streamlined testing of different options. This should be removed before merging.
appears to be about an order of magnitude faster in these cases, parsing in 10ms instead of 100ms. We also benefit from built-in handling of "too large" integers, which saves us another 100ms or so before parsing.I temporarily added logs with timing information and modified the code caller to print them out instead of producing errors, which has streamlined testing of different options. This should be removed before merging.