perf(parser): use faster string parser methods #8227

sno2 · 2023-10-25T20:41:51Z

Summary

This makes use of memchr and other methods to parse the strings (hopefully) faster. It might also be worth converting the parse_fstring_middle helper to use similar techniques, but I did not implement it in this PR.
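
As a rough illustration of the idea (not the code in this PR), the sketch below copies runs of ordinary characters in bulk by jumping to the next backslash, instead of pushing one `char` at a time. It uses `str::find` from the standard library; a byte search from the `memchr` crate could fill the same role. The function name `unescape_simple` and the small set of escapes it handles are assumptions made for this example.

```rust
/// Hypothetical helper: decodes `\n`, `\t` and `\\` and leaves every other
/// escape untouched. Real Python string parsing handles many more cases.
fn unescape_simple(source: &str) -> Result<String, String> {
    let mut out = String::with_capacity(source.len());
    let mut rest = source;

    // Jump straight to the next backslash instead of inspecting every char.
    while let Some(index) = rest.find('\\') {
        // Everything before the backslash is copied in a single call.
        out.push_str(&rest[..index]);

        // Look at the character that follows the backslash.
        let mut chars = rest[index + 1..].chars();
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('\\') => out.push('\\'),
            Some(other) => {
                // Unrecognized escape: keep the backslash and the character.
                out.push('\\');
                out.push(other);
            }
            None => return Err("trailing backslash".to_string()),
        }
        // `as_str` yields the remainder after the escape, which is always on
        // a char boundary even when `other` is a multi-byte character.
        rest = chars.as_str();
    }

    // No more backslashes: the tail is copied in one go.
    out.push_str(rest);
    Ok(out)
}
```

Taking the remainder from the `chars` iterator rather than from a hand-computed byte offset is one way to avoid slicing in the middle of a multi-byte character after an escape.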

Test Plan

This was tested using the existing tests and passed all of them.

Parsing strings with memchr unfortunately introduces one use of `unsafe`: it creates the string passed to `u32::from_str_radix`, because every safe alternative I could find required far more code.

codspeed-hq · 2023-10-25T20:55:24Z

CodSpeed Performance Report

Merging #8227 will improve performance by 22.78%

Comparing `sno2:perf/parser-string` (28b823f) with `main` (9792b15)

Summary

⚡ 5 improvements
✅ 20 untouched benchmarks

Benchmarks breakdown

| | Benchmark | `main` | `sno2:perf/parser-string` | Change |
|---|---|---|---|---|
| ⚡ | `linter/all-rules[numpy/globals.py]` | 4.1 ms | 3.9 ms | +6.14% |
| ⚡ | `parser[unicode/pypinyin.py]` | 4.1 ms | 3.9 ms | +4.83% |
| ⚡ | `parser[numpy/ctypeslib.py]` | 12 ms | 11.2 ms | +7.36% |
| ⚡ | `parser[numpy/globals.py]` | 1.3 ms | 1.1 ms | +22.78% |
| ⚡ | `linter/default-rules[numpy/globals.py]` | 2 ms | 1.7 ms | +15.48% |

github-actions · 2023-10-25T21:14:36Z

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

MichaReiser · 2023-10-26T00:19:25Z

Wow, that's amazing. We had it on our bucket list to rewrite the String parsing to use our Cursor implementation that is also used by the Lexer and should be easier to optimize by the compiler.

I hope to find some time soon to review this PR.

charliermarsh · 2023-10-26T00:35:29Z

This is really cool, thank you for putting this together.

sno2 · 2023-10-26T00:55:21Z

Thank you, I also noticed another panic while looking through the code so hopefully there won't be any more panics in here :)

MichaReiser

Excellent work! And it's good to see how much potential there still is to improve our parser.

I would prefer if we could split the memchr usage out of this PR and submit it as its own PR to better assess whether replacing find with memchr is worth it.

It would be nice if we could explore using Cursor for StringParser as part of another PR. Cursor is what we use in the Lexer and other places where we need to parse text. Using Cursor everywhere has the benefit that maintainers are familiar with it, simplifying code reviews and code maintenance.

crates/ruff_python_parser/src/string.rs

MichaReiser · 2023-10-27T01:33:57Z

-        if name.len() > MAX_UNICODE_NAME {


Could you explain why this check is no longer necessary? Is it because the optimisation is no longer necessary (or never was), since the operation above is so fast and unicode_names2::character handles it for us?

sno2 · 2023-10-27

It seemed like a code smell to me: I did not understand why we should optimize for a failure state as obscure as a unicode escape name longer than 80 characters.

dhruvmanila · 2023-10-27

For reference, the relevant issue: RustPython/RustPython#3798

The constant value is now publicly available: https://github.com/progval/unicode_names2/blob/22759d0e725a4c253e401dd8a5edf6d200008299/generator/src/lib.rs#L340, so the following should work.

use unicode_names2::MAX_NAME_LENGTH;

charliermarsh · 2023-10-28

I'd suggest we re-add it. It costs us very little (nothing?) and gives us an error rather than a panic, if I understand this conversation correctly.

sno2 · 2023-10-28

I have validated that the error no longer occurs by testing the previous reproduction. The issue was fixed in the crate in progval/unicode_names2@9404fb6 (note that the fix is included in the 1.2.0 tag that we are using).

I did not realize that the motivation was to fix a previous panic in the crate and not a performance trick. Given that, are we fine with not adding the magic constants back?

charliermarsh · 2023-10-28

Excellent -- thank you for testing this.
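
For readers following this thread, the guard being discussed has roughly the shape below. This is only a sketch: the `max_len` parameter stands in for the removed `MAX_UNICODE_NAME` constant, the `lookup` parameter stands in for `unicode_names2::character`, and the function name and error type are made up for the example.

```rust
/// Resolves the name inside a `\N{...}` escape, rejecting over-long names
/// before the lookup runs. All names here are illustrative placeholders.
fn resolve_named_escape(
    name: &str,
    max_len: usize,
    lookup: impl Fn(&str) -> Option<char>,
) -> Result<char, String> {
    // Early exit: an over-long name becomes an ordinary error, and the
    // lookup function is never called with it.
    if name.len() > max_len {
        return Err(format!("unicode name too long: {} bytes", name.len()));
    }
    // Names within the limit are passed on; an unknown name is also an error.
    lookup(name).ok_or_else(|| format!("unknown unicode name: {name}"))
}
```

With the panic fixed in the crate itself, the early exit only changes which error message an over-long name produces, which is consistent with the conclusion above that the check can stay removed.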

sno2 · 2023-10-27T02:41:41Z

> It would be nice if we could explore using Cursor for StringParser as part of another PR. Cursor is what we use in the Lexer and other places where we need to parse text. Using Cursor everywhere has the benefit that maintainers are familiar with it, simplifying code reviews and code maintenance.

Agreed. I believe we could use this technique fairly cleanly in both the lexer and parser with something like Cursor::eat_until_byte{1,2}, which returns an Option<&str>.
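
A minimal sketch of what such helpers could look like. The `Cursor` type here is a stand-in written for this example, not Ruff's actual `Cursor`, and `eat_until_byte1` / `eat_until_byte2` are hypothetical methods named after the suggestion above. The search uses a plain iterator from the standard library; a `memchr`-style byte search could be swapped in.

```rust
/// Stand-in cursor over the not-yet-parsed remainder of the source.
struct Cursor<'a> {
    rest: &'a str,
}

impl<'a> Cursor<'a> {
    fn new(source: &'a str) -> Self {
        Self { rest: source }
    }

    /// Consumes and returns the text before the first occurrence of the
    /// ASCII byte `needle`, leaving the cursor positioned on that byte.
    /// Returns `None` and consumes nothing if the byte does not occur.
    fn eat_until_byte1(&mut self, needle: u8) -> Option<&'a str> {
        debug_assert!(needle.is_ascii());
        self.eat_until(|byte| byte == needle)
    }

    /// Like `eat_until_byte1`, but stops at whichever of two bytes comes first.
    fn eat_until_byte2(&mut self, a: u8, b: u8) -> Option<&'a str> {
        debug_assert!(a.is_ascii() && b.is_ascii());
        self.eat_until(|byte| byte == a || byte == b)
    }

    fn eat_until(&mut self, is_match: impl Fn(u8) -> bool) -> Option<&'a str> {
        let index = self.rest.bytes().position(is_match)?;
        // An ASCII byte never occurs inside a multi-byte UTF-8 sequence, so
        // splitting at its position is always on a char boundary.
        let (eaten, rest) = self.rest.split_at(index);
        self.rest = rest;
        Some(eaten)
    }
}
```

Restricting the needles to ASCII is what makes the byte-level search safe to combine with `&str` slicing; a non-ASCII needle could land in the middle of a character and make `split_at` panic.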

dhruvmanila

Wow, this is pretty neat. Thanks for doing this!

crates/ruff_python_parser/src/string.rs

dhruvmanila · 2023-10-27T09:19:01Z

-        if name.len() > MAX_UNICODE_NAME {


For reference, the relevant issue: RustPython/RustPython#3798

The constant value is now publicly available: https://github.com/progval/unicode_names2/blob/22759d0e725a4c253e401dd8a5edf6d200008299/generator/src/lib.rs#L340, so the following should work.

use unicode_names2::MAX_NAME_LENGTH;

sno2 · 2023-10-27T13:47:05Z

@dhruvmanila MAX_NAME_LENGTH is public in the generated file, but the file that uses the constants does not mark them as public:

https://github.com/progval/unicode_names2/blob/22759d0e725a4c253e401dd8a5edf6d200008299/src/lib.rs#L70-L72

Would you like me to copy the constant back into our source?

(The reply box is not underneath your response for some reason.)

charliermarsh · 2023-10-28T22:51:05Z

Thanks @sno2, great to have you contributing!

While the usage looks correct, the use of `unsafe` here does not seem justified to me. Namely, it's already doing integer parsing. And perhaps most importantly, this is for parsing octal literals, which are likely rare enough not to have a major impact on perf. (And it's not like UTF-8 validation is slow.) This was originally introduced in #8227, and it doesn't look like unchecked string conversion was the main point there.
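
One way to avoid the unchecked conversion altogether is to fold the octal digits into an integer directly, so that no intermediate string is needed for `u32::from_str_radix`. The sketch below is an assumption about how that could look; `parse_octal_escape` is a hypothetical helper, not the function in `string.rs`.

```rust
/// Parses up to three octal digits from the start of `input` (the text
/// right after a backslash). Returns the decoded character and the number
/// of bytes consumed, or `None` if `input` does not start with an octal digit.
fn parse_octal_escape(input: &str) -> Option<(char, usize)> {
    let mut value: u32 = 0;
    let mut consumed = 0;

    // Octal digits are ASCII, so working on bytes is safe here.
    for byte in input.bytes().take(3) {
        match byte {
            b'0'..=b'7' => {
                value = value * 8 + u32::from(byte - b'0');
                consumed += 1;
            }
            _ => break,
        }
    }

    if consumed == 0 {
        return None;
    }
    // Three octal digits give at most 0o777 (511), which is a valid scalar
    // value, so this conversion succeeds for every input reaching this point.
    char::from_u32(value).map(|c| (c, consumed))
}
```

If an intermediate string is preferred, the checked `std::str::from_utf8` followed by `u32::from_str_radix(digits, 8)` is the safe counterpart of the unchecked conversion.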

fix escape followed by unicode character

28eba86

sno2 marked this pull request as ready for review October 25, 2023 21:10

fix panic when unicode name starts with multi-byte UTF-8 characters

05acb28

MichaReiser added the parser Related to the parser label Oct 27, 2023

MichaReiser requested a review from dhruvmanila October 27, 2023 01:25

MichaReiser approved these changes Oct 27, 2023

push to string after error handling

af7ef4e

use find() instead of memchr

c577879

dhruvmanila approved these changes Oct 27, 2023

dhruvmanila added the performance Potential performance improvement label Oct 27, 2023

Update crates/ruff_python_parser/src/string.rs

28b823f

Co-authored-by: Dhruv Manilawala <dhruvmanila@gmail.com>

charliermarsh merged commit 2f5734d into astral-sh:main Oct 28, 2023
17 checks passed

sno2 deleted the perf/parser-string branch October 28, 2023 23:01

miccal mentioned this pull request Nov 3, 2023

ruff 0.1.4 Homebrew/homebrew-core#153286

Merged
