Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add configurability for unicode and skipping parsers #14

Merged
merged 3 commits into from Jul 23, 2023

Conversation

hamiltop
Copy link
Contributor

@hamiltop hamiltop commented Jul 9, 2023

Two main changes

  • Allow disabling unicode support. Unicode support greatly increases memory and latency and is not generally needed.
  • Allow skipping specific parser categories when loading from regex yaml. Compiling regex for unused parsers is a pretty big waste of memory.

Both unicode support and device/os/user_agent parsers can now be configured via UserAgentParserBuilder.

I tried very hard to keep this backwards compatible, but if we're willing to break backwards compatibility we can take these in a different direction.

- Unicode support greatly increases memory and latency and is not
generally needed.
- Compiling regex for unused parsers is just a waste of memory.

Both unicode support and device/os/user_agent parsers can now be
configured via `UserAgentParserBuilder`.
@hamiltop
Copy link
Contributor Author

hamiltop commented Jul 9, 2023

I dug in a bit after this discussion about memory usage: rust-lang/regex#1027 (comment)

I took at a look at other uaparser implementations and none of them are done with full unicode support. The uaparser-core test suite passes without it just fine. So I moved everything to use regex::bytes::Regex and make unicode a configuration option.

Additionally, we at Remind are only interested in OS for our use case. What we've been doing has been to take the core regex yaml file and delete the Device and User Agent sections, but it would be nice to not have to curate that ourselves and instead just pass in flags to only parse and use the os parsers. So I added the ability to disable any of os, device, and user_agent parsers to save on memory.

Some performance numbers with these changes:

First, memory usage. full_parser has all the parsers and includes unicode support.no_unicode_parser is all the parsers with unicode disabled. os_only_no_unicode_parser skips device and user_agent parsers and disables unicode (this is how we would use it at Remind).

➜  uap-rs git:(main) ✗ /usr/bin/time -l target/debug/examples/full_parser
Client { device: Device { family: "Mac", brand: Some("Apple"), model: Some("Mac") }, os: OS { family: "Mac OS X", major: Some("10"), minor: Some("15"), patch: Some("7"), patch_minor: None }, user_agent: UserAgent { family: "Chrome", major: Some("114"), minor: Some("0"), patch: Some("0") } }
        5.48 real         5.40 user         0.07 sys
           296796160  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               18210  page reclaims
                  11  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                 101  involuntary context switches
         54894775582  instructions retired
         18604917601  cycles elapsed
           282281600  peak memory footprint
➜  uap-rs git:(main) ✗ /usr/bin/time -l target/debug/examples/no_unicode_parser
Client { device: Device { family: "Mac", brand: Some("Apple"), model: Some("Mac") }, os: OS { family: "Mac OS X", major: Some("10"), minor: Some("15"), patch: Some("7"), patch_minor: None }, user_agent: UserAgent { family: "Chrome", major: Some("114"), minor: Some("0"), patch: Some("0") } }
        1.01 real         0.99 user         0.01 sys
            63062016  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                3949  page reclaims
                   6  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  33  involuntary context switches
         10472472398  instructions retired
          3381014760  cycles elapsed
            59262016  peak memory footprint
➜  uap-rs git:(main) ✗ /usr/bin/time -l target/debug/examples/os_only_no_unicode_parser
Client { device: Device { family: "Other", brand: None, model: None }, os: OS { family: "Mac OS X", major: Some("10"), minor: Some("15"), patch: Some("7"), patch_minor: None }, user_agent: UserAgent { family: "Other", major: None, minor: None, patch: None } }
        0.16 real         0.16 user         0.00 sys
            10469376  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 740  page reclaims
                   5  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                   9  involuntary context switches
          1548425921  instructions retired
           497911781  cycles elapsed
             6668992  peak memory footprint

Disabling Unicode takes us from 300MB to 60MB, and only doing the OS parsers takes that all the way to 10MB.

Beyond memory, the cargo bench results are also nice to see:

parse_device            time:   [578.92 ms 581.72 ms 584.82 ms]                         
                        change: [-0.9028% +0.0193% +1.0401%] (p = 0.99 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

parse_device unicode disabled                                                                            
                        time:   [496.35 ms 498.60 ms 500.98 ms]
                        change: [+0.0428% +1.2668% +2.4227%] (p = 0.05 > 0.05)
                        No change in performance detected.

parse_os                time:   [2.3230 ms 2.3356 ms 2.3488 ms]                     
                        change: [-2.9222% -1.7137% -0.4659%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

parse_os unicode disabled                                                                            
                        time:   [2.3017 ms 2.3118 ms 2.3228 ms]
                        change: [-0.5476% +0.1792% +0.9374%] (p = 0.66 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

parse_user_agent        time:   [174.95 ms 175.33 ms 175.73 ms]                              
                        change: [-1.2574% -0.5656% +0.1207%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

parse_user_agent unicode disabled                                                                             
                        time:   [56.693 ms 56.998 ms 57.326 ms]
                        change: [-2.7180% -1.5820% -0.3546%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

A huge speedup on the user_agent parsing (57ms vs 175ms), a moderate speedup on device parsing (498 vs 581), and not much change on OS.

I left the unicode support in as the default to make this a non-breaking change, but honestly it's just not needed and I'd recommend we rip it out and just make everything not support unicode.

Anyway, let me know what you think.

captures.expand(raw_replacement, &mut target);
std::str::from_utf8(&target)
.map(|s| Cow::Owned(s.trim().to_owned()))
// What is the behavior if we can't parse a string???
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only big question mark in behavior. If a replacement has a group and the result of the capture expands into a &[u8] that can't be turned into a string, then we have to do something. Rather than panic or try and surface this, I chose to just skip the replacement.

This may never happen. It likely won't. I'm not sure if a panic is better or worse here and am very open to feedback.

@hamiltop
Copy link
Contributor Author

hamiltop commented Jul 9, 2023

This example is a good example of how the new builder works and the configuration available https://github.com/davidarmstronglewis/uap-rs/pull/14/files#diff-f2849e4e13ac1633666e5ef659e790e06feb21fa5848bf7268b80e7e7de39a94

user_agent: bool,
unicode: bool,
) -> Result<UserAgentParser, Error> {
let device_matchers = if device {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explored a few options on how to disable/enable a specific class of parser. Using crate features was neat, but felt overkill for something that could just be provided as config. The most correct approach is probably making the Vec<device::Matcher> an Option<Vec<device::Matcher>> and making Client have optional fields for all the matches, but that was a bigger footprint and I wasn't sure if that was worth it. An empty Vec is what we're getting now with our stripped down regex yaml file and it works well enough.

@oceanlewis
Copy link
Owner

Hey @hamiltop, thank you for the lovely PR and for linking out to context around these changes. Solid change, my only feedback would be to hide the new UserAgentParserBuilder behind a builder method rather than exporting the type directly. In case you've moved on to something else I've pushed up this branch which does just that (and updates some Nix stuff; which you can ignore).

Once this is in I'll cut a new release. Thanks again ❤️

@hamiltop
Copy link
Contributor Author

Great. Updated from your branch and then fixed up a few tests to use the new syntax. Should be g2g here.

Comment on lines 3 to 10
// fn foo() {
// /// ```rust
// /// # use uaparser::*;
// let _parser = UserAgentParser::builder()
// .unicode(false)
// .user_agent()
// /// ```
// }
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I think I left this in here erroneously.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with a fixup commit and rebased.

@hamiltop hamiltop force-pushed the disable_unicode_and_more_config branch from 2216e6c to d7a94b3 Compare July 20, 2023 20:38
Copy link
Owner

@oceanlewis oceanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I'll merge this and cut a new release later tonight.
Thank you for the contribution ❤️

@oceanlewis oceanlewis merged commit 71a5f66 into oceanlewis:main Jul 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants