
compose: add a cache for parsed files #220

Open · wants to merge 2 commits into master from compose-cache
Conversation

@linkmauve (Contributor) commented Mar 21, 2021

Parsing /usr/share/X11/locale/en_US.UTF-8/Compose takes about 21% of the startup of GLFW on my PinePhone according to perf, this patch lowers it by about 33ms down to approximately 2%.

This adds an optional dependency on xxhash; its BSD 2-Clause license is compatible, and it was the fastest non-cryptographic hash I could find.

I went for a fopen()/fread() approach in order not to change the internal structs in any way, but with some simple changes we could do a mmap() and avoid the reading cost altogether, additionally letting multiple programs share a single read-only memory page.
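For readers following along, the idea can be sketched roughly like this (hypothetical names, not the PR's actual code; a self-contained FNV-1a stands in for xxhash's XXH64 so the snippet builds without the dependency):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for XXH64(): FNV-1a, another fast non-cryptographic hash. */
static uint64_t
hash64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Derive a cache file name from the Compose source text: if the source
 * changes, the digest changes, so stale entries are simply never looked
 * up again (hypothetical cache directory layout). */
static void
compose_cache_path(const char *src, size_t len, char *out, size_t outsz)
{
    const char *home = getenv("HOME");
    snprintf(out, outsz, "%s/.cache/xkb-compose/%016llx",
             home ? home : "/tmp",
             (unsigned long long) hash64(src, len));
}
```

On a cache hit the serialized tables would be fread() back in place of parsing; on a miss the file is parsed as usual and the result written out for next time.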

Signed-off-by: Emmanuel Gil Peyrot linkmauve@linkmauve.fr

@linkmauve linkmauve force-pushed the compose-cache branch 8 times, most recently from fcf7de4 to 564a25f Compare March 21, 2021 01:32
Parsing /usr/share/X11/locale/en_US.UTF-8/Compose takes about 21% of the
startup of GLFW on my PinePhone according to perf, this patch lowers it
by about 33ms down to approximately 2%.

This adds an optional dependency on xxhash.

Signed-off-by: Emmanuel Gil Peyrot <linkmauve@linkmauve.fr>
@bluetech (Member) commented:

> Parsing /usr/share/X11/locale/en_US.UTF-8/Compose takes about 21% of the startup of GLFW on my PinePhone according to perf, this patch lowers it by about 33ms down to approximately 2%.

Interesting. I would definitely want this to work well on PinePhone and these numbers do sound like a lot.

Some historical context: the original libX11 Compose implementation does in fact have a cache (~/.compose-cache/XCOMPOSECACHE, code is here). The libxkbcommon implementation is new, and I explicitly chose not to add a cache, for the following reasons:

  • Complicates the code
  • Cache invalidation trouble
  • Filesystem interaction trouble
  • Still doesn't help for the first run

In that spirit, I'd like to first see if we can make the Compose parsing itself fast enough so that a cache is not necessary.

I had a quick look now and committed 7d84809 which brings down the timing from 8ms to 5.3ms on my old laptop, according to bench/compose. I think we can optimize it more.

I'd be interested to see the perf output from the phone, as well as the absolute timings. Maybe something is particularly slow there.

@linkmauve (Contributor, Author) commented Mar 28, 2021

Hi, and thanks for your interest! I just profiled again: before any optimisation (on 95e2907), xkb_compose_table_new_from_locale() took 38~43ms (I didn’t change anything in the scheduler or frequency scaling of the phone, hence the fairly large variance); after them (7d84809), 35~39ms; and with the cache from this PR, ~3.4ms.

I’ve attached the perf and flamegraph of three of these runs: perf+flamegraph.tar.gz

It doesn’t seem like the case-sensitive optimisation had such a big effect on AArch64. I’ll check whether glibc has the same optimisation there as on x86, since it’s a bit odd that there was such a big change on your laptop and almost none on my phone.

As for the issues with the cache, we could look at how other libraries such as Mesa handle them; I’d trust them on that. :)

Note that xkb_keymap_new_from_string() also takes 15~18ms and would be my next target for caching.

Edit: it seems glibc has no asm whatsoever for strcmp, it’s all relying on the compiler doing a good job at autovectorisation.

@bluetech (Member) commented:

> xkb_compose_table_new_from_locale() used to take 38~43ms (I didn’t change anything in the scheduler or frequency scaling of the phone, hence the quite big variance), after them (7d84809) 35~39ms, and with the cache from this PR ~3.4ms.

Hmm, pretty interesting that it had so little effect.

Unfortunately, when I try `perf report -f` on the perf files, all the symbols are missing; maybe it needs the original binaries, I'm not sure how that works. So I can't quite see the details there.

I refreshed my memory on the code a bit more. Basically, at least on my machines, the time is mostly split up between these 3 parts:

  1. Lexing -- I don't have any simple ideas on how to optimize that at the moment.

  2. Keysym name to keysym value mapping -- the commit in master was a first shot but apparently not enough. Currently it is using a binary search which does a lot of strcmps. I changed it to use a perfect hash in this branch: https://github.com/bluetech/libxkbcommon/commits/perfect-hash (last commit, not polished but sufficient as a POC). This does just a hash+one strcmp.

  3. Parsing and constructing the data structure -- currently using a simple trie with linear search for each chain. The chains can get pretty long. I changed it to use a ternary search tree here: https://github.com/bluetech/libxkbcommon/commits/ternary-tree. Also unfinished but good enough for testing.
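For illustration, the ternary-search-tree shape described in point 3 might look something like this (a minimal sketch with hypothetical names, not the code on that branch): each node holds one keysym of a Compose sequence, `lo`/`hi` point to siblings, and `eq` descends to the next element of the sequence, so a lookup does one three-way comparison per node instead of a linear scan of a whole chain.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t xkb_keysym_t;

struct tst_node {
    xkb_keysym_t sym;
    struct tst_node *lo, *eq, *hi;
    int result;               /* >= 0 when a sequence ends at this node */
};

static struct tst_node *
tst_insert(struct tst_node *n, const xkb_keysym_t *seq, size_t len, int result)
{
    if (!n) {
        n = calloc(1, sizeof(*n));
        n->sym = seq[0];
        n->result = -1;
    }
    if (seq[0] < n->sym)
        n->lo = tst_insert(n->lo, seq, len, result);
    else if (seq[0] > n->sym)
        n->hi = tst_insert(n->hi, seq, len, result);
    else if (len > 1)
        n->eq = tst_insert(n->eq, seq + 1, len - 1, result);
    else
        n->result = result;
    return n;
}

static int
tst_lookup(const struct tst_node *n, const xkb_keysym_t *seq, size_t len)
{
    while (n) {
        if (seq[0] < n->sym)
            n = n->lo;
        else if (seq[0] > n->sym)
            n = n->hi;
        else if (len > 1) {
            n = n->eq;
            seq++;
            len--;
        } else
            return n->result;   /* -1 if no sequence ends here */
    }
    return -1;
}
```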

I'd be curious to know how it does with these two patches applied. (Also I assume you're using an optimized build, maybe also LTO).
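The "hash + one strcmp" lookup from point 2 can be sketched like so (illustrative only: the linked branch generates a true perfect hash offline, while this stand-alone sketch brute-forces a collision-free table size at startup to get the same lookup shape):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct entry { const char *name; uint32_t keysym; };

/* A few keysym names for illustration; the real table has thousands. */
static const struct entry keys[] = {
    { "space", 0x0020 }, { "plus", 0x002b }, { "minus", 0x002d },
};
#define NKEYS (sizeof(keys) / sizeof(keys[0]))

static uint32_t
hash_name(const char *s)
{
    uint32_t h = 2166136261u;               /* 32-bit FNV-1a */
    for (; *s; s++) { h ^= (unsigned char) *s; h *= 16777619u; }
    return h;
}

static const struct entry *table[64];
static size_t table_size;

/* Try table sizes until every key lands in its own slot; a real perfect
 * hash generator would emit this table at build time instead. */
static void
build_table(void)
{
    for (size_t n = NKEYS; n <= 64; n++) {
        size_t i;
        memset(table, 0, sizeof(table));
        for (i = 0; i < NKEYS; i++) {
            size_t slot = hash_name(keys[i].name) % n;
            if (table[slot])
                break;                      /* collision: try a bigger n */
            table[slot] = &keys[i];
        }
        if (i == NKEYS) {
            table_size = n;
            return;
        }
    }
}

/* One hash, at most one strcmp. */
static uint32_t
lookup(const char *name)
{
    const struct entry *e = table[hash_name(name) % table_size];
    if (e && strcmp(e->name, name) == 0)
        return e->keysym;
    return 0;                               /* not found */
}
```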

> Note that xkb_keymap_new_from_string() also takes 15~18ms and would be my next target for caching.

Amusingly it was the same thing for the keymaps -- the original xkbcomp has a cache, we got rid of it.

> Edit: it seems glibc has no asm whatsoever for strcmp, it’s all relying on the compiler doing a good job at autovectorisation.

glibc is rather unreadable but the ASM for arm64 is found here:

https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/aarch64/strcmp.S;h=f225d718f4bef0781896d7331b3c4e6667df135c;hb=HEAD

If for some reason it's not being used, that could explain some of the slowness.

@linkmauve (Contributor, Author) commented Apr 2, 2021

I’ve just tested your latest improvements from master on my phone, they bring it down from ~34ms to ~19ms, nice!

As for perf not giving you relevant function names, do you want me to send you the DWARF data from my built library?

I checked and the strcmp() I’m running is indeed from that one AArch64 source, so no worries on that end (that’d have been surprising!).

@bluetech (Member) commented Apr 2, 2021

> I’ve just tested your latest improvements from master on my phone, they bring it down from ~34ms to ~19ms, nice!

For me it's down to 2.8ms so I hoped it would be better, but still not bad. Thanks for testing again.

> As for perf not giving you relevant function names, do you want me to send you the DWARF data from my built library?

Sure, maybe that will do the trick. I'm still curious what is slowing things down: cache misses, branch mispredictions, memory fetches, or just plain CPU time (if I'm reading the right place, the PinePhone has a 1.2 GHz CPU). The output of `perf stat -ddd ./build/bench-compose` would also be good for comparison.

If possible, maybe also a callgrind output file? (I don't know if that's portable; hopefully yes.)

> I checked and the strcmp() I’m running is indeed from that one AArch64 source, so no worries on that end (that’d have been surprising!).

Great (even if now strcmp is not called as much...)

@wismill added the `enhancement` (new feature request) and `compose` (Compose handling) labels May 14, 2023