emailbook.janet (including benchmark) #1419

MaxGyver83 · 2024-03-01T13:24:45Z

Hi all,

this is my first real Janet project: emailbook-janet: A minimalistic address book for e-mails only (mainly for aerc)

Background

I use aerc for e-mails. aerc doesn't have an address book or auto-completion for e-mail addresses built in. But you can configure an external tool for this job. When I searched for such a tool, I found and picked aercbook: Minimalistic address book for aerc. aercbook does a great job but it doesn't always behave the way I want. For example, it doesn't autocomplete e-mail addresses when there is a display name for this contact. (Of course, it inserts the e-mail address as well but you can't search for it.) This is why I first wrote a single shell script wrapper to grep the address book. Then I reimplemented everything as a (POSIX) shell script (with different behavior here and there): emailbook: A minimalistic address book for e-mails only (mainly for aerc). I'm happy with it. It works well and fast enough for what it was made for: parsing a single e-mail for e-mail addresses and filtering the address book.

But then I tried parsing thousands of old e-mails on my hard disk at once. That showed how slow the shell script implementation is. Then I thought this might be a good use case for Janet. I simply wanted to see how easy or hard it is to rewrite this in Janet in general and to replace all these regular expressions with parsing expression grammars (PEGs). And of course how well it would perform.

Benchmark

Parse 1000 e-mails using these tools:

aercbook (Zig)
emailbook-janet (Janet, compiled)
emailbook-janet (Janet, script)
emailbook (POSIX shell script using grep, sed and perl)

Two methods:

1. Pipe 1000 e-mails into 1000 instances of the tool.
2. Loop over all 1000 e-mails, extracting the headers (with sed), and pipe this stream into a single instance of the tool. (This does not work for aercbook because it stops after finding one To:, From:, ... header field.)
   UPDATE: In the case of emailbook-janet, this loop is the bottleneck. This is why I have added a new option --parse-files to read a list of filenames from stdin. emailbook-janet then opens these files directly. (This applies both to the script and the compiled version.)

Durations in seconds

	aercbook	emailbook-janet-bin	emailbook-janet	emailbook
1000 instances	0.9	5.0	13.1	94.9
1 instance	-	2.5	2.6	72.2

Updated numbers after applying the ideas from the comments and also improving the shell script emailbook:

	aercbook	emailbook-janet-bin	emailbook-janet	emailbook
1000 instances	0.9	4.4	12.4	66.1
1 instance	-	1.1	1.1	10.3

As you can see, Janet does a good job!

Review

If you are more experienced with Janet than I am (the chances for this are good), maybe you could have a look at my Janet script and tell me how I could do better.

Regarding the custom argument parsing: I'm planning to replace it with ianthehenry/cmd: command-line argument parser for Janet.

sogaiu · 2024-03-02T14:20:15Z

I was looking near the top of the emailbook.janet file and noticed:

(try (import ./deps/base64 :as base64)
  ([err fib] (import spork/base64)))
(try (use judge)
  ([err fib] (defn test [&] nil)))

I don't think that will work.

I tried putting the following in a file named sample.janet:

(try
  (use judge)
  ([_]
   (defn test [&] nil)))

(print test)

When I ran it with janet I got:

sample.janet:6:1: compile error: unknown symbol test

There might be a better way to address the judge case, but here's something that seems to work here:

(defn test [&] :a)

(protect (use judge))

(print (test :smile))

Maybe not the prettiest thing.

Perhaps @ianthehenry has a better idea.

2 replies

ianthehenry Mar 2, 2024

Yeah, defining test first and then conditionally shadowing it is smart. You probably want (defmacro test [&] nil) though because not all test-able forms can be evaluated as function arguments. You can also setdyn from within the try block, something like:

(setdyn 'test @{:private true :macro true :value (fn [&] nil)})

(I didn't test that so I'm not sure I got it right.)

Another idea:

(try (use judge)
  ([err fib] (eval '(defmacro test [&] nil))))

Because eval will evaluate it as if it were a top-level expression. Also did not test that though...

MaxGyver83 Mar 2, 2024
Author

Thank you! All your solutions work. I'll update my script.
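
A minimal sketch of the "define a fallback first, then conditionally shadow it" approach from the replies above. The module name `hypothetical-missing-module` is a placeholder that stands in for an optional dependency such as judge that is not installed; it is an assumption for illustration, not something from the script.

```janet
# Fallback: a macro that discards its arguments without evaluating them.
(defmacro test [&] nil)

# `hypothetical-missing-module` is a placeholder for an optional dependency
# that is not installed. `protect` catches the import error, so the
# fallback macro above stays in place.
(protect (use hypothetical-missing-module))

# Because `test` is a macro, its body is never evaluated.
(def result (test (error "never evaluated")))
(print "test expanded to: " (describe result))
```

If the module were installed and exported its own `test`, the `use` form would replace the fallback binding for all following top-level forms.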

ianthehenry · 2024-03-02T18:44:13Z

Hey cool! Some notes on the implementation:

You could use compound PEGs instead of unquoting components:

(def bracket-email
  ~{:main (sequence "<" :plain-email ">")
    :plain-email (sequence (some :address-char) "@" (some :address-char))
    :address-char (if-not (+ :s (set ",<>@\"'")) 1)})

For small examples this doesn't really matter, but with this approach you only compile the address-char once instead of twice.

Though I see that you dynamically create some PEGs at runtime that reference these, so maybe that's tricky? But also you might have a better time if you avoid dynamically creating PEGs.

e.g. your mailbox-entry-exists? function creates a new PEG and compiles it every time it's called. This might be expensive depending on how often it's called. You can precompile a single PEG that takes arguments, and use cmt to reference those arguments. Something like this:

(def peg (peg/compile ~(* "foo: " (cmt (* (argument 0) '(to -1)) ,=))))
(peg/match peg "foo: bar" 0 "bar")
(peg/match peg "foo: bar" 0 "baz")

Although that specific pattern is simple enough that not using a PEG would probably be faster. But maybe it would be useful elsewhere in the script? Though I kinda think moving value checks outside of the PEG would be even better. E.g. parse a string, then compare it against the expected mailbox in toggle-quotes rather than putting the mailbox literal in the PEG. This is maybe subjective.
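
A rough sketch of the "parse first, compare afterwards" idea, assuming a hypothetical address book line format of `<key>: <mailbox>`. The names `entry-peg`, `parse-entry` and `entry-has-mailbox?` are made up for illustration and do not come from emailbook.janet.

```janet
# Compiled once at the top level; the PEG only splits the line and
# contains no mailbox literal.
(def entry-peg
  (peg/compile ~(* '(to ": ") ": " '(to -1))))

# Returns @[key mailbox] or nil if the line has no ": " separator.
(defn parse-entry [line]
  (peg/match entry-peg line))

# The value check happens outside the PEG with plain `=`.
(defn entry-has-mailbox? [line mailbox]
  (if-let [[_ parsed] (parse-entry line)]
    (= parsed mailbox)
    false))

(print (entry-has-mailbox? "jd: John Doe <john@example.org>"
                           "John Doe <john@example.org>"))
(print (entry-has-mailbox? "jd: John Doe <john@example.org>"
                           "Jane Roe <jane@example.org>"))
```

The trade-off is that the generic PEG always scans the whole line, whereas a PEG with the mailbox baked in can fail on the first mismatching character.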

Running peg/compile at the top-level on all of the PEGs you reference at runtime might help reduce the runtime of the compiled version (also the interpreted version depending on how many times these are called). That way you compile the PEGs at compile time and can reference the precompiled PEG bytecode at runtime. You do this for e.g. some-spaces but can do it for display-name/bracket-email as well. You can also hoist pattern out of decode-utf8-base64 and precompile that... oh, but you use it twice. I'd do something like this instead:

(def pattern (peg/compile
  ~{:main (* :charset '(to "?=") "?=")
    :charset (* "=?" (+ "UTF" "utf") "-8?B?")}))
(defn decode-utf8-base64 [line]
  (peg/replace-all pattern (fn [_ bytes] (base64/decode bytes)) line))

Having read through the whole script I see that you have a lot of dynamically-constructed PEGs so maybe it might be tricky to precompile all of them. But if you wanted to try to optimize it I think that that would be a good place to look -- make PEGs just do parsing, and do value checking outside of that. This probably won't make much difference if you're running this as an interpreted script (though that depends on how many times you call those functions).

You use deep= in a few places when comparing strings. But strings in Janet have value-semantics, so you can just use = for them. Buffers have reference semantics, and need deep=. Maybe some of the deep=s actually operate on buffers? But e.g. the command-line parsing checks are all on strings.
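
A small illustration of the difference, using made-up values:

```janet
# Strings are immutable and compare by value, so `=` is enough.
(def s1 (string "foo" "bar"))
(def s2 "foobar")
(print (= s1 s2))

# Buffers are mutable and compare by identity with `=`;
# `deep=` compares their contents instead.
(def b1 (buffer "foobar"))
(def b2 (buffer "foobar"))
(print (= b1 b2))
(print (deep= b1 b2))
```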

Entirely subjective but:

(if quoted-mailbox
    (break (string (first quoted-mailbox) (second quoted-mailbox))))

Could be:

(if-let [[first second] quoted-mailbox]
    (break (string first second)))

And another entirely subjective thing:

(def mailbox-sanitized (sanitize (decode-iso8859-q (decode-utf8-q (decode-utf8-base64 mailbox)))))
(def mailbox-sanitized (-> mailbox decode-utf8-base64 decode-utf8-q decode-iso8859-q sanitize))
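
To show that the two forms are equivalent, here is a self-contained sketch with three stand-in functions (`strip`, `lower` and `bracket` are hypothetical and only play the role of the script's decode and sanitize steps):

```janet
(defn strip [s] (string/trim s))
(defn lower [s] (string/ascii-lower s))
(defn bracket [s] (string "<" s ">"))

# Nested calls read inside-out ...
(def nested (bracket (lower (strip "  John@Example.ORG "))))

# ... while `->` passes the value through each function in reading order.
(def threaded (-> "  John@Example.ORG " strip lower bracket))

(print nested)
(print (= nested threaded))
```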

I would expect that cmd will slightly slow down Janet compilation, which translates to runtime when you use this in the shebang style.

Meta thing but you could use GNU parallel to process files in batches -- spawning 1000 processes at once is probably going to be slower than spawning say 8 processes at a time (assuming an 8-core machine).

6 replies

MaxGyver83 Mar 3, 2024
Author

> You use deep= in a few places when comparing strings. But strings in Janet have value-semantics, so you can just use = for them. Buffers have reference semantics, and need deep=. Maybe some of the deep=s actually operate on buffers? But e.g. the command-line parsing checks are all on strings.

Thank you! I think replacing deep= with = was the biggest performance improvement.

> (def pattern (peg/compile
>   ~{:main (* :charset '(to "?=") "?=")
>     :charset (* "=?" (+ "UTF" "utf") "-8?B?")}))
> (defn decode-utf8-base64 [line]
>   (peg/replace-all pattern (fn [_ bytes] (base64/decode bytes)) line))

This wasn't faster but this function is much simpler. Applied!

> (def mailbox-sanitized (sanitize (decode-iso8859-q (decode-utf8-q (decode-utf8-base64 mailbox)))))
> (def mailbox-sanitized (-> mailbox decode-utf8-base64 decode-utf8-q decode-iso8859-q sanitize))

Applied!

> Having read through the whole script I see that you have a lot of dynamically-constructed PEGs so maybe it might be tricky to precompile all of them. But if you wanted to try to optimize it I think that that would be a good place to look -- make PEGs just do parsing, and do value checking outside of that. This probably won't make much difference if you're running this as an interpreted script (though that depends on how many times you call those functions).

I have tried this but it wasn't faster. My assumption: I save time by having to compile this generic PEG only once instead of 3000 times (1000 e-mails, maybe 3 e-mail addresses per mail). But then the pattern matching is slower because you always parse the input until the end and keep the match. When the pattern contains ": John Doe", you can abort early most of the time (when there isn't a J after the colon).

> Entirely subjective but:
>
> (if quoted-mailbox
>     (break (string (first quoted-mailbox) (second quoted-mailbox))))
>
> Could be:
>
> (if-let [[first second] quoted-mailbox]
>     (break (string first second)))

Yes, this code really looks clumsy. I went for this simplification instead:

(if quoted-mailbox
  (break (string/join quoted-mailbox)))

(Maybe I need to get used to if-let first :-) )

> Meta thing but you could use GNU parallel to process files in batches -- spawning 1000 processes at once is probably going to be slower than spawning say 8 processes at a time (assuming an 8-core machine).

Yes, this makes sense! (I don't want to do this for the benchmark but for the real life use case.)

Regarding your remaining ideas: I'll play around with them a bit more. But I don't expect big performance improvements anymore.

This is the result of your suggestions (so far):

Before:

	emailbook-janet-bin	emailbook-janet
1000 instances	5.0	13.1
1 instance	2.5	2.6

After:

	emailbook-janet-bin	emailbook-janet
1000 instances	4.7	12.6
1 instance	2.5	2.5

Thanks a lot. It's fun learning more about Janet.

ianthehenry Mar 3, 2024

How are you timing this? Averaging 2.5ms to parse an email header seems like a very long time to me, although I admit I don't totally get what the script is doing apart from this. But 1000 email headers just isn't that much data!

MaxGyver83 Mar 4, 2024
Author

I use this script:
Test script used to measure the performance of emailbook, emailbook-janet and aercbook

This is what emailbook-janet does (times for parsing 1000 e-mails in a single emailbook.janet instance):

Parse the headers → 1.93 seconds
+ sanitize the mailboxes (incl. decoding MIME encoded-word syntax) → still 1.93 s
+ write to address book → 1.93 s
+ check (and skip writing) if an entry already exists → 2.04 s
+ logging to stdout → 2.45 s

... I just tested something else:

for file in $(ls -1 "$mailfolder" | head -n $count); do
    sed '/^$/Q' "$mailfolder/$file"
done

This loop alone takes 1.93 seconds (with stdout redirected to /dev/null)!

The other loop (no sed, only cat, but 1 emailbook.janet instance per mail) takes only 1.7 seconds the first time and about 0.64 seconds when called again (I guess because of Linux' file caching!?).

Unfortunately, I can't use this faster loop with a single instance because emailbook.janet stops parsing at the first empty line.

MaxGyver83 Mar 4, 2024
Author

Maybe I should change this script to accept a list of files and read them directly...

MaxGyver83 Mar 4, 2024
Author

I have added a new option --parse-files to read a list of (e-mail) filenames from stdin (cc999f6).

In this case, the test is executed like this:

cd "$mailfolder"
ls | head -n 1000 | ~/repos/emailbook-janet/emailbook.janet /tmp/emailbook-janet.txt --parse-files --all

New results:

	emailbook-janet-bin	emailbook-janet
1 instance	1.3	1.3
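
One way such a mode could work is sketched below. This is not the code from commit cc999f6; `read-headers` and `parse-listed-files` are hypothetical names, and the sketch only assumes what the thread states: filenames arrive one per line, and parsing of each e-mail stops at the first empty line.

```janet
# Hypothetical helper: collect the header lines of one e-mail file,
# stopping at the first empty line (the end of the header section).
(defn read-headers [path]
  (def lines @[])
  # No error handling: this raises if the file cannot be opened.
  (with [f (file/open path :r)]
    (var line (file/read f :line))
    (while (and line (not (empty? (string/trim line))))
      # Strip only the trailing newline so folded header lines keep
      # their leading whitespace.
      (array/push lines (string/trimr line))
      (set line (file/read f :line))))
  lines)

# Read one filename per line from `input` and call `handle` with the
# header lines of each file.
(defn parse-listed-files [input handle]
  (loop [name :iterate (file/read input :line)]
    (handle (read-headers (string/trim name)))))
```

A call such as `(parse-listed-files stdin (fn [headers] (each h headers (print h))))` would then replace the external sed loop, so that no extra process is spawned per e-mail.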

MaxGyver83 · 2024-03-04T13:24:36Z

saikyun/janet-profiling helped me to identify further bottlenecks.
Commit 2ee1044 reduces the execution time from 1.3 to 1.1 seconds. When I use the script's --quiet flag (or >/dev/null), it takes only 0.86 seconds to process 1000 e-mails.
