parse-zoneinfo: replace rule parser with simple state machine #172

djc · 2024-04-15T21:28:46Z

The raw diffstat of +4957/-371 doesn't look so attractive, but this includes new tests (and accompanying data) that account for about 4400 of the added lines, so all in all this doesn't add that much more code than it deletes. The benchmark example suggests it is about 10x faster, and it drops a pretty big dependency.

pitdicker

Impressive you could write this in little time!

I wonder if it wouldn't take less code to initialize Rule with default values and update its fields, instead of moving all the fields along to the next variant in RuleState.
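To make the suggestion concrete, here is a minimal sketch of the "default values, then update fields" approach. `RuleSketch` and `parse_rule_sketch` are hypothetical names, not the crate's types, and every column is kept as a raw string slice instead of being parsed, so this only illustrates the control flow.

```rust
/// Hypothetical, simplified stand-in for the crate's `Rule`; every column is
/// kept as a raw string slice so the sketch stays self-contained.
#[derive(Debug, Default, PartialEq)]
struct RuleSketch<'a> {
    name: &'a str,
    from_year: &'a str,
    to_year: &'a str,
    month: &'a str,
    day: &'a str,
    time: &'a str,
    save: &'a str,
    letters: Option<&'a str>,
}

/// Start from a default value and fill in one field per column, instead of
/// carrying the already-parsed fields from one state variant to the next.
fn parse_rule_sketch(input: &str) -> Result<RuleSketch<'_>, String> {
    let mut rule = RuleSketch::default();
    let mut seen = 0;
    for (index, part) in input.split_ascii_whitespace().enumerate() {
        match index {
            0 if part == "Rule" => {}
            0 => return Err(format!("expected `Rule`, found `{part}`")),
            1 => rule.name = part,
            2 => rule.from_year = part,
            3 => rule.to_year = part,
            4 => {} // reserved column, `-` in the data files
            5 => rule.month = part,
            6 => rule.day = part,
            7 => rule.time = part,
            8 => rule.save = part,
            9 => rule.letters = if part == "-" { None } else { Some(part) },
            _ => return Err(format!("unexpected trailing field `{part}`")),
        }
        seen = index + 1;
    }
    if seen < 10 {
        return Err(format!("expected 10 fields, found {seen}"));
    }
    Ok(rule)
}
```

The trade-off is that a half-filled `RuleSketch` exists while parsing, so completeness has to be checked explicitly at the end, which the state machine gets from its final state.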

Do you want to convert the zone, continuation and link line parsers in the same PR?

djc · 2024-05-06T20:26:16Z

parse-zoneinfo/tests/snapshot.rs

+use parse_zoneinfo::line::{Line, LineParser};
+use parse_zoneinfo::FILES;
+
+#[ignore]


I added #[ignore] here because this test will fail every time we update the tz data. Not sure how big of a pain in the ass that will be to update? Using the cargo-insta tooling it is pretty easy so we might decide to just include this.

parse-zoneinfo/src/line.rs

djc · 2024-05-06T20:31:16Z

So the package test fails because I've made chrono-tz-build depend on the FILES list newly duplicated in parse-zoneinfo (to help with the snapshot test), but this doesn't work for packaging (which tests against the published version). I guess we can keep the FILES list duplicated in the repo for now and drop it once we release a new version of parse-zoneinfo?

pitdicker · 2024-05-07T17:53:16Z

I'll have a look tomorrow (also on the other PR).

pitdicker

I have not reached the end yet 😄.

It seems at some point (2017c?) zic became case-insensitive for things like Rule, Zone, Link, weekdays, month names, last. Something we should eventually support?
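If we do want this eventually, the matching itself is cheap with the standard library. A rough sketch, with hypothetical helper names and without claiming to reproduce zic's exact abbreviation rules:

```rust
/// Hypothetical helper: does `field` spell `keyword`, ignoring ASCII case?
fn is_keyword(field: &str, keyword: &str) -> bool {
    field.eq_ignore_ascii_case(keyword)
}

/// Hypothetical helper: map a three-letter month name in any ASCII case to
/// its number (1 = January).
fn month_number(field: &str) -> Option<u8> {
    const MONTHS: [&str; 12] = [
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
    ];
    MONTHS
        .iter()
        .position(|month| field.eq_ignore_ascii_case(month))
        .map(|index| index as u8 + 1)
}
```

`eq_ignore_ascii_case` does not allocate, so this should not cost much compared to the exact comparison.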

pitdicker · 2024-05-09T04:37:58Z

chrono-tz-build/Cargo.toml

@@ -17,7 +17,7 @@ case-insensitive = ["uncased", "phf/uncased"]
 regex = ["dep:regex"]

 [dependencies]
-parse-zoneinfo = { version = "0.3" }
+parse-zoneinfo = { version = "0.3", path = "../parse-zoneinfo" }


I was thinking to make these changes in my next PR 👍.

pitdicker · 2024-05-09T04:43:18Z

parse-zoneinfo/src/lib.rs

@@ -38,3 +38,15 @@ pub mod line;
 pub mod structure;
 pub mod table;
 pub mod transitions;
+
+pub const FILES: &[&str] = &[


I am not sure we want to hardcode this list in parse-zoneinfo.

For my personal experiments the past year I removed backward, included backzone, occasionally included factory, and filtered parts of etcetera.

Maybe move the change out of this PR so we can discuss it separately?

pitdicker · 2024-05-09T04:47:54Z

parse-zoneinfo/src/line.rs

+        if input.chars().all(|c| c.is_ascii_digit()) {
+            return Ok(DaySpec::Ordinal(input.parse().unwrap()));
+        }
+        // Check if it stars with ‘last’, and trim off the first four bytes if


Suggested change

-        // Check if it stars with ‘last’, and trim off the first four bytes if
+        // Check if it starts with ‘last’, and trim off the first four bytes if

pitdicker · 2024-05-09T04:48:52Z

parse-zoneinfo/src/line.rs

+            return Ok(DaySpec::Ordinal(input.parse().unwrap()));
+        }
+        // Check if it stars with ‘last’, and trim off the first four bytes if
+        // it does. (Luckily, the file is ASCII, so ‘last’ is four bytes)


We don't care about ASCII with strip_prefix, right? This seems an old comment.
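For reference, a tiny sketch of why the byte count no longer matters: `str::strip_prefix` returns the remainder after the whole prefix, whatever its encoded length. The function name is made up for illustration.

```rust
/// Hypothetical helper: return what follows a leading `last`, if present.
/// `strip_prefix` removes the whole prefix, so there is no byte offset to
/// compute and nothing depends on the input being ASCII.
fn after_last(input: &str) -> Option<&str> {
    input.strip_prefix("last")
}
```

For example, `after_last("lastSun")` is `Some("Sun")` and `after_last("Sun>=8")` is `None`.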

pitdicker · 2024-05-09T04:51:17Z

parse-zoneinfo/src/line.rs

+            return Ok(DaySpec::Last(weekday));
+        }
+
+        let weekday = match input.get(..3) {


Cool, didn't know this method!

zic.c has the following comment for parsing a day column:

/*
** Day work.
** Accept things such as:
**	1
**	lastSunday
**	last-Sunday (undocumented; warn about this)
**	Sun<=20
**	Sun>=7
*/

I think we should support parsing full weekday names the way zic does, as we did with the regex, but maybe skip the last-{weekday} case.

Can you add a test for DaySpec::from_str?
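Something along these lines is what I have in mind, as a sketch only. `Weekday`, `DaySpecSketch` and the helper functions are stand-ins defined here, not the crate's types. I am assuming that a weekday may be written as any prefix of the full English name that is at least three letters long, which is simpler than zic's own abbreviation matching.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Weekday { Mon, Tue, Wed, Thu, Fri, Sat, Sun }

/// Hypothetical stand-in for `DaySpec`.
#[derive(Debug, PartialEq)]
enum DaySpecSketch {
    Ordinal(i8),
    Last(Weekday),
    LastOnOrBefore(Weekday, i8),
    FirstOnOrAfter(Weekday, i8),
}

/// Accept `Sun`, `Sund`, ... `Sunday`: any prefix of the full name with at
/// least three letters (an assumption made for this sketch).
fn parse_weekday(input: &str) -> Option<Weekday> {
    const NAMES: [(&str, Weekday); 7] = [
        ("Monday", Weekday::Mon), ("Tuesday", Weekday::Tue),
        ("Wednesday", Weekday::Wed), ("Thursday", Weekday::Thu),
        ("Friday", Weekday::Fri), ("Saturday", Weekday::Sat),
        ("Sunday", Weekday::Sun),
    ];
    if input.len() < 3 {
        return None;
    }
    NAMES
        .iter()
        .find(|(name, _)| name.starts_with(input))
        .map(|&(_, day)| day)
}

fn bad(input: &str) -> String {
    format!("invalid day spec `{input}`")
}

fn parse_day_spec(input: &str) -> Result<DaySpecSketch, String> {
    // A plain day of the month, such as `1`.
    if !input.is_empty() && input.bytes().all(|b| b.is_ascii_digit()) {
        let day = input.parse::<i8>().map_err(|_| bad(input))?;
        return Ok(DaySpecSketch::Ordinal(day));
    }
    // `lastSun` or `lastSunday`; `last-Sunday` is rejected on purpose.
    if let Some(rest) = input.strip_prefix("last") {
        return parse_weekday(rest)
            .map(DaySpecSketch::Last)
            .ok_or_else(|| bad(input));
    }
    // `Sun<=20` and `Sun>=7`.
    if let Some((day, date)) = input.split_once("<=") {
        let weekday = parse_weekday(day).ok_or_else(|| bad(input))?;
        let date = date.parse::<i8>().map_err(|_| bad(input))?;
        return Ok(DaySpecSketch::LastOnOrBefore(weekday, date));
    }
    if let Some((day, date)) = input.split_once(">=") {
        let weekday = parse_weekday(day).ok_or_else(|| bad(input))?;
        let date = date.parse::<i8>().map_err(|_| bad(input))?;
        return Ok(DaySpecSketch::FirstOnOrAfter(weekday, date));
    }
    Err(bad(input))
}
```

A test could then go through the five forms from the zic.c comment and check that `lastSun` and `lastSunday` give the same result while `last-Sunday` is an error.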

pitdicker · 2024-05-09T05:20:51Z

parse-zoneinfo/src/line.rs

+impl FromStr for TimeSpecAndType {
+    type Err = Error;
+
+    fn from_str(input: &str) -> Result<Self, Error> {


Can you please split this method over the TimeSpec and TimeSpecAndType types? I am not sure yet whether anything but wall times is allowed in zone lines, or whether the existing code took a shortcut there that we want to fix.
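Roughly what I mean by the split, as a sketch with made-up types: the plain time parser knows nothing about suffixes, and the "and type" parser only peels off the suffix letter before delegating. The `*Sketch` types are hypothetical and store the time as total seconds to keep this short. The suffix letters follow the zic man page (`w` wall, `s` standard, `u`/`g`/`z` universal).

```rust
use std::str::FromStr;

/// Hypothetical stand-in for `TimeSpec`, stored as a signed second count.
#[derive(Debug, Clone, Copy, PartialEq)]
struct TimeSpecSketch { seconds: i64 }

#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeTypeSketch { Wall, Standard, Utc }

/// Hypothetical stand-in for `TimeSpecAndType`.
#[derive(Debug, PartialEq)]
struct TimeSpecAndTypeSketch(TimeSpecSketch, TimeTypeSketch);

impl FromStr for TimeSpecSketch {
    type Err = String;

    /// Parses `-`, `h`, `h:mm` or `h:mm:ss` with an optional leading minus.
    /// A trailing type letter is an error here.
    fn from_str(input: &str) -> Result<Self, String> {
        let err = || format!("invalid time `{input}`");
        if input == "-" {
            return Ok(Self { seconds: 0 });
        }
        let (negative, digits) = match input.strip_prefix('-') {
            Some(rest) => (true, rest),
            None => (false, input),
        };
        let mut seconds = 0i64;
        let mut scale = 3600i64; // 3600, then 60, then 1, then 0
        for field in digits.split(':') {
            let numeric = !field.is_empty() && field.bytes().all(|b| b.is_ascii_digit());
            if scale == 0 || !numeric {
                return Err(err());
            }
            let value: u32 = field.parse().map_err(|_| err())?;
            seconds += i64::from(value) * scale;
            scale /= 60;
        }
        Ok(Self { seconds: if negative { -seconds } else { seconds } })
    }
}

impl FromStr for TimeSpecAndTypeSketch {
    type Err = String;

    /// Strips an optional type letter, then delegates to `TimeSpecSketch`.
    fn from_str(input: &str) -> Result<Self, String> {
        let body = &input[..input.len().saturating_sub(1)];
        let (time, kind) = match input.as_bytes().last() {
            Some(b'w') => (body, TimeTypeSketch::Wall),
            Some(b's') => (body, TimeTypeSketch::Standard),
            Some(b'u' | b'g' | b'z') => (body, TimeTypeSketch::Utc),
            _ => (input, TimeTypeSketch::Wall),
        };
        Ok(Self(time.parse()?, kind))
    }
}
```

With this shape, a column that should only hold a plain time can call the `TimeSpecSketch` parser directly and get an error for something like `2:00s`, instead of silently accepting the suffix.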

pitdicker · 2024-05-09T05:29:06Z

parse-zoneinfo/src/line.rs

+                        from_year,
+                        to_year,
+                    },
+                    "-" | "\u{2010}",


Can you please add back the comment?

pitdicker · 2024-05-09T05:44:01Z

parse-zoneinfo/src/line.rs

+impl<'a> Rule<'a> {
+    fn from_str(input: &'a str) -> Result<Self, Error> {
+        let mut state = RuleState::Start;
+        for part in input.split_ascii_whitespace() {


This no longer parses a rule with a comment?

zic.c has a getfields method (line 3722) that returns when it encounters a comment sign #.
It also supports quotation marks " surrounding each field, within which whitespace and # are allowed. Maybe we should make an iterator that works similarly instead of using split_ascii_whitespace?
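A rough sketch of such a splitter, modeled loosely on the behavior described above and not a port of zic's getfields. `split_fields` is a hypothetical name, and it returns owned strings because the quotation marks are dropped from the field, which a borrowing iterator could not do for quoted input.

```rust
/// Hypothetical field splitter: fields are separated by ASCII whitespace,
/// an unquoted `#` ends the line (even directly after a field), and double
/// quotes protect whitespace and `#` while being dropped from the result.
fn split_fields(line: &str) -> Result<Vec<String>, String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_field = false;
    let mut in_quotes = false;

    for c in line.chars() {
        if in_quotes {
            // Inside quotes everything is literal except the closing quote.
            if c == '"' {
                in_quotes = false;
            } else {
                current.push(c);
            }
        } else if c == '"' {
            in_quotes = true;
            in_field = true;
        } else if c == '#' {
            // Start of a comment: ignore the rest of the line.
            break;
        } else if c.is_ascii_whitespace() {
            if in_field {
                fields.push(std::mem::take(&mut current));
                in_field = false;
            }
        } else {
            current.push(c);
            in_field = true;
        }
    }

    if in_quotes {
        return Err("odd number of quotation marks".to_string());
    }
    if in_field {
        fields.push(current);
    }
    Ok(fields)
}
```

With this, `Link a b#comment` yields the fields `Link`, `a`, `b`, and a line that only contains a comment yields no fields at all, so the empty-line case falls out of the same code path.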

pitdicker · 2024-05-09T05:47:01Z

parse-zoneinfo/src/line.rs

+        let mut state = ZoneInfoState::Start;
+        for part in iter {
+            state = match (state, part) {
+                (st, _) if part.starts_with('#') => {


In theory a comment is allowed to come straight after a field, without whitespace in between.

pitdicker · 2024-05-09T05:58:54Z

parse-zoneinfo/Cargo.toml

@@ -13,3 +13,6 @@ keywords = ["date", "time", "timezone", "zone", "calendar"]
 version = "1.3.1"
 default-features = false
 features = ["std", "unicode-perl"]
+
+[dev-dependencies]
+insta = "1.38"


I understand why you added this test. Not sure about it though.

Would it be better to add this test as a separate crate in the workspace?

djc requested a review from pitdicker April 15, 2024 21:28

pitdicker reviewed Apr 16, 2024

djc mentioned this pull request Apr 22, 2024

Use regex-lite for chrono-tz-build #170

Open

djc added 9 commits May 6, 2024 13:21

Use local copy of parse-zoneinfo

3dc14b5

Track files in parse-zoneinfo

413d981

parse-zoneinfo: add snapshot tests

2833191

parse-zoneinfo: replace day_spec parser with simple Rust code

e9a5ccb

parse-zoneinfo: use simple Rust code for parsing times

30164b7

parse-zoneinfo: replace rule parser with simple state machine

40c40db

parse-zoneinfo: use state machine for parsing zone lines

720830c

parse-zoneinfo: replace link parser with simple Rust code

a538210

parse-zoneinfo: use simple Rust code to parse empty lines

0a7f262

djc force-pushed the no-regexx branch from 8c821bb to 0a7f262 Compare May 6, 2024 20:21

djc commented May 6, 2024

pitdicker reviewed May 9, 2024
