Make tz.tzstr fail if an invalid GNU tz string is provided #581
Conversation
8b3c14e to 3431b26
@pablogsal Thanks Pablo! @jbrockmendel Since you're the resident parser expert, do you want to take a look? I gave it a cursory glance and discussed this in person when it was in development.
@pablogsal Do you think you can add tests to get the diff coverage up to 100%?
Also, I notice that the documentation lists a bunch of supported formats, some of which seem to match the formats that are being changed here. We should probably either add support for those formats (which can be a separate PR) or change the documentation.
I'm also a little freaked out that so many tests had to be changed - I'm curious to know if they ever worked and if so when they broke (I'll take a look at this later). The fact that these formats were explicitly tested and mentioned in the documentation seems like pretty strong evidence that we shouldn't drop support for them, but if I understand this PR correctly, it seems we've never actually supported them (as far as I can tell), which makes me think that they're not widely used.
@pganssle The majority of those are default-initialized in the parser, so the parser was ignoring most of the string and falling back to defaults. The tests "worked", but most of the string used to create the objects was not actually used, so this format is not being correctly parsed. As a corollary, no matter what you write in this format, the result will be the same. Example: >>> import dateutil.tz
>>> s = "EST5EDT,4,1,0,7200,10,-1,0,7200,3600"
>>> s2 = "EST5EDT,4,4,3,7200,11,-1,0,7200,3600"
>>> tz1 = dateutil.tz.tzstr(s)
>>> tz2 = dateutil.tz.tzstr(s2)
>>> tz1.__dict__
{'_s': 'EST5EDT,4,1,0,7200,10,-1,0,7200,3600', '_std_abbr': 'EST', '_dst_abbr': 'EDT', '_std_offset': datetime.timedelta(-1, 68400), '_dst_offset': datetime.timedelta(-1, 72000), '_start_delta': relativedelta(hours=+2, month=4, day=1, weekday=SU(+1)), '_end_delta': relativedelta(hours=+1, month=10, day=31, weekday=SU(-1)), '_dst_base_offset_': datetime.timedelta(0, 3600), 'hasdst': True}
>>> tz2.__dict__
{'_s': 'EST5EDT,4,4,3,7200,11,-1,0,7200,3600', '_std_abbr': 'EST', '_dst_abbr': 'EDT', '_std_offset': datetime.timedelta(-1, 68400), '_dst_offset': datetime.timedelta(-1, 72000), '_start_delta': relativedelta(hours=+2, month=4, day=1, weekday=SU(+1)), '_end_delta': relativedelta(hours=+1, month=10, day=31, weekday=SU(-1)), '_dst_base_offset_': datetime.timedelta(0, 3600), 'hasdst': True}
The string formats were selected from the previous values of the tests that have been changed. Notice that both strings generate the same internal state despite being different.
After investing some more time analyzing what's going on, I think I have spotted a major problem. From this string (not GNU): "GMT0BST,3,0,30,3600,10,0,26,7200" the lexer creates these tokens:
['GMT', '0', 'BST', ',', '3', ',', '0', ',', '30.3600', ',', '10.0', ',', '26.7200']
and the parser NEVER enters the condition that starts parsing this format: elif (8 <= l.count(',') <= 9 and not [y for x in l[i:] if x != ',' for y in x if y not in "0123456789"])
This is because the lexer has created "26.7200" and "30.3600" as tokens instead of producing [26, 7200] or [30, 3600], and the current implementation of the parser does not like this. It is even worse because the tokenization is not coherent between the start and end date: in the tokenization above, the start month is "3" but the end month is "10.0". It seems (just an impression) that the tokenizer was not prepared to parse these strings.
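For reference, the tokenization above can be reproduced with the semi-public lexer. A minimal sketch (the exact token list depends on the dateutil version in use, since later commits changed the comma-joining behaviour):
>>> from dateutil.parser._parser import _timelex
>>> _timelex.split("GMT0BST,3,0,30,3600,10,0,26,7200")
['GMT', '0', 'BST', ',', '3', ',', '0', ',', '30.3600', ',', '10.0', ',', '26.7200']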
Interesting. This touches on a discussion @jbrockmendel and I were having in #541. I'm going to have to go through the history to see when exactly those tests broke, but I don't think that fixing this issue should block the acceptance of this PR, which solves a different goal anyway. For now can you add back in the old tests and just use @pytest.mark.xfail on them? I think just restoring the old tests and marking them xfail + adding tests that cover the new code you added should be good, though I'm interested to hear Brock's comments.
… format and add tests
@pganssle In 04d9c2e I have fixed the tokenizer and the parser and I have added tests that exercise multiple new (and previously uncovered) branches. The old tests are passing with the previous values (so they remain unchanged compared with the base branch), and they are testing correctly now because the parsing is working and being enforced, so there is no need to xfail them. It has been a big fight for my relaxed Sunday, but it seems to be working as intended. 😄
The only problem now is that ParserTest.testPythonLoggerFormat is failing. This is due to a change in the tokenizer introduced in bc69c3f (which is probably the commit that broke the tests), and it seems to be what broke the parser for the default strings, because it makes the tokenizer join comma-separated numbers into a single token.
1b678d4 allows both behaviours, and now everything is passing and the tests remain unchanged.
I'll take a close look at this but won't be able to get to it until tomorrow afternoon.
dateutil/parser/_parser.py
Outdated
@@ -57,7 +57,7 @@ class _timelex(object):
     # Fractional seconds are sometimes split by a comma
     _split_decimal = re.compile("([.,])")

-    def __init__(self, instream):
+    def __init__(self, instream, join_numeric_separated_by_comma=True):
Hm.. I don't love this. _timelex is semi-public, and this is not something I would ever put in a public API.
I agree. But we need either to select one behavior, implement some flag in this fashion, or completely separate the dependencies. What is your preference?
Although I don't like this design, I put it in so that (1) you can easily identify what the problem is and (2) we can better discuss a possible implementation of the fix.
Hmmm... I think I see what's going on here.
I find it a bit strange that _tzparser re-uses the lexer like this. I see three other possible ways of going here:
1. Split out _timelexer and _tzlexer into two different lexers with independent rules.
2. Do some sort of token splitting downstream based on the context.
3. Reverse the commit that keeps the comma token together and dynamically re-join it as necessary.
None of them are particularly satisfying.
Another way to go with your approach is to come up with some more satisfying name for that flag - if there's some unifying theme for the cases where you'd want it to be true, then I'm more OK with adding the flag.
The thing is, 2 and 3 seem wrong because they violate separation of concerns, but we may need to do either 2 or 3 if we want to also solve #203 and other comma-related problems caused by this.
If we want something to move forward while we implement 1, 2 or 3, we can use a new private API on _timelex that activates the flag, so this is invisible to the end user. It's not great, but it is lightweight and it will keep this PR logically separated from bigger fixes. I think 3 underscores will be enough for privacy (_parser._timelex._privatestuff). 😉
Having gone through the comments but not yet the code: is there a viable solution here that does not require addressing the _timelex design? If so, I expect that is the best short-term solution.
This touches on #541, and as @pganssle says ideally that design question should be addressed separately.
IIRC the comma-as-decimal thing is mostly there for the python logging format. If that is correct, we can more strictly check for something like '(?P<h>\d{1,2}):(?P<m>\d{2})(:(?P<s>\d{2}([,\.]\d+)?))' (though likely not using regex).
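As a rough illustration of that stricter check (just the pattern above applied to a Python-logging-style timestamp, not anything dateutil does today):
>>> import re
>>> pat = re.compile(r'(?P<h>\d{1,2}):(?P<m>\d{2})(:(?P<s>\d{2}([,\.]\d+)?))')
>>> m = pat.search('2003-07-08 16:49:45,896')
>>> m.group('h'), m.group('m'), m.group('s')
('16', '49', '45,896')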
@jbrockmendel There are a large number of possible hacky workarounds we could do, but it's basically the same problem as with this PR - right now we have some tokens that are combined when they shouldn't be, so the options are:
- Add an additional tokenizer step in the parser that splits these aggressively-combined tokens (no direct effect on _timelex, but does couple to the current behavior).
- Make it so _timelex doesn't combine these anymore and add a re-combination step in the parser portion.
- Some variation on Pablo's solution, e.g. "tell _timelex which behavior we want when we call it" - this can either be a flag like what Pablo is doing, or we subclass _timelex to make a _tzlex, and the two vary depending on this behavior (a sketch of this variant is below).
I don't see any other obvious solution, since in some contexts you want the tokens combined and in some contexts you want the tokens split.
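For concreteness, the subclass variation could look roughly like this. This is a hypothetical sketch only; _tzlex and its regex-based split are illustrative, not an actual dateutil API:
import re
from dateutil.parser._parser import _timelex

class _tzlex(_timelex):
    # Hypothetical lexer variant that never joins comma-separated numbers.
    @classmethod
    def split(cls, s):
        # Emit letter runs, digit runs and the separators , : . as separate tokens.
        return [tok for tok in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)', s) if tok]

print(_tzlex.split("GMT0BST,3,0,30,3600,10,0,26,7200"))
# ['GMT', '0', 'BST', ',', '3', ',', '0', ',', '30', ',', '3600', ',', '10', ',', '0', ',', '26', ',', '7200']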
dateutil/parser/_parser.py
Outdated
-    def split(cls, s):
-        return list(cls(s))
+    def split(cls, s, join_numeric_separated_by_comma=True):
+        return list(cls(s,join_numeric_separated_by_comma))
flake8 whitespace after comma
dateutil/parser/_parser.py
Outdated
@@ -77,6 +77,7 @@ def __init__(self, instream):
         self.charstack = []
         self.tokenstack = []
         self.eof = False
+        self.join_numeric_separated_by_comma = join_numeric_separated_by_comma
Can you add a comment for why this option exists, namely for the Python logging format?
Is it just the python logging format? I think that it's also for the ISO-8601 format, which technically (though not in practice) prefers comma as the fractional separator.
Oh neat, I had no idea. I guess I've seen that used in conjunction with Euros, e.g. "€ 29.123,45"
Does the ISO usage need to be for any numbers or is it just for fractional seconds? i.e. will the check-for-colons above work?
Technically it's for all fractional time components, but I think basically no one uses anything except fractional seconds. Still, this is technically a valid ISO-8601 datetime:
2014-01-01T00,50 -> datetime(2014, 1, 1, 0, 30)
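Just to spell out the comma-fraction arithmetic (plain datetime math, not dateutil's parser):
>>> from datetime import datetime, timedelta
>>> datetime(2014, 1, 1) + timedelta(hours=0.50)   # "00,50" means half an hour
datetime.datetime(2014, 1, 1, 0, 30)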
Is there a spec that can be linked to describing how tzparser is intended to work?
dateutil/parser/_parser.py
Outdated
            else:
                offattr = "dstoffset"
                res.dstabbr = "".join(l[i:j])
                for ii in range(j):
                    used_tokens[ii] = True
This loop is done in both branches of this if/else. Can you de-indent and just do it once?
@jbrockmendel That's part of the problem. If we were to take …
dateutil/parser/_parser.py
Outdated
@@ -1464,6 +1494,7 @@ def parse(self, tzstr):
         except (IndexError, ValueError, AssertionError):
             return None

+        res.unused_tokens = not {token for token,is_used in zip(l,used_tokens) if not is_used}.issubset({",",":"})
This took some staring to figure out. If nothing else, please break this into a couple of lines starting with unused_tokens = {token for token,is_used in zip(l,used_tokens) if not is_used}.
Related, the name res.unused_tokens suggests that it is list-like, giving the tokens themselves. If this is a bool, maybe something like res.any_unused_tokens?
More broadly, I'd suggest the following: implement used_tokens as used_idxs, mirroring skipped_idxs in _parser._parse. Then each occurrence of used_tokens[i] = True becomes used_idxs.append(i), and {token for token,is_used in zip(l,used_tokens) if not is_used} becomes:
unused_idxs = set(range(len_l)).difference(used_idxs)
unused_tokens = {l[n] for n in unused_idxs}
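A standalone sketch of that suggestion; the token list and consumed indices below are made up for illustration, they are not what the parser actually produces:
>>> l = ['EST', '5', 'EDT', ',', '4', ',', '1', ',', '0']
>>> used_idxs = [0, 1, 2, 4, 6, 8]   # appended as the parser consumes tokens
>>> unused_idxs = set(range(len(l))).difference(used_idxs)
>>> {l[n] for n in unused_idxs}
{','}
>>> not {l[n] for n in unused_idxs}.issubset({",", ":"})   # any_unused_tokens
False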
Minor style comments, generally LGTM conditional upon finding a solution to the _timelex issue.
Is using timelex.split even necessary here? Would just splitting on commas+colons+slashes+whitespace get the job done?
This is kinda what I was thinking. It's a little weird that …
@pganssle @jbrockmendel In 4f9e235 I have implemented a simple regexp lexer that does the trick (and is 35x ± 1.2 times faster), so this leaves … Also …
i += 1
if res.dstabbr:
break
else:
break

if i < len_l:
for j in range(i, len_l):
if l[j] == ';':
Can I eliminate this as a possibility for an alternate separator? It is not being hit by the tests and clearly does not work if you consistently use ";" as a separator. (Line 1380)
I really wish I knew, honestly. :( It's really hard to tell what people are using this for (if they're using it at all).
@@ -1313,7 +1315,8 @@ def __init__(self):

     def parse(self, tzstr):
         res = self._result()
-        l = _timelex.split(tzstr)
+        l = [x for x in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)',tzstr) if x]
If we see something like "foo,:bar" with two of [,:.] in a row, is that indicative of an invalid string?
That is my understanding. The parser will fail in that case. I will push for the tokenizer doing nothing other than tokenizing, and just letting the parser naturally fail to parse a consecutive [,:.].
The reason why I ask is to see if the if x clause is necessary.
I'm not sure I understand this regular expression completely, but I think the only problem with this approach that I see is that any split-based function is going to discard the tokens themselves. Are there really no constraints on what the inputs can be for valid TZ strings?
Also, with the if x filter, how would you distinguish between One,Two and One,,Two?
In [6]: [x for x in re.split(r'[.,;]', 'One,Two') if x]
Out[6]: ['One', 'Two']
In [7]: [x for x in re.split(r'[.,;]', 'One,;Two') if x]
Out[7]: ['One', 'Two']
You forgot the capturing group. If you use the same whole expression with the capturing group included you get:
>>> [x for x in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)','One,Two') if x]
['One', ',', 'Two']
>>> [x for x in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)','One,;Two') if x]
['One', ',', ';', 'Two']
Oh wow, that's a neat trick. I didn't realize it worked like that. Good to know.
@jbrockmendel Any other comments on this? I'm pretty happy with this solution so far.
@@ -1464,6 +1492,8 @@ def parse(self, tzstr):
         except (IndexError, ValueError, AssertionError):
             return None

+        unused_idxs = set(range(len_l)).difference(used_idxs)
+        res.any_unused_tokens = not {l[n] for n in unused_idxs}.issubset({",",":"})
It's kinda amazing to me that this seems to be universally faster than any(l[n] not in {',', ':'} for n in unused_idxs), even when you pull the {',', ':'} creation out of the loop.
Maybe if you have some absurd number of tokens the set creation portion would dominate and this would get slow.
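A rough micro-benchmark sketch of the two variants being compared; the absolute numbers will vary with machine and token count, and the inputs below are made up:
import timeit

l = ['EST', '5', 'EDT', ',', '4', ',', '1', ',', '0']
unused_idxs = {3, 5, 7}
seps = {',', ':'}

def set_based():
    return not {l[n] for n in unused_idxs}.issubset(seps)

def any_based():
    return any(l[n] not in seps for n in unused_idxs)

print(timeit.timeit(set_based, number=100000))
print(timeit.timeit(any_based, number=100000))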
LGTM
OK, I'll clean this up a bit to my exacting standards that feel too nitpicky to mention in an actual code review and then merge.
@pablogsal I have made some changes, mostly to actually test the behavior of …
I'm just going to go ahead and merge. If there are issues they can be addressed in a separate PR.
This PR fixes #259. As discussed with @pganssle at the dateutil sprint in London, the fix consists of modifying dateutil.parser._parser._tzparser.parse to check whether all tokens generated from the input string have been consumed (except separators, which are assumed to be "," and ":" as specified here). To have some control over (and to help future debugging of) the tokens that are not used, a used_tokens mark is kept internally to track this. To avoid making the parser raise and therefore modifying the current behaviour, a new slotted field was added to dateutil.parser._parser._tzparser._result that indicates whether all tokens have been consumed. This is used in dateutil.tz.tz.tzstr to raise accordingly in the case that this check fails (i.e. not all tokens have been consumed).
8 tests of the test suite for tz.tz.tzstr were failing because they were using invalid strings and the call to tz.tz.tzstr now raises. These tests were using "default" arguments that were intended to exercise the default values of the parser, tzstr and tzrange constructs. To fix these tests, the input strings have been changed to GNU tz-compatible equivalents.
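A quick sketch of the resulting behaviour (the exact exception message may differ between versions; the invalid example is the one from #259):
>>> from dateutil import tz
>>> tz.tzstr("EST5EDT,M3.2.0/2,M11.1.0/2")   # valid GNU TZ string, parses fine
tzstr('EST5EDT,M3.2.0/2,M11.1.0/2')
>>> tz.tzstr("America/New_York")             # not a TZ string, now rejected
Traceback (most recent call last):
  ...
ValueError: unknown string format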