
Make tz.tzstr fail if an invalid GNU tz string is provided #581

Merged
pganssle merged 9 commits into dateutil:master from pablogsal:issue259 on Dec 19, 2017

Conversation

pablogsal
Contributor

This PR fixes #259. As discussed with @pganssle at the dateutil sprint in London, the fix consists in modifying dateutil.parser._parser._tzparser.parse to check that all tokens generated from the input string have been consumed (except separators, which are assumed to be "," and ":" as specified here). To keep some control over the unused tokens (and help future debugging), a used_tokens mark is kept internally to track this. To avoid making the parser itself raise, and thereby changing the current behaviour, a new slotted field was added to dateutil.parser._parser._tzparser._result that indicates whether all tokens have been consumed. dateutil.tz.tz.tzstr uses this field to raise accordingly when the check fails (that is, when not all tokens have been consumed).

Eight tests in the test suite for tz.tz.tzstr were failing because they used invalid strings, and the call to tz.tz.tzstr now raises. These tests used "default" arguments intended to exercise the default values that the parser, tzstr and tzrange construct. To fix these tests, the input strings have been changed to GNU-TZ-compatible equivalents.
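
For illustration, a minimal sketch of the intended behaviour after this change (the exact exception type raised on failure is an assumption here, not quoted from the diff):

from dateutil import tz

# A valid GNU TZ string parses as before.
eastern = tz.tzstr("EST5EDT,M3.2.0/2,M11.1.0/2")

# An invalid string now raises instead of silently producing a
# default-initialized zone (exception type assumed to be ValueError).
try:
    tz.tzstr("this is not a valid TZ string")
except ValueError as exc:
    print("rejected:", exc)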

@pablogsal pablogsal force-pushed the issue259 branch 2 times, most recently from 8b3c14e to 3431b26 on December 10, 2017 at 15:35
@pganssle
Member

@pablogsal Thanks Pablo!

@jbrockmendel Since you're the resident parser expert, do you want to take a look? I gave it a cursory glance and discussed this in person when it was in development.

@pganssle pganssle added this to the 2.7.0 milestone Dec 10, 2017
@pganssle pganssle left a comment

@pablogsal Do you think you can add tests to get the diff coverage up to 100%?

Also, I notice the documentation lists a bunch of supported formats, some of which seem to match the formats that are being changed here. We should probably either add support for those formats (which can be a separate PR) or change the documentation.

I'm also a little freaked out that so many tests had to be changed - I'm curious to know if they ever worked and if so when they broke (I'll take a look at this later). The fact that these formats were explicitly tested and mentioned in the documentation seems like pretty strong evidence that we shouldn't drop support for them, but if I understand this PR correctly, it seems we've never actually supported them (as far as I can tell), which makes me think that they're not widely used.

@pablogsal
Contributor Author

pablogsal commented Dec 10, 2017

@pganssle Most of those fields are default-initialized in the parser, so the parser was ignoring most of the string and falling back to the defaults. They "worked", but most of the string used to create the objects was never actually used, so this format is not being correctly parsed. As a corollary, no matter what you write in this format, the result is the same. Example:

>>> import dateutil.tz
>>> s = "EST5EDT,4,1,0,7200,10,-1,0,7200,3600"
>>> s2 = "EST5EDT,4,4,3,7200,11,-1,0,7200,3600"
>>> tz1 = dateutil.tz.tzstr(s)
>>> tz2 = dateutil.tz.tzstr(s2)
>>> tz1.__dict__
{'_s': 'EST5EDT,4,1,0,7200,10,-1,0,7200,3600', '_std_abbr': 'EST', '_dst_abbr': 'EDT', '_std_offset': datetime.timedelta(-1, 68400), '_dst_offset': datetime.timedelta(-1, 72000), '_start_delta': relativedelta(hours=+2, month=4, day=1, weekday=SU(+1)), '_end_delta': relativedelta(hours=+1, month=10, day=31, weekday=SU(-1)), '_dst_base_offset_': datetime.timedelta(0, 3600), 'hasdst': True}
>>> tz2.__dict__
{'_s': 'EST5EDT,4,4,3,7200,11,-1,0,7200,3600', '_std_abbr': 'EST', '_dst_abbr': 'EDT', '_std_offset': datetime.timedelta(-1, 68400), '_dst_offset': datetime.timedelta(-1, 72000), '_start_delta': relativedelta(hours=+2, month=4, day=1, weekday=SU(+1)), '_end_delta': relativedelta(hours=+1, month=10, day=31, weekday=SU(-1)), '_dst_base_offset_': datetime.timedelta(0, 3600), 'hasdst': True}

The string formats were taken from the previous values of the tests that have been changed. Notice that both generate relativedelta(hours=+2, month=4, day=1, weekday=SU(+1)) and relativedelta(hours=+1, month=10, day=31, weekday=SU(-1)) as start_delta and end_delta. So even if this format was supported at some point, it seems it was being parsed incorrectly.

@pablogsal
Contributor Author

pablogsal commented Dec 10, 2017

After spending some more time analyzing what's going on, I think I have spotted a major problem. From this string (not GNU):

"GMT0BST,3,0,30,3600,10,0,26,7200"

The lexer creates these tokens:

['GMT', '0', 'BST', ',', '3', ',', '0', ',', '30.3600', ',', '10.0', ',', '26.7200']

And the parser NEVER enters the condition that starts parsing this format:

elif (8 <= l.count(',') <= 9 and not [y for x in l[i:] if x != ','  for y in x if y not in "0123456789"])

This is because the lexer has created "26.7200" and "30.3600" as tokens instead of using ['26', '7200'] and ['30', '3600']. The current implementation of the parser does not like this.

This is even worse because the tokenization is not coherent between the start and end dates. In the last tokenization, the start month is "3" but the end month is "10.0". It seems (just an impression) that the tokenizer was not prepared to parse these strings.
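
For anyone reproducing this, the behaviour described above can be seen directly on the lexer (import path assumed from the current layout of the parser package):

>>> from dateutil.parser._parser import _timelex
>>> _timelex.split("GMT0BST,3,0,30,3600,10,0,26,7200")
['GMT', '0', 'BST', ',', '3', ',', '0', ',', '30.3600', ',', '10.0', ',', '26.7200']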

@pganssle
Member

Interesting. This touches on a discussion @jbrockmendel and I were having in #541.

I'm going to have to go through the history to see when exactly those tests broke, but I don't think that fixing this issue should block the acceptance of this PR, which solves a different goal anyway.

For now can you add back in the old tests and just use @pytest.mark.xfail on them?

I think just restoring the old tests and marking them xfail + adding tests that cover the new code you added should be good, though I'm interested to hear Brock's comments.

@pablogsal
Contributor Author

pablogsal commented Dec 10, 2017

@pganssle In 04d9c2e I have fixed the tokenizer and the parser, and I have added tests that exercise multiple new (and previously uncovered) branches. The old tests pass with their previous values (so they remain unchanged compared with the base branch), and they now test correctly because the parsing works and is enforced, so there is no need to xfail them. It has been a big fight for my relaxing Sunday, but it seems to be working as intended. 😄

@pablogsal
Contributor Author

pablogsal commented Dec 10, 2017

The only problem now is that ParserTest.testPythonLoggerFormat is failing. This is due to a change in the tokenizer introduced in bc69c3f (which is probably the commit that broke the tests): it seems to be what broke the parser for the default strings, because it makes the tokenizer produce ['GMT', '0', 'BST', ',', '3', ',', '0', ',', '30.3600', ',', '10.0', ',', '26.7200'] (the bad format). I will adapt the tokenizer to admit both possibilities.

@pablogsal
Contributor Author

1b678d4 allows both behaviours; now everything is passing and the tests remain unchanged.

@jbrockmendel
Contributor

I'll take a close look at this but won't be able to get to it until tomorrow afternoon.

@@ -57,7 +57,7 @@ class _timelex(object):
# Fractional seconds are sometimes split by a comma
_split_decimal = re.compile("([.,])")

def __init__(self, instream):
def __init__(self, instream, join_numeric_separated_by_comma=True):
Member

Hm.. I don't love this. _timelex is semi-public, and this is not something I would ever put in a public API.

Contributor Author

I agree. But we need to either select one behavior, implement some flag in this fashion, or completely separate the dependencies. What is your preference?

Contributor Author

Although I don't like this design, I put it in so that (1) you can easily identify what the problem is and (2) we can better discuss a possible implementation of the fix.

Member

Hmmm... I think I see what's going on here.

I find it a bit strange that _tzparser re-uses the lexer like this. I see three other possible ways of going here:

  1. Split out _timelexer and _tzlexer into two different lexers with independent rules.
  2. Do some sort of token splitting downstream based on the context.
  3. Reverse the commit that keeps the comma token together and dynamically re-join it as necessary.

None of them are particularly satisfying.

Another way to go with your approach is to come up with some more satisfying name for that flag - if there's some unifying theme for the cases where you'd want it to be true, then I'm more OK with adding the flag.

The thing is, 2 and 3 seem wrong because they violate separation of concerns, but we may need to do either 2 or 3 if we want to also solve #203 and other comma-related problems caused by this.

Contributor Author

If we want to move forward while we implement 1, 2 or 3, we can use a new private API on _timelex that activates the flag, so this is invisible to the end user. It's not good, but it is lightweight and it will keep this PR logically separated from bigger fixes. I think three underscores will be enough for privacy (_parser._timelex._privatestuff). 😉

Contributor

Having gone through the comments but not yet the code: is there a viable solution here that does not require addressing the _timelex design? If so, I expect that is the best short-term solution.

Contributor

This touches on #541, and as @pganssle says ideally that design question should be addressed separately.

IIRC the comma-as-decimal thing is mostly there for the python logging format. If that is correct, we can more strictly check for something like '(?P<h>\d{1,2}):(?P<m>\d{2})(:(?P<s>\d{2}([,\.]\d+)?)) (though likely not using regex)

Member

@jbrockmendel There are a large number of possible hacky workarounds we could do, but it's basically the same problem as with this PR - right now we have some tokens that are combined when they shouldn't be, so the options are:

  1. Add an additional tokenizer step in the parser that splits these aggressively-combined tokens (no direct effect on _timelex, but does couple to the current behavior)
  2. Make it so _timelex doesn't combine these anymore and add a re-combination step in the parser portion.
  3. Some variation on Pablo's solution, e.g. "tell _timelex which behavior we want when we call it" - this can either be a flag like what Pablo is doing or we subclass _timelex to make a _tzlex, and the two vary depending on this behavior.

I don't see any other obvious solution, since in some contexts you want the tokens combined and in some contexts you want the tokens split.
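
For concreteness, option 3 in subclass form might look roughly like this — a sketch only, built on the join_numeric_separated_by_comma flag as it currently stands in this PR; the class name _tzlex and the import path are placeholders:

from dateutil.parser._parser import _timelex

class _tzlex(_timelex):
    # Lexer variant for TZ strings: never joins numeric tokens across a
    # comma, so "30,3600" stays as two tokens instead of becoming "30.3600".
    def __init__(self, instream):
        super(_tzlex, self).__init__(
            instream, join_numeric_separated_by_comma=False)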

def split(cls, s):
return list(cls(s))
def split(cls, s, join_numeric_separated_by_comma=True):
return list(cls(s,join_numeric_separated_by_comma))
Contributor

flake8: missing whitespace after comma

@@ -77,6 +77,7 @@ def __init__(self, instream):
self.charstack = []
self.tokenstack = []
self.eof = False
self.join_numeric_separated_by_comma = join_numeric_separated_by_comma
Contributor

Can you add a comment explaining why this option exists, namely the Python logging format?

Member

Is it just the python logging format? I think that it's also for the ISO-8601 format, which technically (though not in practice) prefers comma as the fractional separator.

Contributor

Oh neat, I had no idea. I guess I've seen that used in conjunction with Euros, e.g. "€ 29.123,45"

Does the ISO usage need to be for any numbers or is it just for fractional seconds? i.e. will the check-for-colons above work?

Member

Technically it's for all fractional time components, but I think basically no one uses anything except fractional seconds. Still, this is technically a valid ISO-8601 datetime:

2014-01-01T00,50 -> datetime(2014, 1, 1, 0, 30)
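
A quick illustration of the arithmetic behind that example (plain datetime maths, not a claim about how the parser handles the string):

>>> from datetime import datetime, timedelta
>>> datetime(2014, 1, 1) + timedelta(hours=float("00,50".replace(",", ".")))
datetime.datetime(2014, 1, 1, 0, 30)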

@jbrockmendel
Contributor

Is there a spec that can be linked to describing how tzparser is intended to work?

else:
offattr = "dstoffset"
res.dstabbr = "".join(l[i:j])
for ii in range(j):
used_tokens[ii] = True
Contributor

This loop is done in both branches of this if/else. Can you de-indent and just do it once?

@pganssle
Member

@jbrockmendel That's part of the problem. _tzparser itself is actually private, but it's used in a few places internally. The closest thing to a "spec" is the examples in the examples documentation.

If we were to take tzstr's documentation at its word that it only supports GNU TZ strings, then we could use that as a spec, but the documentation and examples seem to indicate that it actually supports other formats as well.

@@ -1464,6 +1494,7 @@ def parse(self, tzstr):
except (IndexError, ValueError, AssertionError):
return None

res.unused_tokens = not {token for token,is_used in zip(l,used_tokens) if not is_used}.issubset({",",":"})
Contributor

This took some staring at to figure out. If nothing else, please break this into a couple of lines starting with unused_tokens = {token for token, is_used in zip(l, used_tokens) if not is_used}.

Related: the name res.unused_tokens suggests that it is list-like, giving the tokens themselves. If this is a bool, maybe something like res.any_unused_tokens?

More broadly, I'd suggest the following:

Implement used_tokens as used_idxs mirroring skipped_idxs in _parser._parse. Then each occurrence of used_tokens[i] = True becomes used_idxs.append(i). Then {token for token,is_used in zip(l,used_tokens) if not is_used} becomes:

unused_idxs = set(range(len_l)).difference(used_idxs)
unused_tokens = {l[n] for n in unused_idxs}

@jbrockmendel jbrockmendel left a comment

Minor style comments, generally LGTM conditional upon finding a solution to the _timelex issue.

Is using timelex.split even necessary here? Would just splitting on commas+colons+slashes+whitespace get the job done?

@pganssle
Member

Is using timelex.split even necessary here?

This is kinda what I was thinking. It's a little weird that _timelex is being reused for this purpose.

@pablogsal
Contributor Author

pablogsal commented Dec 14, 2017

@pganssle @jbrockmendel In 4f9e235 I have implemented a simple regexp lexer that does the trick (and is 35x ± 1.2 faster), so this leaves _timelex and its problems unchanged, making them a problem for another day/PR. Notice that this lexer is not as straightforward as splitting on commas+colons+slashes+whitespace, because you have to split into numeric/non-numeric tokens as well. In addition, I have addressed all the style feedback that you indicated. 😄

Also, tzparse is now 97.56% covered, which is a noticeable improvement compared with the previous situation.
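
For reference, the regex-based split (visible in the diff further down) now keeps the numeric fields of the earlier problem string separate:

>>> import re
>>> [x for x in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)', "GMT0BST,3,0,30,3600,10,0,26,7200") if x]
['GMT', '0', 'BST', ',', '3', ',', '0', ',', '30', ',', '3600', ',', '10', ',', '0', ',', '26', ',', '7200']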

i += 1
if res.dstabbr:
break
else:
break


if i < len_l:
for j in range(i, len_l):
if l[j] == ';':
@pablogsal pablogsal Dec 14, 2017

Can I eliminate this as a possibility for an alternate separator? It is not being hit by the tests and clearly does not work if you consistently use ";" as a separator. (Line 1380)

@pganssle pganssle Dec 14, 2017

I really wish I knew, honestly. :( It's really hard to tell what people are using this for (if they're using it at all).

@@ -1313,7 +1315,8 @@ def __init__(self):

def parse(self, tzstr):
res = self._result()
l = _timelex.split(tzstr)
l = [x for x in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)',tzstr) if x]
Contributor

If we see something like "foo,:bar" with two of [,:.] in a row, is that indicative of an invalid string?

@pablogsal pablogsal Dec 14, 2017

That is my understanding. The parser will fail in that case. I would push for the tokenizer doing nothing other than tokenizing, and just letting the parser naturally fail to parse consecutive [,:.] tokens.

Contributor

The reason I ask is to see whether the if x clause is necessary.

@pablogsal pablogsal Dec 14, 2017

This is necessary for Python < 3.6, due to the way re.split works across these two versions. I have made a failing commit to prove this in b15f38a. Here are the CI results of that commit. 3428f2f reverts b15f38a.

Member

I'm not sure I understand this regular expression completely, but the only problem with this approach that I see is that any split-based function is going to discard the tokens themselves. Are there really no constraints on what the inputs can be for valid TZ strings?

Also, with the if x filter, how would you distinguish between One,Two and One,,Two?

In [6]: [x for x in re.split(r'[.,;]', 'One,Two') if x]
Out[6]: ['One', 'Two']

In [7]: [x for x in re.split(r'[.,;]', 'One,;Two') if x]
Out[7]: ['One', 'Two']

@pablogsal pablogsal Dec 14, 2017

You forgot the capturing group. If you use the same whole expression with the capturing group included you get:

>>> [x for x in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)','One,Two') if x]
['One', ',', 'Two']
>>> [x for x in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)','One,;Two') if x]
['One', ',', ';', 'Two']

Member

Oh wow, that's a neat trick. I didn't realize it worked like that. Good to know.

@pganssle
Member

@jbrockmendel Any other comments on this? I'm pretty happy with this solution so far.

@@ -1464,6 +1492,8 @@ def parse(self, tzstr):
except (IndexError, ValueError, AssertionError):
return None

unused_idxs = set(range(len_l)).difference(used_idxs)
res.any_unused_tokens = not {l[n] for n in unused_idxs}.issubset({",",":"})
Member

It's kinda amazing to me that this seems to be universally faster than any(l[n] not in {',', ':'} for n in unused_idxs), even when you pull the {',', ':'} creation out of the loop.

Maybe if you had some absurd number of tokens the set creation portion would dominate and this would get slow.
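
A throwaway harness along these lines can reproduce the comparison (illustrative only; the token list and index set below are made up):

import timeit

setup = """
l = ['EST', '5', 'EDT', ','] * 100
unused_idxs = set(range(0, len(l), 2))
seps = {',', ':'}
"""
stmt_set = "not {l[n] for n in unused_idxs}.issubset({',', ':'})"
stmt_any = "any(l[n] not in seps for n in unused_idxs)"

print("set/issubset:", timeit.timeit(stmt_set, setup=setup, number=100000))
print("any/genexpr: ", timeit.timeit(stmt_any, setup=setup, number=100000))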

@jbrockmendel
Contributor

LGTM

@pganssle
Member

OK, I'll clean this up a bit to meet my exacting standards (which feel too nitpicky to mention in an actual code review) and then merge.

@pganssle
Member

@pablogsal I have made some changes, mostly to actually test the behavior of tzstr rather than simply check that it doesn't throw an error. Please let me know if you think I've done it wrong. Once you approve, I think it's ready for merge.

@pganssle
Member

I'm just going to go ahead and merge. If there are issues they can be addressed in a separate PR.

@pganssle pganssle merged commit 1255d8a into dateutil:master Dec 19, 2017
@pganssle pganssle mentioned this pull request Mar 11, 2018

Successfully merging this pull request may close these issues.

tzstr not properly detecting invalid strings