Make tz.tzstr fail if an invalid GNU tz string is provided #581
Conversation
8b3c14e to 3431b26
@pablogsal Thanks Pablo! @jbrockmendel Since you're the resident parser expert, do you want to take a look? I gave it a cursory glance and discussed this in person when it was in development.
@pablogsal Do you think you can add tests to get the diff coverage up to 100%?
Also, I notice that the documentation lists a bunch of supported formats, some of which seem to match the formats that are being changed here. We should probably either add support for those formats (which can be a separate PR) or change the documentation.
I'm also a little freaked out that so many tests had to be changed - I'm curious to know if they ever worked and if so when they broke (I'll take a look at this later). The fact that these formats were explicitly tested and mentioned in the documentation seems like pretty strong evidence that we shouldn't drop support for them, but if I understand this PR correctly, it seems we've never actually supported them (as far as I can tell), which makes me think that they're not widely used.
@pganssle The majority of those are default-initialized in the parser, so the parser was ignoring most of the string and falling back to defaults. The tests "worked", but most of the string used to create the objects was not actually used, so this format is not being correctly parsed. As a corollary, no matter what you write in this format, the result will be the same. Example: >>> import dateutil.tz
>>> s = "EST5EDT,4,1,0,7200,10,-1,0,7200,3600"
>>> s2 = "EST5EDT,4,4,3,7200,11,-1,0,7200,3600"
>>> tz1 = dateutil.tz.tzstr(s)
>>> tz2 = dateutil.tz.tzstr(s2)
>>> tz1.__dict__
{'_s': 'EST5EDT,4,1,0,7200,10,-1,0,7200,3600', '_std_abbr': 'EST', '_dst_abbr': 'EDT', '_std_offset': datetime.timedelta(-1, 68400), '_dst_offset': datetime.timedelta(-1, 72000), '_start_delta': relativedelta(hours=+2, month=4, day=1, weekday=SU(+1)), '_end_delta': relativedelta(hours=+1, month=10, day=31, weekday=SU(-1)), '_dst_base_offset_': datetime.timedelta(0, 3600), 'hasdst': True}
>>> tz2.__dict__
{'_s': 'EST5EDT,4,4,3,7200,11,-1,0,7200,3600', '_std_abbr': 'EST', '_dst_abbr': 'EDT', '_std_offset': datetime.timedelta(-1, 68400), '_dst_offset': datetime.timedelta(-1, 72000), '_start_delta': relativedelta(hours=+2, month=4, day=1, weekday=SU(+1)), '_end_delta': relativedelta(hours=+1, month=10, day=31, weekday=SU(-1)), '_dst_base_offset_': datetime.timedelta(0, 3600), 'hasdst': True}
The string formats were selected from the previous values of the tests that have been changed. Notice that both strings generate the same internal state despite being different.
After investing some more time analyzing what's going on, I think I have spotted a major problem. From this string (not GNU): "GMT0BST,3,0,30,3600,10,0,26,7200" the lexer creates these tokens:
['GMT', '0', 'BST', ',', '3', ',', '0', ',', '30.3600', ',', '10.0', ',', '26.7200']
and the parser NEVER enters the condition that starts parsing this format: elif (8 <= l.count(',') <= 9 and not [y for x in l[i:] if x != ',' for y in x if y not in "0123456789"])
This is because the lexer has created "26.7200" and "30.3600" as tokens instead of producing [26, 7200] or [30, 3600], and the current implementation of the parser does not like this. It is even worse because the tokenization is not coherent between the start and end date: in the tokenization above, the start month is "3" but the end month is "10.0". It seems (just an impression) that the tokenizer was not prepared to parse these strings.
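For reference, the tokenization above can be reproduced with the semi-public lexer. A minimal sketch (the exact token list depends on the dateutil version in use, since later commits changed the comma-joining behaviour):
>>> from dateutil.parser._parser import _timelex
>>> _timelex.split("GMT0BST,3,0,30,3600,10,0,26,7200")
['GMT', '0', 'BST', ',', '3', ',', '0', ',', '30.3600', ',', '10.0', ',', '26.7200']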
Interesting. This touches on a discussion @jbrockmendel and I were having in #541. I'm going to have to go through the history to see when exactly those tests broke, but I don't think that fixing this issue should block the acceptance of this PR, which solves a different goal anyway. For now can you add back in the old tests and just use @pytest.mark.xfail on them? I think just restoring the old tests and marking them xfail + adding tests that cover the new code you added should be good, though I'm interested to hear Brock's comments.
… format and add tests
@pganssle In 04d9c2e I have fixed the tokenizer and the parser and I have added tests that exercise multiple new (and previously uncovered) branches. The old tests are passing with the previous values (so they remain unchanged compared with the base branch), and they are testing correctly now because the parsing is working and being enforced, so there is no need to xfail them. It has been a big fight for my relaxed Sunday, but it seems to be working as intended. 😄
The only problem now is that ParserTest.testPythonLoggerFormat is failing. This is due to a change in the tokenizer introduced in bc69c3f (which is probably the commit that broke the tests), and it seems to be what broke the parser for the default strings, because it makes the tokenizer join comma-separated numbers into a single token.
1b678d4 allows both behaviours, and now everything is passing and the tests remain unchanged.
I'll take a close look at this but won't be able to get to it until tomorrow afternoon.
dateutil/parser/_parser.py
Outdated
@@ -57,7 +57,7 @@ class _timelex(object):
     # Fractional seconds are sometimes split by a comma
     _split_decimal = re.compile("([.,])")

-    def __init__(self, instream):
+    def __init__(self, instream, join_numeric_separated_by_comma=True):
Hm.. I don't love this. _timelex is semi-public, and this is not something I would ever put in a public API.
I agree. But we need either to select one behavior, implement some flag in this fashion, or completely separate the dependencies. What is your preference?
Although I don't like this design, I put it in so that (1) you can easily identify what the problem is and (2) we can better discuss a possible implementation of the fix.
Hmmm... I think I see what's going on here.
I find it a bit strange that _tzparser re-uses the lexer like this. I see three other possible ways of going here:
1. Split out _timelexer and _tzlexer into two different lexers with independent rules.
2. Do some sort of token splitting downstream based on the context.
3. Reverse the commit that keeps the comma token together and dynamically re-join it as necessary.
None of them are particularly satisfying.
Another way to go with your approach is to come up with some more satisfying name for that flag - if there's some unifying theme for the cases where you'd want it to be true, then I'm more OK with adding the flag.
The thing is, 2 and 3 seem wrong because they violate separation of concerns, but we may need to do either 2 or 3 if we want to also solve #203 and other comma-related problems caused by this.
If we want something to move forward while we implement 1, 2 or 3, we can use a new private API on _timelex that activates the flag, so this is invisible to the end user. It's not great, but it is lightweight and it will keep this PR logically separated from bigger fixes. I think 3 underscores will be enough for privacy (_parser._timelex._privatestuff). 😉
Having gone through the comments but not yet the code: is there a viable solution here that does not require addressing the _timelex design? If so, I expect that is the best short-term solution.
This touches on #541, and as @pganssle says ideally that design question should be addressed separately.
IIRC the comma-as-decimal thing is mostly there for the python logging format. If that is correct, we can more strictly check for something like '(?P<h>\d{1,2}):(?P<m>\d{2})(:(?P<s>\d{2}([,\.]\d+)?))' (though likely not using regex).
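As a rough illustration of that stricter check (just the pattern above applied to a Python-logging-style timestamp, not anything dateutil does today):
>>> import re
>>> pat = re.compile(r'(?P<h>\d{1,2}):(?P<m>\d{2})(:(?P<s>\d{2}([,\.]\d+)?))')
>>> m = pat.search('2003-07-08 16:49:45,896')
>>> m.group('h'), m.group('m'), m.group('s')
('16', '49', '45,896')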
@jbrockmendel There are a large number of possible hacky workarounds we could do, but it's basically the same problem as with this PR - right now we have some tokens that are combined when they shouldn't be, so the options are:
- Add an additional tokenizer step in the parser that splits these aggressively-combined tokens (no direct effect on _timelex, but does couple to the current behavior).
- Make it so _timelex doesn't combine these anymore and add a re-combination step in the parser portion.
- Some variation on Pablo's solution, e.g. "tell _timelex which behavior we want when we call it" - this can either be a flag like what Pablo is doing, or we subclass _timelex to make a _tzlex, and the two vary depending on this behavior (a sketch of this variant is below).
I don't see any other obvious solution, since in some contexts you want the tokens combined and in some contexts you want the tokens split.
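For concreteness, the subclass variation could look roughly like this. This is a hypothetical sketch only; _tzlex and its regex-based split are illustrative, not an actual dateutil API:
import re
from dateutil.parser._parser import _timelex

class _tzlex(_timelex):
    # Hypothetical lexer variant that never joins comma-separated numbers.
    @classmethod
    def split(cls, s):
        # Emit letter runs, digit runs and the separators , : . as separate tokens.
        return [tok for tok in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)', s) if tok]

print(_tzlex.split("GMT0BST,3,0,30,3600,10,0,26,7200"))
# ['GMT', '0', 'BST', ',', '3', ',', '0', ',', '30', ',', '3600', ',', '10', ',', '0', ',', '26', ',', '7200']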
dateutil/parser/_parser.py
Outdated
-    def split(cls, s):
-        return list(cls(s))
+    def split(cls, s, join_numeric_separated_by_comma=True):
+        return list(cls(s,join_numeric_separated_by_comma))
flake8 whitespace after comma
dateutil/parser/_parser.py
Outdated
@@ -77,6 +77,7 @@ def __init__(self, instream):
         self.charstack = []
         self.tokenstack = []
         self.eof = False
+        self.join_numeric_separated_by_comma = join_numeric_separated_by_comma
Can you add a comment for why this option exists, namely for the Python logging format?
Is it just the python logging format? I think that it's also for the ISO-8601 format, which technically (though not in practice) prefers comma as the fractional separator.
Oh neat, I had no idea. I guess I've seen that used in conjunction with Euros, e.g. "€ 29.123,45"
Does the ISO usage need to be for any numbers or is it just for fractional seconds? i.e. will the check-for-colons above work?
Technically it's for all fractional time components, but I think basically no one uses anything except fractional seconds. Still, this is technically a valid ISO-8601 datetime:
2014-01-01T00,50 -> datetime(2014, 1, 1, 0, 30)
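Just to spell out the comma-fraction arithmetic (plain datetime math, not dateutil's parser):
>>> from datetime import datetime, timedelta
>>> datetime(2014, 1, 1) + timedelta(hours=0.50)   # "00,50" means half an hour
datetime.datetime(2014, 1, 1, 0, 30)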
Is there a spec that can be linked to describing how tzparser is intended to work?
dateutil/parser/_parser.py
Outdated
            else:
                offattr = "dstoffset"
                res.dstabbr = "".join(l[i:j])
                for ii in range(j):
                    used_tokens[ii] = True
This loop is done in both branches of this if/else. Can you de-indent and just do it once?
@jbrockmendel That's part of the problem. If we were to take …
dateutil/parser/_parser.py
Outdated
@@ -1464,6 +1494,7 @@ def parse(self, tzstr):
         except (IndexError, ValueError, AssertionError):
             return None

+        res.unused_tokens = not {token for token,is_used in zip(l,used_tokens) if not is_used}.issubset({",",":"})
This took some staring to figure out. If nothing else, please break this into a couple of lines starting with unused_tokens = {token for token,is_used in zip(l,used_tokens) if not is_used}.
Related, the name res.unused_tokens suggests that it is list-like, giving the tokens themselves. If this is a bool, maybe something like res.any_unused_tokens?
More broadly, I'd suggest the following: implement used_tokens as used_idxs, mirroring skipped_idxs in _parser._parse. Then each occurrence of used_tokens[i] = True becomes used_idxs.append(i), and {token for token,is_used in zip(l,used_tokens) if not is_used} becomes:
unused_idxs = set(range(len_l)).difference(used_idxs)
unused_tokens = {l[n] for n in unused_idxs}
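A standalone sketch of that suggestion; the token list and consumed indices below are made up for illustration, they are not what the parser actually produces:
>>> l = ['EST', '5', 'EDT', ',', '4', ',', '1', ',', '0']
>>> used_idxs = [0, 1, 2, 4, 6, 8]   # appended as the parser consumes tokens
>>> unused_idxs = set(range(len(l))).difference(used_idxs)
>>> {l[n] for n in unused_idxs}
{','}
>>> not {l[n] for n in unused_idxs}.issubset({",", ":"})   # any_unused_tokens
False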
Minor style comments, generally LGTM conditional upon finding a solution to the _timelex issue.
Is using timelex.split even necessary here? Would just splitting on commas+colons+slashes+whitespace get the job done?
This is kinda what I was thinking. It's a little weird that …
@pganssle @jbrockmendel In 4f9e235 I have implemented a simple regexp lexer that does the trick (and is 35x ± 1.2 times faster), so this leaves … Also …
i += 1
if res.dstabbr:
break
else:
break

if i < len_l:
for j in range(i, len_l):
if l[j] == ';':
Can I eliminate this as a possibility for an alternate separator? It is not being hit by the tests and clearly does not work if you consistently use ";" as a separator. (Line 1380)
I really wish I knew, honestly. :( It's really hard to tell what people are using this for (if they're using it at all).
@@ -1313,7 +1315,8 @@ def __init__(self):

     def parse(self, tzstr):
         res = self._result()
-        l = _timelex.split(tzstr)
+        l = [x for x in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)',tzstr) if x]
If we see something like "foo,:bar" with two of [,:.] in a row, is that indicative of an invalid string?
That is my understanding. The parser will fail in that case. I will push for the tokenizer doing nothing other than tokenizing, and just letting the parser naturally fail to parse a consecutive [,:.].
The reason why I ask is to see if the if x clause is necessary.
I'm not sure I understand this regular expression completely, but I think the only problem with this approach that I see is that any split-based function is going to discard the tokens themselves. Are there really no constraints on what the inputs can be for valid TZ strings?
Also, with the if x filter, how would you distinguish between One,Two and One,,Two?
In [6]: [x for x in re.split(r'[.,;]', 'One,Two') if x]
Out[6]: ['One', 'Two']
In [7]: [x for x in re.split(r'[.,;]', 'One,;Two') if x]
Out[7]: ['One', 'Two']
You forgot the capturing group. If you use the same whole expression with the capturing group included you get:
>>> [x for x in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)','One,Two') if x]
['One', ',', 'Two']
>>> [x for x in re.split(r'([,:.]|[a-zA-Z]+|[0-9]+)','One,;Two') if x]
['One', ',', ';', 'Two']
Oh wow, that's a neat trick. I didn't realize it worked like that. Good to know.
@jbrockmendel Any other comments on this? I'm pretty happy with this solution so far.
@@ -1464,6 +1492,8 @@ def parse(self, tzstr):
         except (IndexError, ValueError, AssertionError):
             return None

+        unused_idxs = set(range(len_l)).difference(used_idxs)
+        res.any_unused_tokens = not {l[n] for n in unused_idxs}.issubset({",",":"})
It's kinda amazing to me that this seems to be universally faster than any(l[n] not in {',', ':'} for n in unused_idxs), even when you pull the {',', ':'} creation out of the loop.
Maybe if you have some absurd number of tokens the set creation portion would dominate and this would get slow.
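A rough micro-benchmark sketch of the two variants being compared; the absolute numbers will vary with machine and token count, and the inputs below are made up:
import timeit

l = ['EST', '5', 'EDT', ',', '4', ',', '1', ',', '0']
unused_idxs = {3, 5, 7}
seps = {',', ':'}

def set_based():
    return not {l[n] for n in unused_idxs}.issubset(seps)

def any_based():
    return any(l[n] not in seps for n in unused_idxs)

print(timeit.timeit(set_based, number=100000))
print(timeit.timeit(any_based, number=100000))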
LGTM
OK, I'll clean this up a bit to my exacting standards that feel too nitpicky to mention in an actual code review and then merge.
@pablogsal I have made some changes, mostly to actually test the behavior of …
I'm just going to go ahead and merge. If there are issues they can be addressed in a separate PR.
This PR fixes #259. As discussed with @pganssle at the dateutil sprint in London, the fix consists of modifying dateutil.parser._parser._tzparser.parse to check whether all tokens generated from the input string have been consumed (except separators, which are assumed to be "," and ":" as specified here). To have some control over (and to help future debugging of) the tokens that are not used, a used_tokens mark is kept internally to track this. To avoid making the parser raise and therefore modifying the current behaviour, a new slotted field was added to dateutil.parser._parser._tzparser._result that indicates whether all tokens have been consumed. This is used in dateutil.tz.tz.tzstr to raise accordingly in the case that this check fails (i.e. not all tokens have been consumed).
8 tests of the test suite for tz.tz.tzstr were failing because they were using invalid strings and the call to tz.tz.tzstr now raises. These tests were using "default" arguments that were intended to exercise the default values of the parser, tzstr and tzrange constructs. To fix these tests, the input strings have been changed to GNU tz-compatible equivalents.
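A quick sketch of the resulting behaviour (the exact exception message may differ between versions; the invalid example is the one from #259):
>>> from dateutil import tz
>>> tz.tzstr("EST5EDT,M3.2.0/2,M11.1.0/2")   # valid GNU TZ string, parses fine
tzstr('EST5EDT,M3.2.0/2,M11.1.0/2')
>>> tz.tzstr("America/New_York")             # not a TZ string, now rejected
Traceback (most recent call last):
  ...
ValueError: unknown string format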