-
Notifications
You must be signed in to change notification settings - Fork 506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement dedicated isoparser #489
Conversation
df12c51
to
48c8ecf
Compare
.travis.yml
Outdated
|
||
install: | ||
- pip install six | ||
- ./ci_tools/retry.sh python updatezinfo.py | ||
|
||
script: | ||
- coverage run --omit=setup.py,dateutil/test/* setup.py test | ||
- coverage run --omit=setup.py,dateutil/test/* -m pytest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's separate this out, as it has caused heisenbugs in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate it out how?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if by "heisenbugs" you mean that issue with your other PR, I think that may have been to do with the yield
method of creating tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate it out how?
I'm suggesting the move to pytest should be done in an independent PR (I think you may have already updated this below)
may have been to do with the yield method of creating tests
Good to know, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, yeah, maybe I'll cherry-pick the pytest commit and rebase this on master after we do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're doing cherry-picking for small fairly-immediate changes, making parser
a directory is worthwhile. It'll make for much smaller diffs in other PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm.. Probably a good idea, but I think I'll just wrap up the feature as is and do one big rebase at the end, since the "move the parser to a folder" wasn't its own commit (stupidly).
I don't think this would be particularly useful. It's hard to directly compare what they are doing with what we are doing, but I don't think it would be terribly difficult to get something as fast as they have if we drop down to C, and they don't actually fully support ISO strings. Consider the following edge cases: import numpy as np
iso_strings = ['2014-W03', '2014-W03-3', # ISO week
'2014-224', # Ordinal day
'2014-01-01T00-12:15', # Timezone aware
'20140301T24:00' # Midnight as 24:00
]
for iso_string in iso_strings:
try:
np.datetime64(iso_string)
print('Sucess: {}'.format(iso_string))
except ValueError:
pass The only one that parses correctly is the timezone aware one, and that's apparently deprecated:
I don't think it should be terribly difficult to eventually add a C implementation here that's quite fast, but even the pure python version I've got so far (with no profiling to look for bottlenecks, mind you), is already faster than |
The items that will be left for a separate PR:
|
Fair enough. I mention it because I've been spending a lot of time recently working on pandas._libs.tslibs so am pretty familiar with it. If we implement something strictly better so that we can take that off their plate, that would be a big win. Also note that pandas altered the numpy version to support tz-aware returns. |
Can you clarify what you mean by offsets and periods? Again my train of thought goes to |
dateutil/parser/isoparser.py
Outdated
__all__ = ["isoparse", "Isoparser"] | ||
|
||
|
||
class Isoparser(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something you expect users to subclass regularly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kinda one of the open questions, honestly. I've been a bit haphazard about the design of this class because I keep vacillating between "users will create an instance" and "users will create a subclass", which is why half the methods are classmethod
and half are instance methods.
I think I'm partially being biased by the parser, where my hope was that downstream users would subclass parser
to allow customization of behavior, but even in the parser, I'm thinking that might not be the right design for this particular sort of task.
After giving it a bit of thought, I believe that since ISO-8601 has a very clear set of options it supports, it might be best if it we refactored this to be more like parser
/parserinfo
, where isoparserinfo
would have some limited feature flags like ordinal_dates
and weeks
(so that there's not this artificial division between "common" and "uncommon"). The main downside to adding a ton of feature flags may be speed, but we'll benchmark that when we come to it.
That said, in terms of subclasses, I think that when #449 is done, I'll probably want an internal subclass that returns a PartialDatetime
instead of datetime
(that way the default_year
logic will be unnecessary).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] but even in the parser, I'm thinking that might not be the right design for this particular sort of task.
Yah, I've gone along with the make-everything-a-method approach since there isn't a ton of downside, but I think YAGNI probably applies unless we have good reason to believe users actually are subclassing.
(This may be a discussion that belongs elsewhere, don't want to derail this PR's discussion)
Especially for e.g. parser._build_tzinfo
, I have a hard time imagining the use case where it is easier to subclass than it is to pass a callable for tzinfos
. So I might start pushing to be more selective about does-this-really-need-to-be-a-method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, no one should be subclassing parser._build_tzinfo
anyway, since it's a private method. I don't see that there's any downside to making these things methods, and it namespaces the methods properly, at the very least.
The reason that my vision for the future of the parser
was inheriting from parser
was that part of the point of breaking things up into small pieces is that it means there's a lot of seams in the code where behavior can be overridden to handle weird non-standard formats. We can discuss this in another location, but the only reason I've soured on that a bit is that I think a lot of the information about parser state is inherently non-local, so the coupling between all these tiny functions is going to necessarily be high (can't know if some 4-digit number is a year, or hour-minute pair without knowing that it's immediately followed by AM/PM, etc).
I'm now thinking that the right thing to do for the flexible parser is to parse to some intermediate representation of the format, then a second "constraint" pass to relieve any tensions (e.g. two possible days - choose the more likely one), then your intermediate representation can be applied to any string (not just the one you used to create it). That would help in the overwhelmingly common case of "I need to parse a bunch of dates in the same format", and it creates two very flexible seam points at which users can make decisions about what gets parsed.
In any case, that doesn't mean I want bare functions sitting around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, that doesn't mean I want bare functions sitting around.
OK, I evidently mis-interpreted your previous comment. Never mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small comments, just got through a third of the review. Will try to get it finished tonight :S
dateutil/parser/__init__.py
Outdated
|
||
__all__ = ['parse', 'parser', 'parserinfo', | ||
'isoparse', 'Isoparser', | ||
'InvalidDatetimeError', 'InvalidDateError', 'InvalidTimeError'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to bring this to this namespace as well, don't you? It is neither a module nor a local variable.
I think doing this will break from dateutil.parser import *
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to add some tests to verify current behaviour on the import *
and general imports before moving it :X
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I'll add that and add some tests.
dateutil/parser/isoparser.py
Outdated
A single character that separates date and time portions | ||
:param default_year: | ||
The default year to be used as the basis for parsing the uncommon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to document the default is the current year
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I put the default year documentation in isoparse
, but I should also have it here.
dateutil/parser/isoparser.py
Outdated
The default year to be used as the basis for parsing the uncommon | ||
no-year date formats. | ||
""" | ||
if (len(sep.encode('utf-8')) != 1 or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might not need to do encode at all.
You can just check the length, and if you are worried on unicode characters you are already using ord
that will give you the unicode point.
>>> ord(u'语')
35821
Raising it as in python2 you will be calling encode in bytestrings (should not cause any issue unless using unicode characters, but quite weird)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good call. I added the ord
check after the encode
portion and didn't realize it obviated the need to encode.
I'm a bit uncomfortable with "2014-224" format. That seems like an easy way to get false-positives. Do you have a mechanism in mind for letting users opt-in/opt-out of accepting these less-common formats? |
@jbrockmendel Yes. Technically this code already has one - if you parse with I think the final interface will be partially driven by how easy it is to make these feature flags performant in both Python and Cython (as I intend for |
That seems like a reasonable approach.
I think this may also bear on my feeling of rushedness mentioned over in the pandas thread. This represents a pretty big shift in terms of build/distribution. That said, the rust bit in particular would be very neat. |
@jbrockmendel To be very clear, I am not planning on releasing a compiled backend as part of the 2.7.0 release, but I have been using this as a test module for doing that. I've implemented parts of the algorithm in Rust and in Cython for some basic profiling and I'm using those results to try to guide what the public part of the API should look like in Python. I foresee a compiled backend release happening in a few stages, either I'll start with a separate With regards to the use of Rust, I haven't got the Rust part binding to Python quite yet, but I'm not optimistic about its performance, at least partially because I don't see an obvious way to have Rust call the C Rust also has a few other strikes against it. First, it will be harder to accept PRs that would have an effect in the Rust code, since contributors will likely find it harder to work with than Cython or C. Second, there's not nearly the same support for Rust<->Python bindings that there is for C<->Python bindings. Third (and this may be similar to the second reason), building this thing cross-platform is going to be hard enough, adding in a dependency on rust/cargo might make things even more complicated. |
From profiling, it seems like one of the biggest slowdowns in parsing time zone aware zones is constructing Edit: Singleton constructor in #504 |
ccd00f5
to
ab6d81f
Compare
For the last bullet point, I think I'm going to drop all class methods and go with all instance methods, then add aliases for Unfortunately, there's still one big work item, which is that the That said, this is becoming a pretty long-lived feature branch. I'm thinking maybe in this branch we'll drop the @jbrockmendel @mariocj89 Thoughts? |
I said several days ago I would review this promptly and then didn't. Double-plus will do so today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, additionally it seems you only hit 97.00% coverage on the new isoparse file, might be good checking what lines were missed (link in codecov is broken)
dateutil/parser/isoparser.py
Outdated
len_str = len(dt_str) | ||
components = [1, 1, 1] | ||
|
||
pos = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove this line or change pos=4 to pos +=4
### | ||
# Invalid ISO strings | ||
@pytest.mark.parametrize('isostr,exception', [ | ||
('201', ValueError), # ISO string too short |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it fails nicely but would be good to get a test case for dates as:
YYYY-MM-D as 2016-12-1
For future development
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think YYYY-MM-D
is a supported format, is it?
Little bit late, but LGTM. I'd like to implement robust profiling before going too much further down this road (i.e. into cython-land). The numpy/pandas iso8601-esque parsers catch "today" and "now". Are those part of the spec? |
Most decidedly not. The spec is about very precise specifications of dates and times. |
Here's a first pass at a dedicated iso parser per #424.
This involves a reorganization of the code for
parser
into its own little submodule, to avoid the explosion ofparser.py
. I think there will likely be some problems for downstream users who are using private interfaces, but I'm not too worried about that.I've got a rough sketch for a Cython version of this that is ~20x faster, but it seems like even this pure-python version beats a
datetime.strptime
call:I may have gone a bit overboard with the test parametrization. More tests need to be added as well - I'm not, for example, testing any failure conditions.
Remaining items to do:
parse_isodate
andparse_isotime
functions.Deferred:
isoparse
.Ping @jbrockmendel @wagner-certat @mariocj89