Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement dedicated isoparser #489

Merged
merged 35 commits into from
Dec 6, 2017
Merged

Implement dedicated isoparser #489

merged 35 commits into from
Dec 6, 2017

Conversation

pganssle
Copy link
Member

@pganssle pganssle commented Nov 1, 2017

Here's a first pass at a dedicated iso parser per #424.

This involves a reorganization of the code for parser into its own little submodule, to avoid the explosion of parser.py. I think there will likely be some problems for downstream users who are using private interfaces, but I'm not too worried about that.

I've got a rough sketch for a Cython version of this that is ~20x faster, but it seems like even this pure-python version beats a datetime.strptime call:

In [1]: from dateutil.parser import isoparse
   ...: from datetime import datetime
   ...: 
   ...: dt = '2014-02-28T14:25:03.446990'
   ...: 

In [2]: %timeit isoparse(dt)
10000 loops, best of 3: 20.7 µs per loop

In [3]: %timeit datetime.strptime(dt, '%Y-%m-%dT%H:%M:%S.%f')
The slowest run took 142.82 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 34 µs per loop

I may have gone a bit overboard with the test parametrization. More tests need to be added as well - I'm not, for example, testing any failure conditions.

Remaining items to do:

  • Explicitly convert to ASCII bytestring (there is no unicode in valid ISO-8601, and this assumption will make the Cython implementation faster)
  • Add tests for the parse_isodate and parse_isotime functions.
  • Decide what to do about the mix of classmethods and methods.

Deferred:

  • Clean up documentation so that the primary interface is more explicitly isoparse.

Ping @jbrockmendel @wagner-certat @mariocj89

Sorry, something went wrong.

@pganssle pganssle added this to the 2.7.0 milestone Nov 1, 2017
@pganssle pganssle changed the title Implement dedicated isoparser [WIP] Implement dedicated isoparser Nov 1, 2017
@pganssle pganssle force-pushed the isoparser branch 4 times, most recently from df12c51 to 48c8ecf Compare November 1, 2017 18:30
@jbrockmendel
Copy link
Contributor

I'll go over this sometime in the next couple days.

If we go down this path (which I'm generally positive on), we should consider adapting the numpy/pandas implementation.

.travis.yml Outdated

install:
- pip install six
- ./ci_tools/retry.sh python updatezinfo.py

script:
- coverage run --omit=setup.py,dateutil/test/* setup.py test
- coverage run --omit=setup.py,dateutil/test/* -m pytest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's separate this out, as it has caused heisenbugs in the past.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate it out how?

Copy link
Member Author

@pganssle pganssle Nov 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if by "heisenbugs" you mean that issue with your other PR, I think that may have been to do with the yield method of creating tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate it out how?

I'm suggesting the move to pytest should be done in an independent PR (I think you may have already updated this below)

may have been to do with the yield method of creating tests

Good to know, thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, yeah, maybe I'll cherry-pick the pytest commit and rebase this on master after we do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're doing cherry-picking for small fairly-immediate changes, making parser a directory is worthwhile. It'll make for much smaller diffs in other PRs.

Copy link
Member Author

@pganssle pganssle Nov 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm.. Probably a good idea, but I think I'll just wrap up the feature as is and do one big rebase at the end, since the "move the parser to a folder" wasn't its own commit (stupidly).

@pganssle
Copy link
Member Author

pganssle commented Nov 1, 2017

If we go down this path (which I'm generally positive on), we should consider adapting the numpy/pandas implementation.

I don't think this would be particularly useful. It's hard to directly compare what they are doing with what we are doing, but I don't think it would be terribly difficult to get something as fast as they have if we drop down to C, and they don't actually fully support ISO strings. Consider the following edge cases:

import numpy as np

iso_strings = ['2014-W03', '2014-W03-3',   # ISO week
               '2014-224',                 # Ordinal day
               '2014-01-01T00-12:15',      # Timezone aware
               '20140301T24:00'            # Midnight as 24:00
               ]

for iso_string in iso_strings:
    try:
        np.datetime64(iso_string)
        print('Sucess: {}'.format(iso_string))
    except ValueError:
        pass

The only one that parses correctly is the timezone aware one, and that's apparently deprecated:

/home/c0zj/anaconda3/bin/ipython:11: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
Sucess: 2014-01-01T00-12:15

I don't think it should be terribly difficult to eventually add a C implementation here that's quite fast, but even the pure python version I've got so far (with no profiling to look for bottlenecks, mind you), is already faster than strftime, which is saying something.

@pganssle
Copy link
Member Author

pganssle commented Nov 1, 2017

The items that will be left for a separate PR:

  1. Add hypothesis tests here (should be easy to test the hypothesis that dates, times and datetimes should be round-trippable with .isoformat())
  2. Add a fast C version for isoparse
  3. Add support for offsets and periods.

@jbrockmendel
Copy link
Contributor

I don't think this would be particularly useful. It's hard to directly compare what they are doing with what we are doing, but I don't think it would be terribly difficult to get something as fast as they have if we drop down to C, and they don't actually fully support ISO strings

Fair enough. I mention it because I've been spending a lot of time recently working on pandas._libs.tslibs so am pretty familiar with it. If we implement something strictly better so that we can take that off their plate, that would be a big win. Also note that pandas altered the numpy version to support tz-aware returns.

@jbrockmendel
Copy link
Contributor

Add support for offsets and periods.

Can you clarify what you mean by offsets and periods? Again my train of thought goes to pd.offsets and pd.Period.

__all__ = ["isoparse", "Isoparser"]


class Isoparser(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something you expect users to subclass regularly?

Copy link
Member Author

@pganssle pganssle Nov 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kinda one of the open questions, honestly. I've been a bit haphazard about the design of this class because I keep vacillating between "users will create an instance" and "users will create a subclass", which is why half the methods are classmethod and half are instance methods.

I think I'm partially being biased by the parser, where my hope was that downstream users would subclass parser to allow customization of behavior, but even in the parser, I'm thinking that might not be the right design for this particular sort of task.

After giving it a bit of thought, I believe that since ISO-8601 has a very clear set of options it supports, it might be best if it we refactored this to be more like parser/parserinfo, where isoparserinfo would have some limited feature flags like ordinal_dates and weeks (so that there's not this artificial division between "common" and "uncommon"). The main downside to adding a ton of feature flags may be speed, but we'll benchmark that when we come to it.

That said, in terms of subclasses, I think that when #449 is done, I'll probably want an internal subclass that returns a PartialDatetime instead of datetime (that way the default_year logic will be unnecessary).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] but even in the parser, I'm thinking that might not be the right design for this particular sort of task.

Yah, I've gone along with the make-everything-a-method approach since there isn't a ton of downside, but I think YAGNI probably applies unless we have good reason to believe users actually are subclassing.

(This may be a discussion that belongs elsewhere, don't want to derail this PR's discussion)

Especially for e.g. parser._build_tzinfo, I have a hard time imagining the use case where it is easier to subclass than it is to pass a callable for tzinfos. So I might start pushing to be more selective about does-this-really-need-to-be-a-method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, no one should be subclassing parser._build_tzinfo anyway, since it's a private method. I don't see that there's any downside to making these things methods, and it namespaces the methods properly, at the very least.

The reason that my vision for the future of the parser was inheriting from parser was that part of the point of breaking things up into small pieces is that it means there's a lot of seams in the code where behavior can be overridden to handle weird non-standard formats. We can discuss this in another location, but the only reason I've soured on that a bit is that I think a lot of the information about parser state is inherently non-local, so the coupling between all these tiny functions is going to necessarily be high (can't know if some 4-digit number is a year, or hour-minute pair without knowing that it's immediately followed by AM/PM, etc).

I'm now thinking that the right thing to do for the flexible parser is to parse to some intermediate representation of the format, then a second "constraint" pass to relieve any tensions (e.g. two possible days - choose the more likely one), then your intermediate representation can be applied to any string (not just the one you used to create it). That would help in the overwhelmingly common case of "I need to parse a bunch of dates in the same format", and it creates two very flexible seam points at which users can make decisions about what gets parsed.

In any case, that doesn't mean I want bare functions sitting around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, that doesn't mean I want bare functions sitting around.

OK, I evidently mis-interpreted your previous comment. Never mind.

Copy link
Member

@mariocj89 mariocj89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments, just got through a third of the review. Will try to get it finished tonight :S


__all__ = ['parse', 'parser', 'parserinfo',
'isoparse', 'Isoparser',
'InvalidDatetimeError', 'InvalidDateError', 'InvalidTimeError']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to bring this to this namespace as well, don't you? It is neither a module nor a local variable.

I think doing this will break from dateutil.parser import *

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to add some tests to verify current behaviour on the import * and general imports before moving it :X

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I'll add that and add some tests.

A single character that separates date and time portions
:param default_year:
The default year to be used as the basis for parsing the uncommon
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to document the default is the current year

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I put the default year documentation in isoparse, but I should also have it here.

The default year to be used as the basis for parsing the uncommon
no-year date formats.
"""
if (len(sep.encode('utf-8')) != 1 or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might not need to do encode at all.

You can just check the length, and if you are worried on unicode characters you are already using ord that will give you the unicode point.

>>> ord(u'语')
35821

Raising it as in python2 you will be calling encode in bytestrings (should not cause any issue unless using unicode characters, but quite weird)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good call. I added the ord check after the encode portion and didn't realize it obviated the need to encode.

@jbrockmendel
Copy link
Contributor

I'm a bit uncomfortable with "2014-224" format. That seems like an easy way to get false-positives. Do you have a mechanism in mind for letting users opt-in/opt-out of accepting these less-common formats?

@pganssle
Copy link
Member Author

pganssle commented Nov 6, 2017

@jbrockmendel Yes. Technically this code already has one - if you parse with common=True, you won't hit the YYYY-?Www-?D?, --MM-DD or YYYY-DDD branches. I'm thinking that "common" and "uncommon" are not descriptive or fine-grained enough, though, so I was thinking of introducing an isoparserinfo class containing feature flags, with some pre-configured constants for the equivalent of common and uncommon.

I think the final interface will be partially driven by how easy it is to make these feature flags performant in both Python and Cython (as I intend for isoparser to be the first module with an API-compatible C, Cython and/or Rust backend).

@jbrockmendel
Copy link
Contributor

That seems like a reasonable approach.

(as I intend for isoparser to be the first module with an API-compatible C, Cython and/or Rust backend).

I think this may also bear on my feeling of rushedness mentioned over in the pandas thread. This represents a pretty big shift in terms of build/distribution.

That said, the rust bit in particular would be very neat.

@pganssle
Copy link
Member Author

pganssle commented Nov 6, 2017

@jbrockmendel To be very clear, I am not planning on releasing a compiled backend as part of the 2.7.0 release, but I have been using this as a test module for doing that. I've implemented parts of the algorithm in Rust and in Cython for some basic profiling and I'm using those results to try to guide what the public part of the API should look like in Python.

I foresee a compiled backend release happening in a few stages, either I'll start with a separate cdateutil package or some sort of build / install options where the compiled backend is entirely optional. The end goal in the near term would be to have one pure python release and then pre-build binaries for a number of architectures, that way anyone on an unsupported architecture could always fall back to the pure python version.

With regards to the use of Rust, I haven't got the Rust part binding to Python quite yet, but I'm not optimistic about its performance, at least partially because I don't see an obvious way to have Rust call the C datetime constructor (there's one version that takes PyObjects, and then another version that takes a bunch of ints - the second version is much faster than the first), and even without that it's not considerably faster than my lightly-optimized Cython implementation.

Rust also has a few other strikes against it. First, it will be harder to accept PRs that would have an effect in the Rust code, since contributors will likely find it harder to work with than Cython or C. Second, there's not nearly the same support for Rust<->Python bindings that there is for C<->Python bindings. Third (and this may be similar to the second reason), building this thing cross-platform is going to be hard enough, adding in a dependency on rust/cargo might make things even more complicated.

@pganssle
Copy link
Member Author

pganssle commented Nov 7, 2017

From profiling, it seems like one of the biggest slowdowns in parsing time zone aware zones is constructing tzoffset. I think making that a singleton will improve that markedly for all real world scenarios.

Edit: Singleton constructor in #504

@pganssle pganssle force-pushed the isoparser branch 2 times, most recently from ccd00f5 to ab6d81f Compare November 10, 2017 23:16
@pganssle
Copy link
Member Author

pganssle commented Nov 10, 2017

For the last bullet point, I think I'm going to drop all class methods and go with all instance methods, then add aliases for parse_isodate, parse_isotime and parse_isotzstr in the main namespace, to start with.

Unfortunately, there's still one big work item, which is that the isoparse_common and isoparse_uncommon distinction is weird and unpleasant to me, so I need to refactor that with feature flags.

That said, this is becoming a pretty long-lived feature branch. I'm thinking maybe in this branch we'll drop the common/uncommon distinction entirely (except as an implementation detail) so that we can go ahead and merge a minimal-features version of this, then add a separate branch for the feature flag refactoring.

@jbrockmendel @mariocj89 Thoughts?

@jbrockmendel
Copy link
Contributor

I said several days ago I would review this promptly and then didn't. Double-plus will do so today.

Copy link
Member

@mariocj89 mariocj89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, additionally it seems you only hit 97.00% coverage on the new isoparse file, might be good checking what lines were missed (link in codecov is broken)

len_str = len(dt_str)
components = [1, 1, 1]

pos = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove this line or change pos=4 to pos +=4

###
# Invalid ISO strings
@pytest.mark.parametrize('isostr,exception', [
('201', ValueError), # ISO string too short
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems it fails nicely but would be good to get a test case for dates as:
YYYY-MM-D as 2016-12-1

For future development

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think YYYY-MM-D is a supported format, is it?

@pganssle pganssle merged commit d207009 into dateutil:master Dec 6, 2017
@pganssle pganssle mentioned this pull request Dec 6, 2017
@jbrockmendel
Copy link
Contributor

Little bit late, but LGTM. I'd like to implement robust profiling before going too much further down this road (i.e. into cython-land).

The numpy/pandas iso8601-esque parsers catch "today" and "now". Are those part of the spec?

@pganssle
Copy link
Member Author

pganssle commented Dec 9, 2017

The numpy/pandas iso8601-esque parsers catch "today" and "now". Are those part of the spec?

Most decidedly not. The spec is about very precise specifications of dates and times. now and today are among the least precise ways to serialize a date and time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants