Implement dedicated isoparser #489

pganssle · 2017-11-01T18:09:18Z

Here's a first pass at a dedicated iso parser per #424.

This involves a reorganization of the code for parser into its own little submodule, to avoid the explosion of parser.py. I think there will likely be some problems for downstream users who are using private interfaces, but I'm not too worried about that.

I've got a rough sketch for a Cython version of this that is ~20x faster, but it seems like even this pure-python version beats a datetime.strptime call:

In [1]: from dateutil.parser import isoparse
   ...: from datetime import datetime
   ...: 
   ...: dt = '2014-02-28T14:25:03.446990'
   ...: 

In [2]: %timeit isoparse(dt)
10000 loops, best of 3: 20.7 µs per loop

In [3]: %timeit datetime.strptime(dt, '%Y-%m-%dT%H:%M:%S.%f')
The slowest run took 142.82 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 34 µs per loop

I may have gone a bit overboard with the test parametrization. More tests need to be added as well - I'm not, for example, testing any failure conditions.

Remaining items to do:

Explicitly convert to ASCII bytestring (there is no unicode in valid ISO-8601, and this assumption will make the Cython implementation faster)
Add tests for the parse_isodate and parse_isotime functions.
Decide what to do about the mix of classmethods and methods.

Deferred:

Clean up documentation so that the primary interface is more explicitly isoparse.

Ping @jbrockmendel @wagner-certat @mariocj89

jbrockmendel · 2017-11-01T18:42:57Z

I'll go over this sometime in the next couple days.

If we go down this path (which I'm generally positive on), we should consider adapting the numpy/pandas implementation.

jbrockmendel · 2017-11-01T18:43:50Z

.travis.yml


 install:
  - pip install six
  - ./ci_tools/retry.sh python updatezinfo.py

 script:
-  - coverage run --omit=setup.py,dateutil/test/* setup.py test
+  - coverage run --omit=setup.py,dateutil/test/* -m pytest


Let's separate this out, as it has caused heisenbugs in the past.

Separate it out how?

Also, if by "heisenbugs" you mean that issue with your other PR, I think that may have been to do with the yield method of creating tests.

Separate it out how?

I'm suggesting the move to pytest should be done in an independent PR (I think you may have already updated this below)

may have been to do with the yield method of creating tests

Good to know, thanks.

OK, yeah, maybe I'll cherry-pick the pytest commit and rebase this on master after we do that.

If you're doing cherry-picking for small fairly-immediate changes, making parser a directory is worthwhile. It'll make for much smaller diffs in other PRs.

Hm.. Probably a good idea, but I think I'll just wrap up the feature as is and do one big rebase at the end, since the "move the parser to a folder" wasn't its own commit (stupidly).

pganssle · 2017-11-01T19:07:11Z

If we go down this path (which I'm generally positive on), we should consider adapting the numpy/pandas implementation.

I don't think this would be particularly useful. It's hard to directly compare what they are doing with what we are doing, but I don't think it would be terribly difficult to get something as fast as they have if we drop down to C, and they don't actually fully support ISO strings. Consider the following edge cases:

import numpy as np

iso_strings = ['2014-W03', '2014-W03-3',   # ISO week
               '2014-224',                 # Ordinal day
               '2014-01-01T00-12:15',      # Timezone aware
               '20140301T24:00'            # Midnight as 24:00
               ]

for iso_string in iso_strings:
    try:
        np.datetime64(iso_string)
        print('Sucess: {}'.format(iso_string))
    except ValueError:
        pass

The only one that parses correctly is the timezone aware one, and that's apparently deprecated:

/home/c0zj/anaconda3/bin/ipython:11: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
Sucess: 2014-01-01T00-12:15

I don't think it should be terribly difficult to eventually add a C implementation here that's quite fast, but even the pure python version I've got so far (with no profiling to look for bottlenecks, mind you), is already faster than strftime, which is saying something.

pganssle · 2017-11-01T21:25:55Z

The items that will be left for a separate PR:

Add hypothesis tests here (should be easy to test the hypothesis that dates, times and datetimes should be round-trippable with .isoformat())
Add a fast C version for isoparse
Add support for offsets and periods.

jbrockmendel · 2017-11-02T00:00:24Z

I don't think this would be particularly useful. It's hard to directly compare what they are doing with what we are doing, but I don't think it would be terribly difficult to get something as fast as they have if we drop down to C, and they don't actually fully support ISO strings

Fair enough. I mention it because I've been spending a lot of time recently working on pandas._libs.tslibs so am pretty familiar with it. If we implement something strictly better so that we can take that off their plate, that would be a big win. Also note that pandas altered the numpy version to support tz-aware returns.

jbrockmendel · 2017-11-02T00:01:56Z

Add support for offsets and periods.

Can you clarify what you mean by offsets and periods? Again my train of thought goes to pd.offsets and pd.Period.

jbrockmendel · 2017-11-02T00:09:49Z

dateutil/parser/isoparser.py

+__all__ = ["isoparse", "Isoparser"]
+
+
+class Isoparser(object):


Is this something you expect users to subclass regularly?

This is kinda one of the open questions, honestly. I've been a bit haphazard about the design of this class because I keep vacillating between "users will create an instance" and "users will create a subclass", which is why half the methods are classmethod and half are instance methods.

I think I'm partially being biased by the parser, where my hope was that downstream users would subclass parser to allow customization of behavior, but even in the parser, I'm thinking that might not be the right design for this particular sort of task.

After giving it a bit of thought, I believe that since ISO-8601 has a very clear set of options it supports, it might be best if it we refactored this to be more like parser/parserinfo, where isoparserinfo would have some limited feature flags like ordinal_dates and weeks (so that there's not this artificial division between "common" and "uncommon"). The main downside to adding a ton of feature flags may be speed, but we'll benchmark that when we come to it.

That said, in terms of subclasses, I think that when #449 is done, I'll probably want an internal subclass that returns a PartialDatetime instead of datetime (that way the default_year logic will be unnecessary).

[...] but even in the parser, I'm thinking that might not be the right design for this particular sort of task.

Yah, I've gone along with the make-everything-a-method approach since there isn't a ton of downside, but I think YAGNI probably applies unless we have good reason to believe users actually are subclassing.

(This may be a discussion that belongs elsewhere, don't want to derail this PR's discussion)

Especially for e.g. parser._build_tzinfo, I have a hard time imagining the use case where it is easier to subclass than it is to pass a callable for tzinfos. So I might start pushing to be more selective about does-this-really-need-to-be-a-method.

Well, no one should be subclassing parser._build_tzinfo anyway, since it's a private method. I don't see that there's any downside to making these things methods, and it namespaces the methods properly, at the very least.

The reason that my vision for the future of the parser was inheriting from parser was that part of the point of breaking things up into small pieces is that it means there's a lot of seams in the code where behavior can be overridden to handle weird non-standard formats. We can discuss this in another location, but the only reason I've soured on that a bit is that I think a lot of the information about parser state is inherently non-local, so the coupling between all these tiny functions is going to necessarily be high (can't know if some 4-digit number is a year, or hour-minute pair without knowing that it's immediately followed by AM/PM, etc).

I'm now thinking that the right thing to do for the flexible parser is to parse to some intermediate representation of the format, then a second "constraint" pass to relieve any tensions (e.g. two possible days - choose the more likely one), then your intermediate representation can be applied to any string (not just the one you used to create it). That would help in the overwhelmingly common case of "I need to parse a bunch of dates in the same format", and it creates two very flexible seam points at which users can make decisions about what gets parsed.

In any case, that doesn't mean I want bare functions sitting around.

In any case, that doesn't mean I want bare functions sitting around.

OK, I evidently mis-interpreted your previous comment. Never mind.

mariocj89

Some small comments, just got through a third of the review. Will try to get it finished tonight :S

mariocj89 · 2017-11-02T09:18:40Z

dateutil/parser/__init__.py

+
+__all__ = ['parse', 'parser', 'parserinfo',
+           'isoparse', 'Isoparser',
+           'InvalidDatetimeError', 'InvalidDateError', 'InvalidTimeError']


You need to bring this to this namespace as well, don't you? It is neither a module nor a local variable.

I think doing this will break from dateutil.parser import *

You might want to add some tests to verify current behaviour on the import * and general imports before moving it :X

Good catch! I'll add that and add some tests.

mariocj89 · 2017-11-02T09:22:07Z

dateutil/parser/isoparser.py

+            A single character that separates date and time portions
+
+        :param default_year:
+            The default year to be used as the basis for parsing the uncommon


You might want to document the default is the current year

Good point. I put the default year documentation in isoparse, but I should also have it here.

mariocj89 · 2017-11-02T09:25:02Z

dateutil/parser/isoparser.py

+            The default year to be used as the basis for parsing the uncommon
+            no-year date formats.
+        """
+        if (len(sep.encode('utf-8')) != 1 or


You might not need to do encode at all.

You can just check the length, and if you are worried on unicode characters you are already using ord that will give you the unicode point.

>>> ord(u'语') 35821

Raising it as in python2 you will be calling encode in bytestrings (should not cause any issue unless using unicode characters, but quite weird)

Ah, good call. I added the ord check after the encode portion and didn't realize it obviated the need to encode.

jbrockmendel · 2017-11-06T17:15:20Z

I'm a bit uncomfortable with "2014-224" format. That seems like an easy way to get false-positives. Do you have a mechanism in mind for letting users opt-in/opt-out of accepting these less-common formats?

pganssle · 2017-11-06T17:22:26Z

@jbrockmendel Yes. Technically this code already has one - if you parse with common=True, you won't hit the YYYY-?Www-?D?, --MM-DD or YYYY-DDD branches. I'm thinking that "common" and "uncommon" are not descriptive or fine-grained enough, though, so I was thinking of introducing an isoparserinfo class containing feature flags, with some pre-configured constants for the equivalent of common and uncommon.

I think the final interface will be partially driven by how easy it is to make these feature flags performant in both Python and Cython (as I intend for isoparser to be the first module with an API-compatible C, Cython and/or Rust backend).

jbrockmendel · 2017-11-06T23:13:06Z

That seems like a reasonable approach.

(as I intend for isoparser to be the first module with an API-compatible C, Cython and/or Rust backend).

I think this may also bear on my feeling of rushedness mentioned over in the pandas thread. This represents a pretty big shift in terms of build/distribution.

That said, the rust bit in particular would be very neat.

pganssle · 2017-11-06T23:48:59Z

@jbrockmendel To be very clear, I am not planning on releasing a compiled backend as part of the 2.7.0 release, but I have been using this as a test module for doing that. I've implemented parts of the algorithm in Rust and in Cython for some basic profiling and I'm using those results to try to guide what the public part of the API should look like in Python.

I foresee a compiled backend release happening in a few stages, either I'll start with a separate cdateutil package or some sort of build / install options where the compiled backend is entirely optional. The end goal in the near term would be to have one pure python release and then pre-build binaries for a number of architectures, that way anyone on an unsupported architecture could always fall back to the pure python version.

With regards to the use of Rust, I haven't got the Rust part binding to Python quite yet, but I'm not optimistic about its performance, at least partially because I don't see an obvious way to have Rust call the C datetime constructor (there's one version that takes PyObjects, and then another version that takes a bunch of ints - the second version is much faster than the first), and even without that it's not considerably faster than my lightly-optimized Cython implementation.

Rust also has a few other strikes against it. First, it will be harder to accept PRs that would have an effect in the Rust code, since contributors will likely find it harder to work with than Cython or C. Second, there's not nearly the same support for Rust<->Python bindings that there is for C<->Python bindings. Third (and this may be similar to the second reason), building this thing cross-platform is going to be hard enough, adding in a dependency on rust/cargo might make things even more complicated.

pganssle · 2017-11-07T16:38:50Z

From profiling, it seems like one of the biggest slowdowns in parsing time zone aware zones is constructing tzoffset. I think making that a singleton will improve that markedly for all real world scenarios.

Edit: Singleton constructor in #504

pganssle · 2017-11-10T23:26:27Z

For the last bullet point, I think I'm going to drop all class methods and go with all instance methods, then add aliases for parse_isodate, parse_isotime and parse_isotzstr in the main namespace, to start with.

Unfortunately, there's still one big work item, which is that the isoparse_common and isoparse_uncommon distinction is weird and unpleasant to me, so I need to refactor that with feature flags.

That said, this is becoming a pretty long-lived feature branch. I'm thinking maybe in this branch we'll drop the common/uncommon distinction entirely (except as an implementation detail) so that we can go ahead and merge a minimal-features version of this, then add a separate branch for the feature flag refactoring.

@jbrockmendel @mariocj89 Thoughts?

jbrockmendel · 2017-11-11T17:23:13Z

I said several days ago I would review this promptly and then didn't. Double-plus will do so today.

mariocj89

LGTM, additionally it seems you only hit 97.00% coverage on the new isoparse file, might be good checking what lines were missed (link in codecov is broken)

mariocj89 · 2017-12-06T14:41:48Z

dateutil/parser/isoparser.py

+        len_str = len(dt_str)
+        components = [1, 1, 1]
+
+        pos = 0


you can remove this line or change pos=4 to pos +=4

mariocj89 · 2017-12-06T15:03:48Z

dateutil/test/test_isoparser.py

+###
+# Invalid ISO strings
+@pytest.mark.parametrize('isostr,exception', [
+    ('201', ValueError),                        # ISO string too short


Seems it fails nicely but would be good to get a test case for dates as:
YYYY-MM-D as 2016-12-1

For future development

I don't think YYYY-MM-D is a supported format, is it?

jbrockmendel · 2017-12-08T20:01:08Z

Little bit late, but LGTM. I'd like to implement robust profiling before going too much further down this road (i.e. into cython-land).

The numpy/pandas iso8601-esque parsers catch "today" and "now". Are those part of the spec?

pganssle · 2017-12-09T00:07:41Z

The numpy/pandas iso8601-esque parsers catch "today" and "now". Are those part of the spec?

Most decidedly not. The spec is about very precise specifications of dates and times. now and today are among the least precise ways to serialize a date and time.

pganssle added enhancement parser labels Nov 1, 2017

pganssle added this to the 2.7.0 milestone Nov 1, 2017

pganssle changed the title ~~Implement dedicated isoparser~~ [WIP] Implement dedicated isoparser Nov 1, 2017

pganssle force-pushed the isoparser branch 4 times, most recently from df12c51 to 48c8ecf Compare November 1, 2017 18:30

jbrockmendel reviewed Nov 1, 2017

View reviewed changes

pganssle force-pushed the isoparser branch from 536fb67 to d5ac311 Compare November 1, 2017 21:13

jbrockmendel reviewed Nov 2, 2017

View reviewed changes

mariocj89 reviewed Nov 2, 2017

View reviewed changes

pganssle mentioned this pull request Nov 6, 2017

Forthcoming dateutil code reorganization pandas-dev/pandas#18141

Closed

pganssle force-pushed the isoparser branch from d5ac311 to ef28395 Compare November 6, 2017 16:06

pganssle mentioned this pull request Nov 7, 2017

Request the parser to stick to ISO 8601 #503

Closed

pganssle force-pushed the isoparser branch from ef28395 to 37bd358 Compare November 7, 2017 15:38

pganssle force-pushed the isoparser branch 2 times, most recently from ccd00f5 to ab6d81f Compare November 10, 2017 23:16

pganssle added 19 commits December 6, 2017 12:43

Make time zone string parser more strict

561d31c

Add test for Isoparser constructor failures

6ab81f3

Add test for too-short ISO string

c8964ad

Make restrictions on sep stricter

396b898

Add decorator to coerce to ASCII

a06ab96

Refactor a public interface out of _parse_tzstr

b8a1a83

Put @_takes_ascii decorator after @classmethod

e49b89c

Add test for zero_as_utc behavior

0f14ddd

Fix byte slicing behavior for 2 and 3

93cdfaf

Ensure parse_isodate consumes entire string

3441094

Add tests for parse_isodate

284c1a8

Add exception on too-short ISO times

52e7294

Add tests for parse_isotime

5d63292

Drop pre-1900 date test on Python 2

211217b

Drop all classmethods in favor of instance methods

d65738c

Rename Isoparser to isoparser

afd05a4

Drop common_only option for initial API

5bf007f

Drop support for no-year ISO datetimes

fcc6160

Add xfail test for sep as bytestring

73657fa

pganssle force-pushed the isoparser branch from dca30c2 to 73657fa Compare December 6, 2017 12:47

mariocj89 reviewed Dec 6, 2017

View reviewed changes

pganssle added 3 commits December 6, 2017 15:33

Isoparser style tweaks

a3bd128

Improve test coverage

fd2d201

Fix unreachable code and loose validation

f8f802b

pganssle merged commit d207009 into dateutil:master Dec 6, 2017

pganssle mentioned this pull request Dec 6, 2017

Add isoparse() function. #424

Closed

pganssle mentioned this pull request Mar 11, 2018

Release 2.7.0 #627

Closed

ffe4 mentioned this pull request Aug 27, 2020

Incorrect value parsed as ISO date #1074

Open

Implement dedicated isoparser #489

Implement dedicated isoparser #489

Conversation

pganssle commented Nov 1, 2017 • edited Loading

jbrockmendel commented Nov 1, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pganssle Nov 1, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pganssle Nov 6, 2017 • edited Loading

Choose a reason for hiding this comment

pganssle commented Nov 1, 2017

pganssle commented Nov 1, 2017

jbrockmendel commented Nov 2, 2017

jbrockmendel commented Nov 2, 2017

Choose a reason for hiding this comment

pganssle Nov 2, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mariocj89 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbrockmendel commented Nov 6, 2017

pganssle commented Nov 6, 2017

jbrockmendel commented Nov 6, 2017

pganssle commented Nov 6, 2017

pganssle commented Nov 7, 2017 • edited Loading

pganssle commented Nov 10, 2017 • edited Loading

jbrockmendel commented Nov 11, 2017

mariocj89 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbrockmendel commented Dec 8, 2017

pganssle commented Dec 9, 2017

pganssle commented Nov 1, 2017 •

edited

Loading

pganssle Nov 1, 2017 •

edited

Loading

pganssle Nov 6, 2017 •

edited

Loading

pganssle Nov 2, 2017 •

edited

Loading

pganssle commented Nov 7, 2017 •

edited

Loading

pganssle commented Nov 10, 2017 •

edited

Loading