ENH: Infer best datetime format from a sample #52626

LeoGrin · 2023-04-12T13:52:46Z

closes ENH: Try both dayfirst and monthfirst in pd.to_datetime #52508
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v2.1.0.rst file if fixing a bug or adding a new feature.

Summary

pd.to_datetime now infers the datetime format by considering
a random sample of the strings (instead of the first non-null value)
and picking the format which works for the most strings. If several
formats work equally well, the one which matches the dayfirst parameter is returned. If
format="mixed", pandas does the same thing, then tries the second-best format on the
strings which failed to parse with the best format, and so on (instead of parsing each row
independently) (#52508).
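
For readers who want the selection rule spelled out, here is a minimal sketch of the logic described above, written against the standard library rather than pandas internals. The names `CANDIDATES`, `try_parse`, `rank_formats` and `parse_iteratively` are hypothetical and are not taken from the PR. The two-entry candidate list is an assumption made for illustration, and the sketch ranks formats over all values instead of a sample.

```python
from datetime import datetime

# Hypothetical candidate formats, for illustration only.
CANDIDATES = ["%m-%d-%Y", "%d-%m-%Y"]


def try_parse(value, fmt):
    """Return a datetime if `value` matches `fmt`, else None."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


def rank_formats(values, dayfirst=False):
    """Order candidates by match count; ties go to the format agreeing with dayfirst."""
    def key(fmt):
        n_matches = sum(try_parse(v, fmt) is not None for v in values)
        agrees_with_dayfirst = fmt.startswith("%d") == dayfirst
        return (n_matches, agrees_with_dayfirst)

    return sorted(CANDIDATES, key=key, reverse=True)


def parse_iteratively(values, dayfirst=False):
    """Parse with the best format, then retry only the failures with the next best."""
    results = [None] * len(values)
    for fmt in rank_formats(values, dayfirst):
        for i, value in enumerate(values):
            if results[i] is None:
                results[i] = try_parse(value, fmt)
    return results


values = ["01-02-2012", "01-03-2012", "30-01-2012"]
print(rank_formats(values))  # ['%d-%m-%Y', '%m-%d-%Y']
```

Under these assumptions, `"30-01-2012"` only parses as day-first, so `%d-%m-%Y` matches all three strings and wins even though `dayfirst=False`, which is the situation in which the PR emits its warning.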

Previous behavior:

    In [1]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"])
    Out[1]:
    ValueError: time data "30-01-2012" doesn't match format "%m-%d-%Y", at position 2. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
    In [2]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"], errors="coerce")
    Out[2]:
    DatetimeIndex(['2012-01-02', '2012-01-03', 'NaT'], dtype='datetime64[ns]', freq=None)
    In [3]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"], format="mixed")
    Out[3]:
    DatetimeIndex(['2012-01-02', '2012-01-03', '2012-01-30'], dtype='datetime64[ns]', freq=None)

New behavior:

    In [1]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"])
    Out[1]:
    UserWarning: Parsing dates in %d-%m-%Y format when dayfirst=False was specified.
    Pass `dayfirst=True` or specify a format to silence this warning.
    DatetimeIndex(['2012-02-01', '2012-03-01', '2012-01-30'], dtype='datetime64[ns]',
    freq=None)
    In [2]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"], errors="coerce")
    Out[2]:
    UserWarning: Parsing dates in %d-%m-%Y format when dayfirst=False was specified. Pass `dayfirst=True` or specify a format to silence this warning.
    DatetimeIndex(['2012-02-01', '2012-03-01', '2012-01-30'], dtype='datetime64[ns]', freq=None)
    In [3]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"], format="mixed")
    Out[3]:
    DatetimeIndex(['2012-02-01', '2012-03-01', '2012-01-30'], dtype='datetime64[ns]', freq=None)

Design questions:

Do we bypass the dayfirst parameter too often? Right now, we always prefer the format which matches the most strings, and only consider dayfirst if several formats match equally many strings (and raise a warning if we contradict dayfirst). For errors="raise" it makes sense (we only select a format if it matches all non-null values), but I'm wondering if it's a problem for errors="coerce" or errors="ignore": in this case, if dayfirst=True but %m-%d-%Y matches more values than %d-%m-%Y, we select %m-%d-%Y (and raise a warning). I think it's okay but it may be somewhat confusing for users. Some alternative ideas:
- Have dayfirst=None (instead of False) as the default in pd.to_datetime, and respect dayfirst if it's provided.
- Have some threshold of match percentage to be able to override dayfirst: for instance if dayfirst=True, %m-%d-%Y needs to match >10% more values than %d-%m-%Y to be chosen.
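
The threshold idea in the second bullet could look roughly like the sketch below. This is only an illustration under stated assumptions: `match_fraction` and `choose_format` are hypothetical names, "10% more values" is read as a difference of 10 percentage points in match rate, and only the two formats from the discussion are considered.

```python
from datetime import datetime


def match_fraction(values, fmt):
    """Fraction of `values` that parse under `fmt`."""
    def matches(value):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            return False

    return sum(matches(v) for v in values) / len(values)


def choose_format(values, dayfirst, margin=0.10):
    """Keep the dayfirst-consistent format unless the other one wins by more than `margin`."""
    if dayfirst:
        preferred, other = "%d-%m-%Y", "%m-%d-%Y"
    else:
        preferred, other = "%m-%d-%Y", "%d-%m-%Y"
    if not values:
        return preferred
    if match_fraction(values, other) - match_fraction(values, preferred) > margin:
        return other
    return preferred
```

With `dayfirst=True`, a single month-first-only string among twenty would not be enough to override the user's choice, whereas a half-and-half split would be.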

MarcoGorelli

thanks for working on this

couple of initial comments:

this looks very complicated, is a simpler solution possible?
this should be deterministic, and shouldn't depend on the result of np.random - can we use, say, the first 10 non-null elements? Or 10 equally spaced elements?
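
A deterministic sample of the kind suggested here might be taken as in the sketch below. It is an illustration only: `evenly_spaced_non_null` is a hypothetical helper, and it treats only `None` as null, whereas pandas also has to handle NaN and NaT.

```python
def evenly_spaced_non_null(values, n=10):
    """Return up to `n` non-null values at evenly spaced positions, with no randomness."""
    non_null = [v for v in values if v is not None]
    if len(non_null) <= n or n < 2:
        return non_null[:n]
    last = len(non_null) - 1
    # Integer arithmetic keeps the chosen positions reproducible across runs.
    return [non_null[i * last // (n - 1)] for i in range(n)]
```

Because the positions depend only on the length of the input, two calls on the same data always return the same sample, and the first and last non-null values are always included when there are more than `n` of them.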

LeoGrin · 2023-04-12T15:01:41Z

Thanks for the quick reply!

this looks very complicated, is a simpler solution possible?

Probably, I'll try to make it simpler.

this should be deterministic, and shouldn't depend on the result of np.random - can we use, say, the first 10 non-null elements? Or 10 equally spaced elements?

Will fix

LeoGrin · 2023-04-24T13:08:41Z

@MarcoGorelli this is ready for review (don't know if I should tag or just wait)
The failing build seems to be related to #52853, not related to the PR.

MarcoGorelli

thanks for updating!

this still looks quite complicated - this is going to have to be maintained long-term by multiple people and so I'm a bit hesitant to add a lot of code. Is there a simpler solution which would still improve datetime format inference?

MarcoGorelli · 2023-04-24T13:25:48Z

doc/source/whatsnew/v0.19.0.rst

 .. ipython:: python
+   :okwarning:

   pd.to_datetime([1, "foo"], errors="coerce")


why is this needed?

LeoGrin · 2023-04-24

Now the warning UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please... is raised when any value is a string (if no format was found). Before, it was raised when the first non-null value was a string, so wouldn't be raised in this example, but would be raised on pd.to_datetime(["foo", 1], errors="coerce") for instance.

LeoGrin · 2023-04-24T14:36:17Z

Thanks for the reply!

this still looks quite complicated - this is going to have to be maintained long-term by multiple people and so I'm a bit hesitant to add a lot of code. Is there a simpler solution which would still improve datetime format inference?

I understand the concern, but I'm not sure how to make it simpler. One thing I can easily remove, and which is perhaps a bit complicated, is the _iterative_conversion part used when format="mixed": maybe for now we could directly fall back to dateutil when format="mixed" is provided? This wouldn't fix the issue where pandas parses with two different date formats when a single one would work (e.g. pd.to_datetime(["01-02-2012", "13-01-2012"], format="mixed") --> DatetimeIndex(['2012-01-02', '2012-01-13'], dtype='datetime64[ns]', freq=None)) but would make the code simpler.

MarcoGorelli · 2023-05-02T13:19:42Z

maybe for now we could directly fall back to dateutil when format="mixed" is provided?

yeah format='mixed' means that no format inference should take place

github-actions · 2023-06-02T00:05:26Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

MarcoGorelli · 2023-06-02T12:24:33Z

Thanks for your PR

I'm afraid this introduces too much complexity, sorry. It's going to be too hard to maintain, and there are already very few people comfortable with this part of the codebase.

If you can find a simpler solution which improves format inference without greatly increasing the maintenance burden, then we can take it. For example, just trying 5 elements and taking a majority vote would probably be an improvement and still be quite simple.
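
As a rough illustration of how small the majority-vote idea could be, here is a sketch using only the standard library. `guess_one` and `majority_format` are hypothetical names, the two candidate formats are assumed for the example, and pandas' real per-element guesser is considerably more general.

```python
from collections import Counter
from datetime import datetime


def guess_one(value, dayfirst=False):
    """Guess a format for one string, trying the dayfirst-consistent format first."""
    if dayfirst:
        candidates = ["%d-%m-%Y", "%m-%d-%Y"]
    else:
        candidates = ["%m-%d-%Y", "%d-%m-%Y"]
    for fmt in candidates:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


def majority_format(values, n=5, dayfirst=False):
    """Guess a format for the first `n` strings and return the most common guess."""
    sample = [v for v in values if isinstance(v, str)][:n]
    guesses = (guess_one(v, dayfirst) for v in sample)
    votes = Counter(g for g in guesses if g is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


print(majority_format(["01-02-2012", "13-01-2012", "14-01-2012"]))  # %d-%m-%Y
```

Note that ambiguous strings vote according to `dayfirst`, so a vote over mostly ambiguous elements can still pick a format that fails later in the array; the example from the PR description (`["01-02-2012", "01-03-2012", "30-01-2012"]`) is such a case in this sketch.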

LeoGrin added 6 commits April 12, 2023 14:02

All tests pass

fd7a534

Update changelog

93f9c7a

Add missing type hints

c37d40f

Cleaning

cbb5e0d

Typo

81664a2

comment change

ef33ba0

LeoGrin marked this pull request as draft April 12, 2023 14:20

MarcoGorelli requested changes Apr 12, 2023

LeoGrin added 2 commits April 13, 2023 13:07

simplification

e1652f1

remove randomness

6b371ca

LeoGrin changed the title ~~ENH: Infer best datetime format from a random sample~~ ENH: Infer best datetime format from a sample Apr 13, 2023

LeoGrin and others added 15 commits April 14, 2023 10:08

fix parser tests

705d1b4

Merge branch 'main' into datetime_format_inference_test

27e39f3

simplify getting evenly spaced non null

0bae15d

update io readme

de7331f

revert changed tests

9136b4f

fix type hints

9f966d5

Merge branch 'main' of https://github.com/pandas-dev/pandas into date…

be9e27a

…time_format_inference_test

Merge branch 'datetime_format_inference_test' of https://github.com/L…

e5e3cb3

…eoGrin/pandas into datetime_format_inference_test

fix type hints for np.unique

7ca7244

remove prints

4b81192

fix doc

001a270

fix example with febuary 30th

fe99f83

Merge branch 'main' into datetime_format_inference_test

1d5b6d1

fix doc

8de90e4

Merge branch 'datetime_format_inference_test' of https://github.com/L…

0b5ec7d

…eoGrin/pandas into datetime_format_inference_test

LeoGrin marked this pull request as ready for review April 14, 2023 20:17

LeoGrin added 2 commits April 14, 2023 22:17

Merge branch 'main' into datetime_format_inference_test

544aade

Merge branch 'main' into datetime_format_inference_test

a236ba9

LeoGrin and others added 20 commits April 23, 2023 22:24

Merge branch 'main' into datetime_format_inference_test

47fe413

All tests pass

281d45b

Update changelog

51d9d98

Add missing type hints

1d7df6e

Cleaning

e6cf3ad

Typo

6998bf8

comment change

f98ea1f

simplification

86aa61c

remove randomness

8c6401b

fix parser tests

28cf679

simplify getting evenly spaced non null

a22114c

update io readme

75bb8f6

revert changed tests

6f155b5

fix type hints

2b2648e

fix type hints for np.unique

3f02e0a

remove prints

feaa7a3

fix doc

60148b1

fix example with febuary 30th

6622eba

fix doc

23b28b9

Merge branch 'main' into datetime_format_inference_test

5cbfb2f

MarcoGorelli requested changes Apr 24, 2023

LeoGrin added 2 commits April 24, 2023 16:24

check if any str at the beginning of _guess_datetime_format_for_array

5422bfa

check if any str at the beginning of _guess_datetime_format_for_array

3aa3cde

github-actions bot added the Stale label Jun 2, 2023

MarcoGorelli closed this Jun 2, 2023
