
PERF: Rely on C-level str conversions in loadtxt for up to 2x speedup #19687

Closed
wants to merge 4 commits

Conversation

anntzer
Contributor

@anntzer anntzer commented Aug 17, 2021

This PR builds on top of #19618 (to avoid a tricky rebase) and #19042 (which this PR would have fixed more or less naturally anyway, so I may as well credit @DFEvans for the test he wrote in that PR). (#19618 has now been merged, so this is now ready for review.)

The general idea is as follows:

  • First, we treat each row of loadtxt's input as a single item of a structured dtype with no nested structured dtypes, with as many fields as needed. If loadtxt was given a scalar dtype, the structured dtype is constructed by creating as many fields (each with that scalar dtype) as there are columns; if a structured dtype was requested, we first flatten the dtype (as explained in #19623, "genfromtxt fails when a non-contiguous dtype is requested", but correctly taking offsets into account) and repeat the fields as needed. (Note that this also fixes a previous bug, whereby loading e.g. "0 1 2 3 4\n5 6 7 8 9" with dtype=[("a", int), ("b", int)] would return [[(0, 1), (2, 3)], [(5, 6), (7, 8)]] and silently drop the last column -- the old behavior seems clearly buggy.)
    Once the whole array is read, we then .view() back to the actually requested dtype. This implies an extraneous copy if the requested dtype has .hasobject = True (which would be fixed by #8514, "ENH: Make it possible to call .view on object arrays"); I believe that that case is rare enough to be ignored for now (and a fix is possible anyways).
    In itself, this is much faster (~30%) for loading actual structured dtypes (by skipping the recursive packer), somewhat faster (~5-10%) for large loads (>10_000 rows, perhaps because shape inference of the final array is faster?), and much slower (nearly 2x) for very small loads (10 rows) or for reads using dtype=object; however, the main point is to allow the next points.

  • Then, we take advantage of the possibility of assigning a tuple of strs to a structured dtype with e.g. float fields, and have the strs be implicitly converted to floats by numpy at the C-level. (A Python-level fallback is kept to support e.g. hex floats.) Together with the previous commit, this provides a massive speedup (~2x on the loadtxt_dtypes_csv benchmark for 10_000+ ints or floats), but is beneficial with as little as 100 rows. Very small reads (10 rows) are still slower (nearly 2x for object), as well as reads using object dtypes (still due to the extra copy), but the tradeoff seems, again, worthwhile.

  • Finally, using structured dtypes provides a small extra advantage, in that they implicitly check the number of fields in the input, and thus allow skipping the len(words) == ncols check; even that is a ~5% speedup for the largest loads (100_000 rows) of numeric scalar types.
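The mechanism described in the bullets above can be sketched roughly as follows (a standalone illustration under stated assumptions, not the PR's actual code; the variable names are hypothetical): each parsed row, as a tuple of strs, is assigned into a structured array so that NumPy converts the strs to the field dtypes at the C level, a mismatched field count raises on its own, and the result is viewed back as the requested scalar dtype.

```python
import numpy as np

# Hedged sketch of the strategy, not the actual loadtxt internals:
# build a one-field-per-column structured dtype, assign rows as
# tuples of strs (NumPy casts str -> float at the C level), then
# view the result back as the requested scalar dtype.
lines = ["1.5 2.5", "3.0 4.0"]
ncols = len(lines[0].split())
row_dtype = np.dtype([(f"f{i}", np.float64) for i in range(ncols)])

out = np.empty(len(lines), dtype=row_dtype)
for i, line in enumerate(lines):
    out[i] = tuple(line.split())  # strs cast to floats by the dtype machinery

# View back as a plain float array of shape (nrows, ncols).
result = out.view(np.float64).reshape(len(lines), ncols)

# The structured assignment also checks the field count implicitly:
# a row with the wrong number of columns raises, so no explicit
# len(words) == ncols check is needed.
try:
    out[0] = ("1.0",)
except ValueError:
    pass  # rejected: tuple length does not match the field count

# The C-level cast does not understand every format Python can, so a
# Python-level fallback stays around for cases such as hex floats:
assert float.fromhex("0x1.8p1") == 3.0
```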

Overall, the benchmarks (compared to #19618) shown below indicate a >2x speedup for large reads of simple numeric types, and a slowdown of very small reads (~1.5x for 10 rows) or reads of object arrays (~10% for large reads, ~2x for small ones). But small reads are fast anyway and reading into object arrays is, again, likely rare.

       before           after         ratio
     [45f9118f]       [4b7b0f46]
     <loadtxtusecols>       <_wip/loadtxtflatdtype>
+      26.8±0.2μs       51.8±0.2μs     1.93  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 10)
+      29.7±0.2μs       50.6±0.3μs     1.71  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 10)
+      29.2±0.2μs       46.2±0.2μs     1.58  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 10)
+      30.3±0.2μs       47.3±0.3μs     1.56  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10)
+      29.3±0.2μs       45.6±0.2μs     1.56  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 10)
+      29.2±0.1μs       45.5±0.1μs     1.56  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(10)
+      30.2±0.2μs       44.6±0.4μs     1.47  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 10)
+      32.2±0.6μs      44.7±0.09μs     1.39  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10)
+       134±0.3μs          169±1μs     1.26  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 100)
+      63.9±0.6μs       77.4±0.3μs     1.21  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(20)
+       159±0.6μs          177±1μs     1.12  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 100)
+       134±0.5ms        149±0.4ms     1.11  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 100000)
+     12.6±0.04ms       13.5±0.2ms     1.07  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 10000)
-         435±2μs          374±2μs     0.86  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(200)
-         592±2μs        495±0.9μs     0.84  bench_io.LoadtxtReadUint64Integers.time_read_uint64(550)
-         591±3μs          493±3μs     0.84  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(550)
-     5.48±0.04ms      4.51±0.02ms     0.82  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv(2)
-     1.06±0.01ms          870±6μs     0.82  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(1000)
-     1.06±0.01ms          868±7μs     0.82  bench_io.LoadtxtReadUint64Integers.time_read_uint64(1000)
-     10.4±0.04ms      8.38±0.09ms     0.80  bench_io.LoadtxtReadUint64Integers.time_read_uint64(10000)
-     4.16±0.02ms      3.33±0.01ms     0.80  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(2000)
-      42.1±0.2ms       33.6±0.1ms     0.80  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(20000)
-      10.6±0.1ms      8.31±0.01ms     0.78  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(10000)
-         168±1μs          130±2μs     0.78  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100)
-       158±0.9μs          120±1μs     0.76  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(100)
-       159±0.6μs          119±1μs     0.75  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 100)
-       160±0.7μs          119±1μs     0.74  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 100)
-         198±2ms          139±3ms     0.70  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(10000)
-         220±3ms          153±3ms     0.70  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(0)
-       154±0.3ms        107±0.6ms     0.69  bench_io.LoadtxtCSVStructured.time_loadtxt_csv_struct_dtype
-         219±3ms          152±3ms     0.69  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(500)
-       168±0.6μs          114±1μs     0.68  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100)
-     7.30±0.06ms      4.66±0.02ms     0.64  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv([1, 3])
-     15.3±0.09ms       9.68±0.2ms     0.63  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10000)
-     8.83±0.05ms      5.58±0.03ms     0.63  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv([1, 3, 5, 7])
-       161±0.6ms         94.9±2ms     0.59  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100000)
-         193±5μs        113±0.8μs     0.59  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100)
-      14.7±0.1ms      8.43±0.06ms     0.57  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(10000)
-      14.6±0.1ms      8.36±0.02ms     0.57  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 10000)
-     14.6±0.07ms      8.27±0.06ms     0.57  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 10000)
-       155±0.8ms       82.8±0.3ms     0.53  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 100000)
-       156±0.9ms         82.9±1ms     0.53  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(100000)
-       156±0.4ms       82.7±0.6ms     0.53  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 100000)
-     15.4±0.05ms      7.85±0.07ms     0.51  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 10000)
-       161±0.5ms       79.2±0.5ms     0.49  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100000)
-      18.1±0.2ms      7.88±0.07ms     0.44  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10000)
-       188±0.5ms       78.4±0.9ms     0.42  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100000)

Still, if one includes the earlier speedups to loadtxt that I posted recently, even these cases are faster than previously (see below), so I'll apply the credit of these earlier PRs towards this one :)

       before           after         ratio
     [a1ee7968]       [4b7b0f46]
     <_pushme/loadtxtlencheck~9>       <_wip/loadtxtflatdtype>
-      55.2±0.4μs       50.6±0.4μs     0.92  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 10)
-      51.6±0.1μs       45.1±0.2μs     0.87  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 10)
-      53.3±0.6μs       46.5±0.4μs     0.87  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 10)
-      52.7±0.3μs       45.6±0.3μs     0.87  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(10)
-      53.5±0.2μs       45.9±0.5μs     0.86  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 10)
-      57.5±0.3μs       47.5±0.4μs     0.83  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10)
-      93.4±0.6μs         77.0±1μs     0.82  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(20)
-      54.9±0.2μs       44.8±0.2μs     0.82  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10)
-         686±2μs          379±2μs     0.55  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(200)
-         321±1μs          173±2μs     0.54  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 100)
-         952±4μs          495±4μs     0.52  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(550)
-         347±2μs          180±2μs     0.52  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 100)
-     6.51±0.01ms      3.33±0.01ms     0.51  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(2000)
-         960±5μs          491±3μs     0.51  bench_io.LoadtxtReadUint64Integers.time_read_uint64(550)
-      66.1±0.3ms       33.6±0.2ms     0.51  bench_io.LoadtxtCSVDateTime.time_loadtxt_csv_datetime(20000)
-     1.72±0.01ms          857±6μs     0.50  bench_io.LoadtxtReadUint64Integers.time_read_uint64(1000)
-     9.23±0.04ms      4.58±0.04ms     0.50  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv(2)
-     1.74±0.01ms          858±9μs     0.49  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(1000)
-     17.0±0.03ms      8.22±0.05ms     0.48  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(10000)
-     17.0±0.09ms      8.23±0.07ms     0.48  bench_io.LoadtxtReadUint64Integers.time_read_uint64(10000)
-       313±0.8ms          150±3ms     0.48  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 100000)
-      30.1±0.2ms       13.7±0.3ms     0.46  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('object', 10000)
-         330±1ms          150±3ms     0.46  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 100000)
-      32.7±0.3ms       14.4±0.2ms     0.44  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('str', 10000)
-         259±1ms        107±0.4ms     0.41  bench_io.LoadtxtCSVStructured.time_loadtxt_csv_struct_dtype
-         363±1ms        136±0.8ms     0.37  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(10000)
-         399±2ms        148±0.2ms     0.37  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(500)
-         404±3ms        149±0.6ms     0.37  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(0)
-       335±0.5μs          119±2μs     0.36  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(100)
-         334±3μs        118±0.6μs     0.35  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 100)
-         321±2μs          114±2μs     0.35  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100)
-         336±3μs          118±1μs     0.35  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 100)
-         374±5μs        127±0.5μs     0.34  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100)
-       352±0.5μs        113±0.9μs     0.32  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100)
-     18.4±0.05ms      5.58±0.08ms     0.30  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv([1, 3, 5, 7])
-     16.0±0.08ms      4.67±0.06ms     0.29  bench_io.LoadtxtUseColsCSV.time_loadtxt_usecols_csv([1, 3])
-      31.6±0.2ms       8.48±0.1ms     0.27  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(10000)
-      31.2±0.2ms      8.33±0.09ms     0.27  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 10000)
-      31.4±0.2ms      8.29±0.09ms     0.26  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 10000)
-     35.8±0.04ms       9.36±0.2ms     0.26  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10000)
-      30.5±0.1ms       7.96±0.1ms     0.26  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 10000)
-         325±1ms         83.3±2ms     0.26  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(100000)
-       363±0.9ms         92.8±1ms     0.26  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100000)
-         311±2ms         79.3±2ms     0.25  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100000)
-         324±1ms       82.4±0.7ms     0.25  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 100000)
-       325±0.4ms       82.5±0.5ms     0.25  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 100000)
-      33.2±0.2ms      7.97±0.08ms     0.24  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10000)
-       343±0.7ms         78.8±2ms     0.23  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100000)

@anntzer anntzer force-pushed the loadtxtflatdtype branch 3 times, most recently from af33ee7 to c98e34d Compare August 17, 2021 20:46
@charris charris changed the title PERF: Rely on C-level str conversions in loadtxt, for an up to 2x speedup MAINT: Rely on C-level str conversions in loadtxt for an up to 2x speedup Aug 18, 2021
@charris charris changed the title MAINT: Rely on C-level str conversions in loadtxt for an up to 2x speedup MAINT: Rely on C-level str conversions in loadtxt for up to 2x speedup Aug 18, 2021
@anntzer anntzer changed the title MAINT: Rely on C-level str conversions in loadtxt for up to 2x speedup PERF: Rely on C-level str conversions in loadtxt for up to 2x speedup Aug 18, 2021
@anntzer anntzer force-pushed the loadtxtflatdtype branch 3 times, most recently from bc2e615 to df5ee9f Compare August 23, 2021 04:09
]
# These converters only ever get str (not bytes) as input.
_CONVERTER_DICT = {
np.bool_: int, # Implicitly converted to bool.
Member

Is this correct? Booleans are only allowed values of 0 or 1.

Contributor Author

The point is that we only need to cast the str to an int, and then we can let the dtype machinery do the int->np.bool_ cast (and np.bool_(42) == np.bool_(True)).
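As a quick standalone illustration of that point (a sketch, not the PR's code): the converter only needs to parse the str to an int, and assigning that int into a bool array lets NumPy's dtype machinery perform the int -> np.bool_ cast, with any nonzero value mapping to True.

```python
import numpy as np

# Sketch of the converter strategy described above: parse each str
# token as an int and let NumPy's int -> np.bool_ cast do the rest
# (any nonzero value becomes True).
tokens = ["0", "1", "42"]
out = np.empty(len(tokens), dtype=np.bool_)
for i, tok in enumerate(tokens):
    out[i] = int(tok)
# out now holds [False, True, True]
```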

DFEvans and others added 4 commits August 26, 2021 16:20
Closes numpy#17277. If loadtxt is passed an unsized string or byte dtype,
the size is set automatically from the longest entry in the first
50000 lines. If longer entries appeared later, they were silently
truncated.
This is much faster (~30%) for loading actual structured dtypes (by
skipping the recursive packer), somewhat faster (~5-10%) for large loads
(>10000 rows, perhaps because shape inference of the final array is
faster?), and much slower (nearly 2x) for very small loads (10 rows) or
for reads using `dtype=object` (due to the extraneous limitation on
object views, which could be fixed separately); however, the main point
is to allow further optimizations.
This patch takes advantage of the possibility of assigning a tuple of
*strs* to a structured dtype with e.g. float fields, and have the strs
be implicitly converted to floats by numpy at the C-level.  (A
Python-level fallback is kept to support e.g. hex floats.)  Together
with the previous commit, this provides a massive speedup (~2x on the
loadtxt_dtypes_csv benchmark for 10_000+ ints or floats), but is
beneficial with as little as 100 rows.  Very small reads (10 rows) are
still slower (nearly 2x for object), as well as reads using object
dtypes (due to the extra copy), but the tradeoff seems worthwhile.
In the fast-path of loadtxt, the conversion to np.void implicitly checks
the number of fields.  Removing the explicit length check saves ~5% for
the largest loads (100_000 rows) of numeric scalar types.
@anntzer
Contributor Author

anntzer commented Sep 6, 2021

Kindly bumping.

@anntzer
Contributor Author

anntzer commented Sep 16, 2021

@charris Anything I can do to move this forward? Thanks! (Sorry, I'm picking on you as you already left a review comment :-))

@seberg
Member

seberg commented Sep 22, 2021

Mainly bringing this up in case it interests you, @anntzer. But part of the reason the momentum has stalled a bit here is that @rossbar and I have been looking at pushing forward npreadtext, with the goal of replacing np.loadtxt by moving it to C: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/X4AU2DUDDNA44HTEFDQXJLC24E6MDEW3/

That does not have to stand in the way here, though. But it would give us a good speed-up and would additionally allow supporting new features, such as quote='"' (and other csv.Dialect features in the future), or even user-provided C parsers.

@anntzer
Contributor Author

anntzer commented Sep 22, 2021

That sounds great; I guess it depends on the timescale over which you think npreadtext will make it into numpy. From a quick test, npreadtext is faster than loadtxt even with the improvements here, but I am slightly worried that including a large chunk of C may take a while to review (wearing my matplotlib dev hat here), whereas the PR here may be faster to review (although it also involves some tricks).

So if you think npreadtext can be merged relatively quickly (wrt. numpy's release schedule), I am fine with closing this PR and its followup; otherwise, perhaps it can still go in as a temporary stopgap improvement.

@seberg
Member

seberg commented Jan 16, 2022

Closing, as superseded by gh-20580. I don't think it is helpful to keep this open unless the other PR gets rejected, which at this point seems a very long shot; I think it is finished and good.

@seberg seberg closed this Jan 16, 2022
@anntzer anntzer deleted the loadtxtflatdtype branch January 16, 2022 22:48