CSV files with header break with delim_whitespace and skiprows using the C-engine #18692

rgieseke · 2017-12-08T14:43:56Z

Code Sample, a copy-pastable example if possible

# Python 3
import pandas as pd
from io import StringIO
data = """Meta:
x " <- Space before quote char

id  value
1  10
2  20
"""
pd.read_csv(StringIO(data), delim_whitespace=True, skiprows=3)

Problem description

When a header line contains a quote char preceded by a space, reading the CSV file with delim_whitespace=True and skiprows fails with

EmptyDataError: No columns to parse from file

Expected Output

It should simply skip the header rows.

When using the Python engine it works, so this seems to be a problem with the C-based parser, possibly related to the behaviour introduced in #12900

# This works (with the Python engine):
import pandas as pd
from io import StringIO
data = """Meta:
x " <- Space before quote char

id  value
1  10
2  20
"""
pd.read_csv(StringIO(data), delim_whitespace=True, skiprows=3, engine="python")

My real data has something like

x = "   "

I also tested this with current master.

Output of `pd.show_versions()`

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.3.1
pip: 9.0.1
setuptools: 38.2.4
Cython: 0.27.3
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

jreback · 2017-12-10T15:18:15Z

@gfyoung can you have a look

gfyoung · 2017-12-11T01:34:57Z

@rgieseke : Thanks for reporting this! Handling malformed rows is just not an easy question, and I agree here that indeed we should handle this case.

#12900 was a design choice on our part to ensure that quoted lines got fully skipped, even if they had line-terminators within them. In your case, because your quoted line never properly terminates, read_csv will skip over everything.
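The effect of an unbalanced quote can be illustrated with Python's standard-library csv module. This is only an analogy: it is not the pandas C tokenizer, and the variable names are made up for the sketch. It shows how a quote that opens a field and never closes makes a reader treat all following lines as part of that one field, and how disabling quoting restores one row per physical line.

```python
import csv
from io import StringIO

# Header line from the report (a lone quote char preceded by a space),
# followed by a blank line and the whitespace-separated table.
data = 'x " <- Space before quote char\n\nid  value\n1  10\n2  20\n'

# Default quoting: the quote opens a field that never closes, so the
# reader keeps consuming lines until the input ends.
quoted_rows = list(csv.reader(StringIO(data), delimiter=" "))

# QUOTE_NONE: the quote is an ordinary character, so every physical
# line (including the blank one) becomes its own row.
plain_rows = list(
    csv.reader(StringIO(data), delimiter=" ", quoting=csv.QUOTE_NONE)
)

print(len(quoted_rows), len(plain_rows))
```

If line skipping is quote-aware in the same way, a skipped line with an unbalanced quote takes the rest of the input with it, which would be consistent with the EmptyDataError in the report.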

Surprised to see that the Python engine is okay with this. Line skipping behavior is hard to understand there because it's masked away in the Python language library itself.

My feeling is that we can add a second parameter called parse_bad_lines (which has been mentioned before in other issues) in which we can handle the behavior of #12900 but still allow for your example to work. Adding parameters will become a lot more palatable once we remove a wave of deprecated ones.

rgieseke · 2017-12-11T09:55:29Z

@gfyoung Thanks for the quick feedback!

Here is a more real-world example from the data where I encountered the problem (simplified; this is Fortran namelist metadata plus whitespace-separated columns). It's not really malformed, it just has a space within the quotes. Again, it works with engine="python". Without delim_whitespace and with commas as the separator there is also no problem.

import pandas as pd
from io import StringIO
data = """&THISFILE_SPECIFICATIONS
 THISFILE_UNITS="K ",
 /
      YEARS     GLOBAL       
      1765      0.00000000E+00
"""
pd.read_csv(StringIO(data), skiprows=3, delim_whitespace=True)

If there were newlines in the header to be skipped, wouldn't it be okay to treat them as newlines? If one has a weird header one wants to be skipped, one would need to check anyway if the first line is correctly identified.
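One way to get the "treat newlines as newlines" behaviour today is to drop the metadata lines before the parser sees them. The following is a minimal sketch using only the standard library; drop_leading_lines is a hypothetical helper name, not a pandas function, and it assumes the header to skip is a fixed number of physical lines.

```python
from io import StringIO


def drop_leading_lines(text, n):
    """Return a buffer holding `text` without its first `n` physical lines.

    Lines are split on line boundaries only; quote characters get no
    special treatment. (Hypothetical helper, not part of pandas.)
    """
    lines = text.splitlines(keepends=True)
    return StringIO("".join(lines[n:]))


data = (
    "&THISFILE_SPECIFICATIONS\n"
    ' THISFILE_UNITS="K ",\n'
    " /\n"
    "      YEARS     GLOBAL\n"
    "      1765      0.00000000E+00\n"
)

body = drop_leading_lines(data, 3)
header = body.readline().split()  # first remaining line holds the column names
body.seek(0)                      # rewind so a parser can read from the start
print(header)
```

The rewound buffer can then be handed to read_csv without skiprows, so the quote in the namelist line never reaches the tokenizer.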

gfyoung · 2017-12-11T16:02:34Z

Indeed, this example is harder to explain away, since it isn't particularly malformed in this case...have a look at the CParser code to see where the discrepancy is arising.

> If there were newlines in the header to be skipped, wouldn't it be okay to treat them as newlines? If one has a weird header one wants to be skipped, one would need to check anyway if the first line is correctly identified.

Not sure I fully understand your question here.

rgieseke · 2017-12-11T16:09:39Z

> Not sure I fully understand your question here.

Sorry, that was hard to parse ... I hadn't thought of the use case of actually wanting to skip a number of rows. I only ever use skiprows to get rid of meta information in header lines.

gfyoung · 2017-12-11T16:13:41Z

> Sorry, that was hard to parse ... I hadn't thought of the use case of actually wanting to skip a number of rows. I only ever use skiprows to get rid of meta information in header lines.

Ah, gotcha. In any case, FWIW, if you remove the skiprows parameter, Python can still read the input, while the C engine still can't. Thus, that definitely points to some kind of parsing issue.

cip · 2018-05-03T07:33:20Z

I've experienced a similar issue when using skiprows to skip "corrupt" lines in the csv files.
In this test case no exception is thrown, but some rows are just missing.

Tested with pandas 0.22.0 and 0.19.2.

Perhaps adding an argument to disable parsing of quotes in skipped rows is an option to fix this?
Personally I'd prefer if this would be the default behaviour, and the argument can be used to enable parsing of quotes in skipped rows.

Example

import pandas as pd
from io import StringIO
import csv

pd.__version__

'0.22.0'

data="""1 2 3
a"b "
4 5 6
"""

print(data)

1 2 3
a"b "
4 5 6

pd.read_csv(StringIO(data), header=None, delim_whitespace=True, skiprows=[1])

	0	1	2
0	1	2	3

pd.read_csv(StringIO(data), header=None, delim_whitespace=True, skiprows=[1], engine="python")

	0	1	2
0	1	2	3
1	4	5	6

pd.read_csv(StringIO(data), header=None, delim_whitespace=True, skiprows=[1], quoting=csv.QUOTE_NONE)

	0	1	2
0	1	2	3
1	4	5	6
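The quoting=csv.QUOTE_NONE workaround can also be written without delim_whitespace. This sketch assumes that sep=r"\s+" is an acceptable whitespace-separator replacement for this data and that no field actually needs quote handling. It was not part of the original test and has not been checked against every pandas version.

```python
import csv
from io import StringIO

import pandas as pd

data = '1 2 3\na"b "\n4 5 6\n'

# Skip the corrupt second line (index 1). With QUOTE_NONE the quote char
# is treated as ordinary text, so it cannot swallow the following row.
df = pd.read_csv(
    StringIO(data),
    header=None,
    sep=r"\s+",              # assumed stand-in for delim_whitespace=True
    skiprows=[1],
    quoting=csv.QUOTE_NONE,
)
print(df)
```

The trade-off is that QUOTE_NONE applies to the whole file, so this only fits data where quoted fields are not needed in the rows that are kept.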

mroeschke · 2024-05-10T17:50:04Z

delim_whitespace has been deprecated and will be removed in pandas 3.0, so closing as won't fix.

jreback added the IO CSV read_csv, to_csv label Dec 10, 2017

gfyoung added the Compat pandas objects compatability with Numpy or Python functions label Dec 11, 2017

mroeschke added Bug and removed Compat pandas objects compatability with Numpy or Python functions labels Apr 10, 2020

mroeschke closed this as completed May 10, 2024
