CSV files with header break with delim_whitespace and skiprows using the C-engine #18692

Closed
rgieseke opened this issue Dec 8, 2017 · 8 comments
Labels
Bug IO CSV read_csv, to_csv

Comments

@rgieseke
Contributor

rgieseke commented Dec 8, 2017

Code Sample, a copy-pastable example if possible

# Python 3
import pandas as pd
from io import StringIO
data = """Meta:
x " <- Space before quote char

id  value
1  10
2  20
"""
pd.read_csv(StringIO(data), delim_whitespace=True, skiprows=3)

Problem description

When a skipped header line contains a quote character preceded by a space, reading the file with delim_whitespace=True and skiprows fails with

EmptyDataError: No columns to parse from file

Expected Output

It should simply skip the header rows.

When using the Python engine it works, so this seems to be a problem with the C-based parser, possibly related to the behaviour introduced in #12900

# This works (with the Python engine):
import pandas as pd
from io import StringIO
data = """Meta:
x " <- Space before quote char

id  value
1  10
2  20
"""
pd.read_csv(StringIO(data), delim_whitespace=True, skiprows=3, engine="python")

My real data has something like

x = "   "

I also tested this with current master.

Output of pd.show_versions()


commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.3.1
pip: 9.0.1
setuptools: 38.2.4
Cython: 0.27.3
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Dec 10, 2017

@gfyoung can you have a look

@jreback jreback added the IO CSV read_csv, to_csv label Dec 10, 2017
@gfyoung
Member

gfyoung commented Dec 11, 2017

@rgieseke : Thanks for reporting this! Handling malformed rows is just not an easy question, and I agree here that indeed we should handle this case.

#12900 was a design choice on our part to ensure that quoted lines got fully skipped, even if they had line-terminators within them. In your case, because your quoted line never properly terminates, read_csv will skip over everything.
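
To make the mechanism concrete, here is a toy model in plain Python of the difference between skipping physical lines and skipping quote-aware rows. It is only a sketch: `skip_rows_quote_aware` is a hypothetical helper written for illustration, not the C tokenizer's actual code, and it assumes that every quote character toggles quoting wherever it appears.

```python
def skip_rows_quote_aware(text, n, quotechar='"'):
    """Skip n logical rows; a newline inside an open quote does not end a row.

    Hypothetical helper for illustration only, not pandas code.
    """
    in_quotes = False
    skipped = 0
    for i, ch in enumerate(text):
        if skipped == n:
            # Enough rows skipped: everything from here on is data.
            return text[i:]
        if ch == quotechar:
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes:
            skipped += 1
    # Input ran out before (or exactly when) n rows were skipped.
    return ""


# The data from the original report.
data = (
    "Meta:\n"
    'x " <- Space before quote char\n'
    "\n"
    "id  value\n"
    "1  10\n"
    "2  20\n"
)

# Skipping three physical lines leaves the table intact.
physical = "".join(data.splitlines(keepends=True)[3:])

# Quote-aware skipping never sees the quote close, so nothing is left.
quote_aware = skip_rows_quote_aware(data, 3)

print(repr(physical))
print(repr(quote_aware))
```

Under this model the unterminated quote on the second line swallows the rest of the input, which is consistent with the "No columns to parse from file" error in the report.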

Surprised to see that the Python engine is okay with this. Line skipping behavior is hard to understand there because it's masked away in the Python language library itself.

My feeling is that we can add a second parameter called parse_bad_lines (which has been mentioned before in other issues) in which we can handle the behavior of #12900 but still allow for your example to work. Adding parameters will become a lot more palatable once we remove a wave of deprecated ones.

@gfyoung gfyoung added the Compat pandas objects compatability with Numpy or Python functions label Dec 11, 2017
@rgieseke
Contributor Author

@gfyoung Thanks for the quick feedback!

Here is a more real-world example from the data where I encountered the problem. It is simplified: Fortran namelist metadata followed by whitespace-separated columns. It's not really malformed, it just has a space within the quotes. Again, with engine="python" it works. Without delim_whitespace, and with commas as the separator, there is also no problem.

import pandas as pd
from io import StringIO
data = """&THISFILE_SPECIFICATIONS
 THISFILE_UNITS="K ",
 /
      YEARS     GLOBAL       
      1765      0.00000000E+00
"""
pd.read_csv(StringIO(data), skiprows=3, delim_whitespace=True)

If there were newlines in the header to be skipped, wouldn't it be okay to treat them as newlines? If one has a weird header one wants to be skipped, one would need to check anyway if the first line is correctly identified.
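
For files shaped like the namelist example above, one possible workaround is to separate the metadata from the table before any CSV parsing takes place, so that the quoted value never reaches the parser. The sketch below uses only the standard library. `split_namelist_header` is a hypothetical helper, and it assumes the namelist block ends with a line containing only `/`, as it does in the example.

```python
def split_namelist_header(text):
    """Split text into (header, table) after the first line that is just '/'.

    Hypothetical helper; assumes a single namelist block terminated by '/'.
    """
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.strip() == "/":
            return "".join(lines[: i + 1]), "".join(lines[i + 1 :])
    # No terminator found: treat the whole text as table data.
    return "", text


data = (
    "&THISFILE_SPECIFICATIONS\n"
    ' THISFILE_UNITS="K ",\n'
    " /\n"
    "      YEARS     GLOBAL       \n"
    "      1765      0.00000000E+00\n"
)

header, table = split_namelist_header(data)
print(table.split())
```

The returned table text contains no quote characters, so in principle it can be wrapped in StringIO and passed to read_csv without skiprows. That call is not exercised here.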

@gfyoung
Member

gfyoung commented Dec 11, 2017

Indeed, this example is harder to explain away, since it isn't particularly malformed in this case...have a look at the CParser code to see where the discrepancy is arising.

If there were newlines in the header to be skipped, wouldn't it be okay to treat them as newlines? If one has a weird header one wants to be skipped, one would need to check anyway if the first line is correctly identified.

Not sure I fully understand your question here.

@rgieseke
Contributor Author

rgieseke commented Dec 11, 2017

Not sure I fully understand your question here.

Sorry, that was hard to parse ... I hadn't thought of the use case of actually wanting to skip a number of rows. I only ever use skiprows to get rid of meta information in header lines.

@gfyoung
Member

gfyoung commented Dec 11, 2017

Sorry, that was hard to parse ... I hadn't thought of the use case of actually wanting to skip a number of rows. I only ever use skiprows to get rid of meta information in header lines.

Ah, gotcha. In any case, FWIW, if you remove the skiprows parameter, Python can still read the input, while the C engine still can't. Thus, that definitely points to some kind of parsing issue.

@cip

cip commented May 3, 2018

I've experienced a similar issue when using skiprows to skip "corrupt" lines in the csv files.
In this test case no exception is thrown, but some rows are just missing.

Tested with pandas 0.22.0 and 0.19.2.

Perhaps adding an argument to disable parsing of quotes in skipped rows is an option to fix this?
Personally I'd prefer if this would be the default behaviour, and the argument can be used to enable parsing of quotes in skipped rows.

Example

>>> import pandas as pd
>>> from io import StringIO
>>> import csv
>>> pd.__version__
'0.22.0'
>>> data = """1 2 3
... a"b "
... 4 5 6
... """
>>> print(data)
1 2 3
a"b "
4 5 6
>>> pd.read_csv(StringIO(data), header=None, delim_whitespace=True, skiprows=[1])
   0  1  2
0  1  2  3
>>> pd.read_csv(StringIO(data), header=None, delim_whitespace=True, skiprows=[1], engine="python")
   0  1  2
0  1  2  3
1  4  5  6
>>> pd.read_csv(StringIO(data), header=None, delim_whitespace=True, skiprows=[1], quoting=csv.QUOTE_NONE)
   0  1  2
0  1  2  3
1  4  5  6
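
The effect of quoting=csv.QUOTE_NONE can be illustrated by analogy with Python's standard-library csv module. This is a different parser from the pandas C tokenizer, so the sketch only shows the general idea under that assumption: a quote that opens a field and never closes makes the reader merge the following line into the same record, while QUOTE_NONE treats the quote as an ordinary character.

```python
import csv
from io import StringIO

data = '1 2 3\na"b "\n4 5 6\n'

# Default quoting: the lone quote after the space opens a quoted field
# that is never closed, so the next line is absorbed into that record.
default_rows = list(csv.reader(StringIO(data), delimiter=" "))

# QUOTE_NONE: quote characters get no special treatment, one record per line.
raw_rows = list(csv.reader(StringIO(data), delimiter=" ", quoting=csv.QUOTE_NONE))

print(len(default_rows))
print(len(raw_rows))
print(raw_rows[2])
```

With quoting disabled, the third line survives as its own record, which mirrors the output of the QUOTE_NONE call above.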

@mroeschke mroeschke added Bug and removed Compat pandas objects compatability with Numpy or Python functions labels Apr 10, 2020
@mroeschke
Member

delim_whitespace has been deprecated and will be removed in pandas 3.0, so closing as won't fix.
