Support string columns in load_csv #3320

Open
jl-wynen opened this issue Nov 8, 2023 · 2 comments

jl-wynen commented Nov 8, 2023

load_csv does not handle string columns well.
Given a file like

idx,name
1,a
2,b
3,c

we get

>>> sc.io.load_csv('asd.csv')
<scipp.Dataset>
Dimensions: Sizes[row:3, ]
Data:
  idx                         int64  [dimensionless]  (row)  [1, 2, 3]
  name                     PyObject        <no unit>  (row)  [a, b, c]

The problem is that the dtype of name is PyObject rather than string. I think this happens because pandas reads text columns with the generic object dtype by default. If it used StringDtype instead, this might work. See https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
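
To make the suspected cause concrete, here is a minimal sketch of the pandas side only. It assumes the file contents shown above (held in memory rather than on disk) and asks read_csv for StringDtype on the name column through its dtype argument. How load_csv would then map that column onto a scipp string dtype is not shown here and remains an open question.

```python
import io

import pandas as pd

# Same content as the example file above, kept in memory for this sketch.
CSV_TEXT = "idx,name\n1,a\n2,b\n3,c\n"

# Explicitly request pandas' dedicated string dtype for the 'name' column.
df = pd.read_csv(io.StringIO(CSV_TEXT), dtype={"name": "string"})

# The column now carries StringDtype instead of a generic object dtype.
name_is_string = isinstance(df["name"].dtype, pd.StringDtype)
names = df["name"].tolist()
print(df.dtypes)
```

Whether requesting the dtype at read time or converting afterwards is the better fit for load_csv is a design choice this sketch does not settle.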

@SimonHeybrock SimonHeybrock changed the title Support string columns in lead_csv Support string columns in load_csv Nov 9, 2023
YooSunYoung commented

I also ran into this problem when using from_pandas.
I tried setting the dtype to string when all elements in the column are strings.
But I'm not sure what should be done when a column contains both ints and strings.
Should it just fall back to PyObject in that case?
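
The rule described in this comment could be sketched roughly as follows. The helper name target_dtype is hypothetical and not part of scipp or pandas; it relies on pandas.api.types.infer_dtype, which inspects the values themselves, and it is only meant for columns that pandas has already stored with the object dtype. Mixed columns fall back to "object", which is one possible answer to the question above, not a settled decision.

```python
import pandas as pd


def target_dtype(column: pd.Series) -> str:
    """Hypothetical helper: pick 'string' or 'object' for an object column."""
    # infer_dtype looks at the actual values, e.g. 'string' or 'mixed-integer'.
    kind = pd.api.types.infer_dtype(column, skipna=True)
    # Only a column made up entirely of str values is treated as a string column.
    return "string" if kind == "string" else "object"


all_strings = pd.Series(["a", "b", "c"], dtype=object)
mixed = pd.Series([1, "b", "c"], dtype=object)

print(target_dtype(all_strings), target_dtype(mixed))
```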

YooSunYoung commented

Alternatively, could we enable astype from PyObject to string where possible?
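
As a rough illustration of what "where possible" might mean, here is a sketch at the NumPy level. The function object_to_string is a hypothetical helper, not scipp's astype; it assumes that a conversion should succeed only when every element is already a str and should fail loudly otherwise.

```python
import numpy as np


def object_to_string(values: np.ndarray) -> np.ndarray:
    """Hypothetical helper: strictly convert an object array to a string array."""
    # A plain .astype(str) would silently turn the int 1 into '1', so check first.
    if not all(isinstance(v, str) for v in values.ravel()):
        raise TypeError("cannot convert: not every element is a str")
    return values.astype(str)


converted = object_to_string(np.array(["a", "b", "c"], dtype=object))
print(converted.dtype.kind)
```

Raising on mixed input keeps the conversion explicit: a column with ints and strings stays as PyObject unless the caller decides otherwise.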
