Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.xls support? And a weird header issue #259

Open
aodj opened this issue Feb 13, 2023 · 2 comments
Open

.xls support? And a weird header issue #259

aodj opened this issue Feb 13, 2023 · 2 comments
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed priority

Comments

@aodj
Copy link

aodj commented Feb 13, 2023

Firstly, thanks for roapi! I've been playing around with it for a little while and really like it.

I've been trying to use it to proxy some datafiles containing information on currency codes as published by the body behind the ISO 4217 standard. The data is available in .xls or .xml format, but I can't get roapi to parse the .xls file. Opening it in Excel shows that the file has a couple of leading rows of dud data, in a non-tabular format that's not relevant to the data to be loaded.

When running it with roapi I get the following errors:

$ RUST_LOG=debug roapi --table "https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list-one.xls,format=xlsx,sheet_name=Active"
[2023-02-13T11:26:06Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:26:06Z INFO  roapi::context] loading `uri(https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list-one.xls)` as table `list-one`
Error: Error loading Xlsx table: Failed to load .xlsx file

Downloading the file locally gives the same error:

$ RUST_LOG=debug roapi --table "list-one.xls,format=xlsx,sheet_name=Active"
[2023-02-13T11:27:03Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:27:03Z INFO  roapi::context] loading `uri(list-one.xls)` as table `list-one`
Error: Error loading Xlsx table: Failed to load .xlsx file

If I save it as an .xlsx file I get an error about unparseable data:

$ RUST_LOG=debug RUST_BACKTRACE=1 roapi --table "list-one.xlsx,format=xlsx,sheet_name=Active"
[2023-02-13T11:28:07Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB
[2023-02-13T11:28:07Z INFO  roapi::context] loading `uri(list-one.xlsx)` as table `list-one`
Error: Arrow error: Invalid argument error: column types must match schema types, expected Null but found Utf8 at column index 2

Caused by:
    Invalid argument error: column types must match schema types, expected Null but found Utf8 at column index 2

Do you think there's anything that might be done to support the loading of this file? Something like skiprows or header in Pandas would work. I've worked around it by processing the data with Pandas and saving it as a .parquet file, but I would like the ability to reference the current upstream file without having to process it.

Thanks for any input you can provide. I look forward to any suggestions you might have.

@dheerajmishra77
Copy link

dheerajmishra77 commented Feb 23, 2023

I am facing same issue with all .xlsx files. Even a simple CSV saved as XLSX (to rule out any data issue) is unable to be read by roapi
$ roapi -c /Users/dm/Desktop/roapi_test.yaml [2023-02-23T20:33:47Z DEBUG datafusion::execution::memory_manager] Creating memory manager with initial size 11744051.2 TB [2023-02-23T20:33:47Z INFO roapi::context] loading 'uri(/Users/dm/Desktop/Dremio/Dremio_Test/All_Kind_Files/Salary_Sheet1.xlsx)' as table 'EXCEL_TABLE' Error: Error loading Xlsx table: Failed to open .xlsx file.

@houqp houqp added bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed priority labels Nov 20, 2023
@Maks-D
Copy link
Contributor

Maks-D commented Jan 7, 2024

Hi, @houqp , @aodj

I've prepared a PR #316 that fixes all errors described in this issue and also add possibility to customise range of rows/columns from spread sheet that will be available in roapi table.

houqp pushed a commit that referenced this issue Feb 1, 2024
This PR is a fix for issue #259

List of updates/fixes:
* module xlsx renamed to excel.
* Allow reading not only xlsx format but also xls, ods, xlsb
* Allow Excel DateTime format and transform it to arrow
Timestamp(Seconds, None)
* Allow using NULLs in any data types and use null value instead of
string "null"
* Fix issue with incorrect data type inference when multiple data types
are detected.
* Add possibility to specify data schema in config.
* Add new options: -
rows_range_start
 - rows_range_end
 - columns_range_start
 - columns_range_end
 - schema_inference_lines
* Make sheet_name optional and if it is not specified than use first
sheet by default

* Bump calamine crate to version 0.23.1 and add feature "dates"
(supporting for DateTime column format)

Documentation updates: roapi/docs#20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed priority
Projects
None yet
Development

No branches or pull requests

4 participants