Proposal for improving remote data format inference #1653

Open
pierrecamilleri opened this issue Apr 17, 2024 · 0 comments

pierrecamilleri commented Apr 17, 2024

Context

A related issue has already been submitted: #1646.

I've taken the liberty of opening a new issue more specifically linked to the format inference problem, for the sake of clarity. If you'd prefer me to continue with the above issue instead and close this one, let me know and I'll take care of it.

Issue

The file format is guessed from the file extension (if I am not mistaken, in this function). As far as I understand file storage, this seems a good strategy for guessing the format of a local file, but it falls short for many use cases involving remote CSV resources (at least over http(s)).

Indeed, many APIs serve CSV files from URLs that carry no explicit extension.

Issue reproduction

With frictionless v4.40.11:

$ frictionless describe https://data.capatlantique.fr/api/explore/v2.1/catalog/datasets/244400610_subventions_liste/exports/csv
[...]
format: ''
[...]

(The problem remains in v5, tested with v5.16.1, but I could not find how to reproduce this exact output there.)

Workaround

As mentioned in this comment, the workaround is to explicitly provide the format.

Proposal

The format of an http(s) response is usually declared in the response's Content-Type header.

It would seem appropriate to use this information to infer the file format.

For example, looking at the response headers for the above URL (curl -v https://data.capatlantique.fr/api/explore/v2.1/catalog/datasets/244400610_subventions_liste/exports/csv) indeed shows:

content-type: text/csv; charset=utf-8
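
To illustrate the idea, here is a minimal sketch of how such a header value could be turned into a format and an encoding, using only the Python standard library. The `MEDIA_TYPE_TO_FORMAT` mapping and the `infer_format_from_content_type` function are hypothetical names chosen for this example; they are not part of frictionless, and the mapping is deliberately incomplete.

```python
from email.message import Message

# Hypothetical, incomplete mapping from media types to format names.
MEDIA_TYPE_TO_FORMAT = {
    "text/csv": "csv",
    "text/tab-separated-values": "tsv",
    "application/json": "json",
}


def infer_format_from_content_type(header_value):
    """Return (format, charset) guessed from a Content-Type header value.

    Either element is None when it cannot be determined.
    """
    # Let the stdlib header parser split the media type from its parameters.
    msg = Message()
    msg["Content-Type"] = header_value
    media_type = msg.get_content_type()  # lowercased, e.g. "text/csv"
    charset = msg.get_content_charset()  # lowercased, or None if absent
    return MEDIA_TYPE_TO_FORMAT.get(media_type), charset


print(infer_format_from_content_type("text/csv; charset=utf-8"))
```

An unknown media type such as `application/octet-stream` yields `None` for the format, in which case falling back to the current extension-based guess would still make sense.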

Some additional improvements could be made using the response headers: the encoding is also mentioned there, and a more relevant filename can be found, for example, in the Content-Disposition header:

content-disposition: attachment; filename="244400610_subventions_liste.csv"
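
As a rough sketch of how that filename could feed the existing extension-based guess, the snippet below extracts the filename and its extension from a Content-Disposition value with the Python standard library. The function name `extension_from_content_disposition` is hypothetical and only serves this example; it is not an existing frictionless function.

```python
from email.message import Message
from pathlib import PurePosixPath


def extension_from_content_disposition(header_value):
    """Return (filename, extension) from a Content-Disposition header value.

    Either element is None when the header does not provide it.
    """
    # The stdlib header parser handles quoting of the filename parameter.
    msg = Message()
    msg["Content-Disposition"] = header_value
    filename = msg.get_filename()
    if not filename:
        return None, None
    # Keep only the last suffix, without the dot, lowercased.
    extension = PurePosixPath(filename).suffix.lstrip(".").lower()
    return filename, extension or None


print(
    extension_from_content_disposition(
        'attachment; filename="244400610_subventions_liste.csv"'
    )
)
```

The extracted extension could then go through the same lookup that is already applied to local file paths.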