Proposal for improving remote data format inference #1653

pierrecamilleri · 2024-04-17T10:25:48Z

Context

A related issue has already been submitted: #1646.

I've taken the liberty of opening a new issue more specifically linked to the format inference problem, for the sake of clarity. If you'd prefer me to continue with the above issue instead and close this one, let me know and I'll take care of it.

Issue

The file format is guessed from the file extension (if I am not mistaken, in this function). As far as I know, this is a good strategy for guessing the format of a local file, but it falls short for many remote CSV resources (at least those served over http(s)).

Indeed, many APIs do not include an explicit extension in the URL when serving CSV files.

Issue reproduction

With frictionless v4.40.11:

$ frictionless describe https://data.capatlantique.fr/api/explore/v2.1/catalog/datasets/244400610_subventions_liste/exports/csv

[...]
format: ''
[...]

(The problem persists in v5, tested with v5.16.1, but I could not find a way to reproduce this output.)
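The empty format above is what extension-based guessing would be expected to produce for this URL. The following is only a rough illustration using the Python standard library, not frictionless's actual code; the helper name `guess_format_from_extension` is hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def guess_format_from_extension(url: str) -> str:
    """Return the lowercase extension of the URL path, or '' if there is none."""
    path = urlparse(url).path
    # splitext only looks at the last path segment, so the dot in "v2.1" is ignored.
    extension = posixpath.splitext(path)[1]
    return extension.lstrip(".").lower()


url = (
    "https://data.capatlantique.fr/api/explore/v2.1/catalog/datasets/"
    "244400610_subventions_liste/exports/csv"
)
print(repr(guess_format_from_extension(url)))  # ''
```

The last path segment is `csv`, but since it is not preceded by a dot, it is not treated as an extension.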

Workaround

As mentioned in this comment, the workaround is to provide the format explicitly.

Proposal

The format of an http(s) response is usually declared in the response's Content-Type header.

It would seem appropriate to use this information to infer the file format.

For example, the response headers for the above URL (curl -v https://data.capatlantique.fr/api/explore/v2.1/catalog/datasets/244400610_subventions_liste/exports/csv) indeed show:

content-type: text/csv; charset=utf-8

Some additional improvements could be made using the response headers: the encoding is also mentioned, and a more relevant filename can be found, e.g. in the Content-Disposition header:

content-disposition: attachment; filename="244400610_subventions_liste.csv"
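To make the proposal more concrete, here is a minimal sketch of header-based inference using only the Python standard library. It is not tied to frictionless's internals: the function name `infer_from_headers` and the `MIME_TO_FORMAT` mapping are assumptions made for illustration, and a real implementation would need a more complete mapping.

```python
import posixpath
from email.message import Message

# Hypothetical, deliberately incomplete mapping from media type to format name.
MIME_TO_FORMAT = {
    "text/csv": "csv",
    "application/json": "json",
}


def infer_from_headers(headers: dict) -> dict:
    """Infer format, encoding and filename from HTTP response headers."""
    msg = Message()
    for name, value in headers.items():
        msg[name] = value  # header lookups on Message are case-insensitive

    # get_content_type() defaults to text/plain, so only trust it if the header exists.
    mime = msg.get_content_type() if "content-type" in msg else None
    encoding = msg.get_content_charset()  # e.g. "utf-8", or None if absent
    filename = msg.get_filename()  # taken from Content-Disposition, or None

    file_format = MIME_TO_FORMAT.get(mime)
    if file_format is None and filename:
        # Fall back to the extension of the filename suggested by the server.
        file_format = posixpath.splitext(filename)[1].lstrip(".").lower() or None

    return {"format": file_format, "encoding": encoding, "filename": filename}


result = infer_from_headers({
    "content-type": "text/csv; charset=utf-8",
    "content-disposition": 'attachment; filename="244400610_subventions_liste.csv"',
})
print(result)
```

With the two headers shown above, this sketch yields the format `csv`, the encoding `utf-8` and the filename `244400610_subventions_liste.csv`. Trying the Content-Type first and only then the Content-Disposition filename is one possible ordering; the reverse could also be argued for servers that send a generic media type.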
