
Confusion about why this file is not reported invalid #1646

Open
thbar opened this issue Mar 17, 2024 · 1 comment

Comments

@thbar

thbar commented Mar 17, 2024

Hello,

I'm using frictionless as a CLI at the moment (version 5.16.1).

I have files that do not conform to a schema, yet the CLI reports no errors, and I'm a bit puzzled as to why.

frictionless validate https://www.data.gouv.fr/fr/datasets/r/fde557ec-b96e-49a5-9282-31407296282c \
--schema https://schema.data.gouv.fr/schemas/etalab/schema-irve-dynamique/latest/schema-dynamique.json \
--json

Returns:

{
  "valid": true,
  "stats": {
    "tasks": 1,
    "errors": 0,
    "warnings": 0,
    "seconds": 0.692
  },
  "warnings": [],
  "errors": [],
  "tasks": [
    {
      "name": "fde557ec-b96e-49a5-9282-31407296282c",
      "type": "file",
      "valid": true,
      "place": "https://www.data.gouv.fr/fr/datasets/r/fde557ec-b96e-49a5-9282-31407296282c",
      "labels": [],
      "stats": {
        "errors": 0,
        "warnings": 0,
        "seconds": 0.692
      },
      "warnings": [],
      "errors": []
    }
  ]
}

I get the same output if I download the files first (attached here to freeze them) & follow the required redirect.

local-data-redirected.csv
local-schema.json
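
In case it helps with reproducing this, the frozen copy can be recreated with something like the sketch below (Python standard library only; urllib.request follows the redirect on its own, and the output file name simply mirrors the attachment above):

import urllib.request

url = "https://www.data.gouv.fr/fr/datasets/r/fde557ec-b96e-49a5-9282-31407296282c"

# urlopen follows the 302 redirect issued by data.gouv.fr automatically
with urllib.request.urlopen(url) as response:
    data = response.read()

with open("local-data-redirected.csv", "wb") as f:
    f.write(data)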

I am very puzzled. Am I using the CLI incorrectly? Is the schema too permissive for some reason I'm missing?

Thanks for any hint!

@AntoineAugusti
Contributor

Hi Thibaut,

It seems this is similar to an issue reported previously.

Frictionless does not detect that the URL points to a CSV, so it validates the resource as a plain file (hence "type": "file" in the report above) and the table schema is never applied.

I updated the command (reordering the arguments) and added --format csv to make it work:

frictionless validate \
--schema https://schema.data.gouv.fr/schemas/etalab/schema-irve-dynamique/latest/schema-dynamique.json \
https://www.data.gouv.fr/fr/datasets/r/fde557ec-b96e-49a5-9282-31407296282c \
--json --format csv
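
The same explicit-format validation should also work from the Python API. This is only a rough sketch against the frictionless 5.x interface (Resource, Schema.from_descriptor, Report.flatten); the option names may need adjusting for the installed version:

from frictionless import Resource, Schema

# Load the table schema from its URL
schema = Schema.from_descriptor(
    "https://schema.data.gouv.fr/schemas/etalab/schema-irve-dynamique/latest/schema-dynamique.json"
)

# Pass the format explicitly, since the redirected URL has no .csv extension
# for frictionless to infer the format from
resource = Resource(
    path="https://www.data.gouv.fr/fr/datasets/r/fde557ec-b96e-49a5-9282-31407296282c",
    format="csv",
    schema=schema,
)

report = resource.validate()
print(report.valid)  # expected to be False when rows violate the schema
print(report.flatten(["rowNumber", "fieldNumber", "type"]))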

github-merge-queue bot pushed a commit to etalab/transport-site that referenced this issue Mar 18, 2024
* Adapt setup to rely on now-incorporated app resources

* Leverage now existing HTTPClient

* Remove legacy

* Adapt remaining code to fix the script

* Add io_ansi_table via hybrid mix setup

* Improve reporting

* Improve reporting

* Add TODOs

* Avoid parsing CSV multiple times

* Show validity information

* Add local frictionless validator

* Help me understand

* Fix cache code

* Disable cache

* Add notes on frictionless

See:
- frictionlessdata/frictionless-py#1646

* Start parsing report

* Move to some non-git-enabled folder

* Shorten name

* Dump frictionless output (not always reliable)
github-merge-queue bot pushed a commit to etalab/transport-site that referenced this issue Jun 6, 2024
* Attempt to detect when frictionless did not really validate

See frictionlessdata/frictionless-py#1646

* Improve debugging

* Improve caching (well required)

* Help me when errors arise

* Provide better reports

* Run mix format

* Add extra aggregate config type

* Deserialize aggregate config item

* Wire aggregate type to proxy (with a TODO)

* Implement rendering for backoffice

* Display nothing and not a buggy link

* Add a first test

* Implement core part of fetching

* Run mix format

* Implement parsing & recomposing of ids only

* Fix warning

* Add note

* Implement test on aggregation

* Add TODOs for remaining cases

* Add moduledoc

* Move Telemetry around before refactoring the rest

* Extract aggregate processor to its own module

* Add module_doc

* Extract method

* Rename variable

* Fix typo

* Start passing external options around

* Handle limit_per_source and include_origin

* Create schema-irve-dynamique.json

* Improve query parameters handling

* Add dynamic schema introspection wrapper

* Add note

* Throw to interrupt for now

* Ensure we have exactly all headers - drop sub-stream if not

* Remove test code

* Add WIP quick and dirty checker

* Refactor for testing

* Add placeholder

* Add test fix

* WIP test

* Fix comment

* Allow to customize line separator

* Fix test

* Remove live checking from current PR scope

* Finalize test

* Run mix format

* Fix credo warning

* Fix credo warning

* Simplify code

* Revise my position on TODO

* Scope has been verified

* Expose TTL for main IRVE feed

* Seems to be clean at this point

* Do not enforce optional key (we provide a default value)

* Refactor before adding more scenarios

* Extract method more

* Rework method

* Move todos

* Make sure non-200 responses are dropped out of resulting consolidation

* Add todos

* Add TODO for timeouts (would take down everything)

* Catch & handle sub-feeds errors safely (with logging)

* Handle timeouts (> 5sec) in sub-feeds to avoid global 500

* Remove apparently dead code

* Extract code before reuse

* Fix too loose typing in sub-items

* Add missing line jump

* Adapt code for reworked items

* DRY configuration & fix broken tests

* Add support for sub-key

* Allow function override

* Implement cached sub-source queries

* Allow override of sub-item type

* Add useful notes

* Make sure EnforceTTL do not crash on sub-items

* Remove TODO and make backoffice work

* Use hard-coded alt config for staging

* Use variable

* Add missing documentation

* Make sure anyone can understand this test in 6 months

* Add note

* Add missing assertions

* Extract method "setup_remote_responses"

Refactoring before adding more test-cases, to DRY things out.

* Remove unused code

* Simplify code

* Remove satisfied TODO

* Improve tests

* Add note

* Use with_log & remove already tested stuff

* Prepare remaining work items

* Allow to format numbers with nil variant

* Fix missing test & adapt the others

* Add TODO on bogus code

* Complete TODO

* Fix broken "exception in remote" test case

* Add test-case for regular non-200 case

* Add 302 test-case (important & used)

* Remove implemented TODO

* Finalize main test

* Add a test (timeout case), find a bug, fix the bug

* Remove left-over

* Implement 2 remaining tests

* Remove TODO (now properly tested & fixed)

* Add comments for maintenance

* Reduce code & make rows unique

This could save a lot of debugging later.

* Comply with Credo

* Do not use nil (== :nil so confusing) key

* Review change: stop leaking :resource down the line

This will only lead to confusing uses in the future.

* Add documentation

* Add note for later

* Fix doc

* Remove redundant doc

* Add missing end of sentence

* Add doc

* Simplify code & fix test

* Gimme some typing

Co-authored-by: Antoine Augusti <antoine.augusti@transport.data.gouv.fr>

* Fix typo

Co-authored-by: Antoine Augusti <antoine.augusti@transport.data.gouv.fr>

* Run mix format

* Add well-needed parameter type

Co-authored-by: Antoine Augusti <antoine.augusti@transport.data.gouv.fr>

* Add missing namespace to module

* Add useful typing

* Move aggregate event construction to telemetry

* Simplify the construct

* Rename for consistency

* Refactor: simplify 302-redirect support code

Code review showed that passing a function was not a clear way to support the different scenarios here. I remove the need for function passing by supporting 302 redirects (and only those) directly in the Finch wrapper.

* Fix warning

* Add missing redirect tests for Finch wrapper

* Remove unclear comment

* Add note

* Add TODO

* Add failing test for telemetry bug

* Ensure no other telemetry event is sent

* Fix incorrect count of internal aggregated queries

* Add note

* Clarify un-registering of handlers

Also link to #3975 which can lead the maintainer to a problematic database situation.

* Remove note (this has been completed)

* DRY proxy request types

After verification, `@proxy_requests` is only referenced inside the same file, and only for unsorted guard checks.

* DRY cache separator

* Rollback partially incorrect DRYing

The same character is used, but 97800f3 incorrectly conflates telemetry events (metrics) with cache keys.

* Make sure to cover config parsing with tests

---------

Co-authored-by: Antoine Augusti <antoine.augusti@transport.data.gouv.fr>