Unable to validate date and date-time with jsonschema #420

Open
rennerocha opened this issue Sep 19, 2023 · 2 comments

Comments

@rennerocha
Collaborator

After #358, validation of date fields using jsonschema no longer works as it did before. Spidermon used to serialize date fields into strings (https://github.com/scrapinghub/spidermon/pull/358/files#diff-7937ac85a30630fe837b9c133f4459ee590680bb5dfce72775db6005f2b45f51L142). Now that it no longer does, the date and date-time format checkers (https://python-jsonschema.readthedocs.io/en/stable/validate/#validating-formats) do not work as expected when the item passed to the jsonschema validators contains a datetime.date or a datetime.datetime instance.

Given the code:

import datetime
from itemadapter import ItemAdapter
from jsonschema import FormatChecker
from jsonschema.validators import validator_for
from spidermon.contrib.scrapy.pipelines import ItemValidationPipeline

format_checker = FormatChecker()

schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "date": {
            "description": "Date of the gazette",
            "type": "string",
            "format": "date"
        }
    },
    "required": [
        "date",
    ]
}

validator_cls = validator_for(schema)
validator = validator_cls(schema=schema, format_checker=format_checker)
original_data = {
    'date': datetime.date.today()
}

Validating with spidermon 1.20.0

>>> item_adapter = ItemAdapter(original_data)
>>> item_dict = item_adapter.asdict()
>>> errors = validator.iter_errors(item_dict)
>>> [error for error in errors]
[<ValidationError: "datetime.date(2023, 9, 19) is not of type 'string'">]

With spidermon 1.17.0

>>> data = ItemValidationPipeline._convert_item_to_dict(_, original_data)
>>> errors = validator.iter_errors(data)
>>> [error for error in errors]
[]

Validating with spidermon 1.20.0

>>> errors = validator.iter_errors(data)
>>> [error for error in errors]
[<ValidationError: "datetime.date(2023, 9, 19) is not of type 'string'">]
@rennerocha
Collaborator Author

This change has the potential to break applications that rely on Spidermon understanding date and datetime values and validating them with jsonschema.

To make it work, the user needs to manually serialize the date and datetime values in the items. But I am trying to figure out whether there is some solution that could be implemented on the Spidermon side to avoid this manipulation.
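
For reference, the manual serialization mentioned above could look roughly like the sketch below. The helper name `serialize_dates` is hypothetical and not part of Spidermon. It assumes items are plain dicts (or were already converted with `ItemAdapter(...).asdict()`) and that ISO 8601 strings are acceptable for the schema. Note that a naive `datetime` produces a string without a UTC offset, which a strict `date-time` checker may reject.

```python
import datetime


def serialize_dates(value):
    """Return a copy of `value` with date/datetime objects replaced by ISO 8601 strings."""
    # datetime.datetime is a subclass of datetime.date; both provide isoformat().
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: serialize_dates(inner) for key, inner in value.items()}
    if isinstance(value, (list, tuple)):
        return [serialize_dates(inner) for inner in value]
    # Anything else (str, int, None, ...) is left untouched.
    return value


item = {"date": datetime.date(2023, 9, 19), "tags": ["a"]}
print(serialize_dates(item))  # {'date': '2023-09-19', 'tags': ['a']}
```

The returned dict can then be passed to `validator.iter_errors(...)` in place of the original item.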

cc @VMRuiz @Gallaecio

@VMRuiz
Collaborator

VMRuiz commented May 6, 2024

Hey, sorry for getting back to you late on this. I'm not entirely sure if we should change anything here. If you want your field to be a string with a date format, you could scrape it that way or set up an item pipeline to automatically convert datetime objects into strings if that's easier for you.
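
A minimal sketch of the item pipeline option, under a few assumptions: the class name `DateToStringPipeline` and the `myproject.pipelines` module path are hypothetical, items are dicts or `scrapy.Item` objects (both support `items()` and item assignment), and only top-level fields need converting.

```python
import datetime


class DateToStringPipeline:
    """Hypothetical pipeline: converts top-level date/datetime fields to ISO 8601 strings."""

    def process_item(self, item, spider):
        # Iterate over a snapshot so fields can be reassigned safely inside the loop.
        for field, value in list(item.items()):
            if isinstance(value, (datetime.date, datetime.datetime)):
                item[field] = value.isoformat()
        return item


# settings.py (assumed layout): a lower number runs earlier, so the conversion
# happens before Spidermon validates the item.
# ITEM_PIPELINES = {
#     "myproject.pipelines.DateToStringPipeline": 700,
#     "spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800,
# }
```

The trade-off is that every later pipeline and the exported feed also see strings rather than date objects, which is the concern the opt-in idea below tries to address.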

I don't think Spidermon should make that decision for you by default. But I'm open to the idea of adding it as an opt-in feature where you can configure auto-casting methods for your fields. It could come in handy, especially when you want to validate with Jsonschema but still keep the original data types, like for binary RPC calls.

What do you think @Gallaecio @curita ?
