
Consider adding referencing.jsonschema.SchemaValidatingRegistry #68

Open
Julian opened this issue Aug 9, 2023 · 5 comments
Labels
enhancement New feature or request

Comments


Julian commented Aug 9, 2023

It might be a nice convenience to have an object with the same interface as referencing.Registry but which automatically validates schemas (Resources) on insertion.

Specifically an object where (in pseudocode):

import referencing.jsonschema

valid_schema = referencing.jsonschema.DRAFT7.create_resource({"type": "integer"})
invalid_schema = referencing.jsonschema.DRAFT7.create_resource({"type": 37})

registry = referencing.jsonschema.SchemaValidatingRegistry()
registry = valid_schema @ registry    # fine
registry = invalid_schema @ registry  # -> big boom!

This would invert the usual dependency on jsonschema (the library) -- because we'd call DraftNValidator.check_schema each time a resource was added to the SchemaValidatingRegistry.

We should likely allow parametrizing how the right object for validation is looked up (i.e. which metaschema to fetch and what function to call with it), especially considering that this library (referencing) is meant to be JSON Schema implementation agnostic.
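As a rough illustration of that parametrization idea, here's a stdlib-only sketch where the registry takes an arbitrary check function on construction. Every name here (`ValidatingRegistry`, `draft7ish_check`, `SchemaError`) is hypothetical and not part of referencing's actual API, and the check is a toy stand-in for a real metaschema validation such as `Draft7Validator.check_schema`:

```python
# Hypothetical sketch -- names are illustrative, not referencing's API.
from typing import Callable


class SchemaError(Exception):
    """Raised when a resource fails its metaschema check."""


def draft7ish_check(schema: object) -> None:
    # Toy stand-in for jsonschema's DraftNValidator.check_schema: only
    # verifies that "type", if present, is a string or list of strings.
    if isinstance(schema, dict) and "type" in schema:
        t = schema["type"]
        ok = isinstance(t, str) or (
            isinstance(t, list) and all(isinstance(x, str) for x in t)
        )
        if not ok:
            raise SchemaError(f"invalid 'type': {t!r}")


class ValidatingRegistry:
    """Immutable-style registry that runs check_fn on every insertion."""

    def __init__(self, check_fn: Callable[[object], None], contents=None):
        self._check = check_fn
        self._contents = dict(contents or {})

    def with_resource(self, uri: str, schema: object) -> "ValidatingRegistry":
        self._check(schema)  # big boom! here if the schema is invalid
        return ValidatingRegistry(self._check, {**self._contents, uri: schema})

    def contents(self, uri: str) -> object:
        return self._contents[uri]


registry = ValidatingRegistry(draft7ish_check)
registry = registry.with_resource("urn:ok", {"type": "integer"})  # fine
try:
    registry.with_resource("urn:bad", {"type": 37})
except SchemaError as err:
    print("rejected:", err)
```

Because the check function is injected, an OpenAPI or AsyncAPI flavor could swap in its own validation without the registry knowing anything about jsonschema.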

Julian added the enhancement (New feature or request) label on Aug 9, 2023

Julian commented Aug 9, 2023

A further consideration I suppose is that whilst we don't yet have any OpenAPI / AsyncAPI-specific support, when we do, that support may also benefit from such an object, which is possibly further argument to not couple at all to jsonschema and think about how to do this quite generically while still making it easy for users to use.


sirosen commented Jan 19, 2024

Regarding how this relates back to #119 (I think it does relate; I think it's a good fit!): I don't think it's necessary or desirable for all of the validation to happen when a resource is added to the registry. You probably already have this idea in mind, but I'd allow for

invalid_schema = ...
registry = invalid_schema @ registry  # fine
registry.lookup(...)  # big boom!

as valid/correct behavior.

For jsonschema specifically, my worry would be that calling check_schema on every subschema could be very slow, and would effectively re-check many parts of a document many times. I think a full check might only be needed at the top level, but not checking every subschema seems like a less clean design. Perhaps this is solvable with a "simple" caching version of check_schema?
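One possible shape for such a cache, sketched with the stdlib only (the `make_cached_checker` helper is hypothetical; keying on a canonical JSON dump is one way to make unhashable dict schemas cacheable, at the cost of a serialization per check):

```python
import json


def make_cached_checker(check_schema):
    """Wrap a check_schema-style callable so each distinct schema is
    fully checked at most once. Only successful checks are cached, so
    an invalid schema still raises every time it is seen."""
    cache = set()

    def cached_check(schema) -> None:
        key = json.dumps(schema, sort_keys=True)
        if key in cache:
            return  # already known-valid; skip the re-check
        check_schema(schema)  # raises if invalid; nothing is cached then
        cache.add(key)

    return cached_check


calls = []


def slow_check(schema):
    calls.append(schema)  # pretend this walk is expensive


check = make_cached_checker(slow_check)
check({"type": "integer"})
check({"type": "integer"})  # cache hit, slow_check not called again
print(len(calls))  # -> 1
```

A real version would also need to decide what happens when schemas are mutated after insertion, which is part of why caching here is uncomfortable.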


Julian commented Jan 20, 2024

(I have not fully specced this out, so your thoughts are definitely welcome -- even more so than they always are. As usual I'll respond with first thoughts, so it's possible I'll change my mind quite easily. :)

For jsonschema, specifically, my worry would be that calling check_schema on every subschema could be very slow, and would effectively re-check many parts of a document many times.

The idea, I think, was to tie this to Registry.crawl()-ing -- i.e. a crawled SchemaValidatingRegistry would guarantee the invariant that all resources are valid, and an uncrawled one would not. I think that also solves the "slow" case, since it should™ guarantee we check every (sub)schema only once, and thereby not need any caching. (I'm loathe to even think about caching in jsonschema -- if/when we need that, maybe as part of Dialect v2 work, we may as well require schemas to be immutable dicts. But that too seems down the line.)
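A tiny stdlib-only sketch of that crawl-tied invariant (every name here -- `LazyValidatingRegistry`, `check` -- is invented for illustration, not referencing's API): insertion performs no validation, while `crawl()` checks each not-yet-validated resource exactly once and returns a registry for which the invariant holds.

```python
class SchemaError(Exception):
    pass


def check(schema):
    # Toy stand-in for a metaschema check: "type", if present, must be a str.
    if isinstance(schema, dict) and not isinstance(schema.get("type", ""), str):
        raise SchemaError(f"bad schema: {schema!r}")


class LazyValidatingRegistry:
    def __init__(self, contents=None, validated=frozenset()):
        self._contents = dict(contents or {})
        self._validated = validated  # URIs already known-valid

    def with_resource(self, uri, schema):
        # No validation here -- insertion stays cheap.
        return LazyValidatingRegistry(
            {**self._contents, uri: schema}, self._validated
        )

    def crawl(self):
        # Validate each not-yet-checked resource exactly once.
        for uri, schema in self._contents.items():
            if uri not in self._validated:
                check(schema)
        return LazyValidatingRegistry(
            self._contents, frozenset(self._contents)
        )


registry = LazyValidatingRegistry()
registry = registry.with_resource("urn:bad", {"type": 37})  # fine, uncrawled
try:
    registry.crawl()  # big boom!
except SchemaError as err:
    print("crawl rejected:", err)
```

Tracking already-validated URIs means a re-crawl after adding one more resource only checks the new one, which is where the "only once" guarantee comes from.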


sirosen commented Jan 24, 2024

Am I understanding correctly that crawl() would eagerly load and validate all $refs for a schema?
That's a potential problem for certain schemas, which may be factored out into many remote refs with the expectation that the validator will usually not have to load all of them. For a simple example:

"anyOf": [
  {"$ref": "https://example.com/schemas/version5.json"},
  {"$ref": "https://example.com/schemas/version4.json"},
  {"$ref": "https://example.com/schemas/version3.json"},
  {"$ref": "https://example.com/schemas/version2.json"},
  {"$ref": "https://example.com/schemas/version1.json"}
]

With the expectation that an implementation will try to match against version5.json before trying version4.json or even loading it, etc.

The only project I've encountered in the wild where this is very significant is mkdocs-material. They have a plugin community which provides their own schemas. As a result, $ref resolution can crawl dozens-to-hundreds of files! 😬


So that's sort of a question / input here: I think lazy evaluation is desirable for some users. Am I following the idea of crawl() correctly?


Julian commented Jan 24, 2024

And in that case you'd want to essentially defer checking that the schema is valid until the first time it's looked up, right? If so, I think that's definitely doable: deferring validation until "crawl-time", without forcing you to crawl all the schemas, sounds like the ideal behavior.
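Under that reading, a rough stdlib-only sketch of validate-at-first-lookup (hypothetical names, not referencing's API): resources go in unchecked, and each one is validated the first time it is resolved, so $ref targets that are never reached -- like the later anyOf branches above -- are never checked or loaded.

```python
class SchemaError(Exception):
    pass


def check(schema):
    # Toy stand-in for a metaschema check: "type", if present, must be a str.
    if isinstance(schema, dict) and not isinstance(schema.get("type", ""), str):
        raise SchemaError(f"bad schema: {schema!r}")


class LookupValidatingRegistry:
    def __init__(self):
        self._contents = {}
        self._checked = set()  # URIs validated on a previous lookup

    def add(self, uri, schema):
        self._contents[uri] = schema  # no validation yet

    def lookup(self, uri):
        schema = self._contents[uri]
        if uri not in self._checked:
            check(schema)  # big boom! only if this resource is bad
            self._checked.add(uri)
        return schema


registry = LookupValidatingRegistry()
registry.add("urn:version5", {"type": "integer"})
registry.add("urn:version1", {"type": 37})  # never looked up, never checked
registry.lookup("urn:version5")  # fine; urn:version1 stays untouched
```

The per-URI `_checked` set keeps the once-per-resource guarantee from the crawl-based design while preserving laziness for the mkdocs-material-style many-remote-refs case.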
