Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolving a $ref to a nested definition can fail to catch that a (bad) subschema is neither bool nor object, resulting in an AttributeError #119

Open
sirosen opened this issue Jan 17, 2024 · 4 comments

Comments

@sirosen
Copy link
Member

sirosen commented Jan 17, 2024

This originates downstream in python-jsonschema/check-jsonschema#376 . I was just able to look at the offending schema and trace things out enough to find the true cause. I've stuck with using the CLI for my reproducers, but I could work up some python for this if it would be useful.

check-jsonschema is using jsonschema+referencing. It keeps the jsonschema behavior of checking a schema against its metaschema. Therefore, the following schema is caught as invalid:

invalid-caught-by-metaschema
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "type": "object",
  "definitions": {
    "foo": "invalid"
  },
  "properties": {
    "foo": {
      "$ref": "#/definitions/foo"
    }
  }
}

(Note how the definition for "foo" is a string, rather than a schema.)


However, if the invalid definition is nested, as follows, the metaschema check is not sufficient:

invalid-not-caught-by-metaschema
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "type": "object",
  "definitions": {
    "sub": {
      "foo": "invalid"
    }
  },
  "properties": {
    "foo": {
      "$ref": "#/definitions/sub/foo"
    }
  }
}

As a result, it is possible, with jsonschema+referencing, to be handling this in a validator. The next step is to try to use it.
With the check-jsonschema CLI, we get the following trace (trimmed to relevant parts):

full-cli-traceback
$ check-jsonschema --schemafile badrefschema.json <(echo '{"foo": "bar"}')  
Traceback (most recent call last):
  
  <<<<<check-jsonschema invocation path shows here>>>>>

  File "/home/sirosen/projects/jsonschema/check-jsonschema/src/check_jsonschema/checker.py", line 73, in _build_result
    for err in validator.iter_errors(data):
  File "/home/sirosen/projects/jsonschema/check-jsonschema/.venv/lib/python3.11/site-packages/jsonschema/validators.py", line 368, in iter_errors
    for error in errors:
  File "/home/sirosen/projects/jsonschema/check-jsonschema/.venv/lib/python3.11/site-packages/jsonschema/_keywords.py", line 295, in properties
    yield from validator.descend(
  File "/home/sirosen/projects/jsonschema/check-jsonschema/.venv/lib/python3.11/site-packages/jsonschema/validators.py", line 416, in descend
    for error in errors:
  File "/home/sirosen/projects/jsonschema/check-jsonschema/.venv/lib/python3.11/site-packages/jsonschema/_keywords.py", line 274, in ref
    yield from validator._validate_reference(ref=ref, instance=instance)
  File "/home/sirosen/projects/jsonschema/check-jsonschema/.venv/lib/python3.11/site-packages/jsonschema/validators.py", line 410, in descend
    for k, v in applicable_validators(schema):
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sirosen/projects/jsonschema/check-jsonschema/.venv/lib/python3.11/site-packages/jsonschema/_legacy_keywords.py", line 17, in ignore_ref_siblings
    ref = schema.get("$ref")
          ^^^^^^^^^^
AttributeError: 'str' object has no attribute 'get'

This shows an error in jsonschema (not quite at referencing yet!). I had to tinker around with things to find the right trigger conditions, but this seems to do the trick, as a nasty schema:

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "type": "object",
  "definitions": {
    "sub": {
      "foo": {
        "type": "object",
        "properties": {
          "bar": "invalid"
        }
      }
    }
  },
  "properties": {
    "foo": {
      "$ref": "#/definitions/sub/foo"
    }
  }
}

Now, trying to use it triggers an attempt to call .get() on a string:

$ check-jsonschema --schemafile badrefschema.json <(echo '{"foo": {"bar": "baz"}}')

(trimmed trace)

  File "/home/sirosen/projects/jsonschema/check-jsonschema/.venv/lib/python3.11/site-packages/jsonschema/validators.py", line 408, in descend
    resolver = self._resolver.in_subresource(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sirosen/projects/jsonschema/check-jsonschema/.venv/lib/python3.11/site-packages/referencing/_core.py", line 689, in in_subresource
    id = subresource.id()
         ^^^^^^^^^^^^^^^^
  File "/home/sirosen/projects/jsonschema/check-jsonschema/.venv/lib/python3.11/site-packages/referencing/_core.py", line 225, in id
    id = self._specification.id_of(self.contents)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sirosen/projects/jsonschema/check-jsonschema/.venv/lib/python3.11/site-packages/referencing/jsonschema.py", line 50, in _legacy_dollar_id
    id = contents.get("$id")
         ^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'get'

Proposed Resolution

Obviously, the best resolution is for such a schema to never get written, since it's invalid. 😁
And I'll be trying to advise the upstream provider of the schema against nesting things under definitions, since it weakens the kind of checking which is possible.

But should various referencing internals be more wary that an input may be a dict, a bool, or something unwanted/unexpected? Or should referencing check for bad inputs more aggressively when loading data?
The goal here is to have a better and clearer error in this case, not to ignore the malformed schema.

I think the best thing here is that one of the levels of document loading in Referencing checks if the value is dict | boolean. If we only consider JSON Schema, it could be a validator on Resource.contents. That makes the most sense to me, conceptually, for JSON Schema, but it ties in non-generic details of that spec (vs more general $ref resolution). Perhaps JSONSchemaResource(Resource) is the right thing, which adds said validation?

(NB: I'm reading a lot of the internals pretty quickly and for the first time, so my ideas here may not hold up.)

@Julian
Copy link
Member

Julian commented Jan 17, 2024

Thanks (as usual) for the detailed report and thoughts.

This isn't the first time I've thought about this sort of thing -- my first reaction which I still want to at least attempt pursuing is that this kind of thing is probably something that needs improving at the jsonschema.Validator layer -- specifically it's a real shame that the metaschemas are nowhere near tight enough to catch issues, especially of this sort. I have a bunch of plans there (including some wild ones like investigating what a dialect of JSON Schema is like that would be able to validate all the things the spec says are invalid) -- but to be short -- I think that's the first place I'd look to improve things. The main reason that hasn't happened (beyond normal "prioritization") is that the first step for that I think is collecting a good test suite of invalid schemas.

I'm going to leave this open regardless and have another think/look at it -- just sharing the first reactions.

@sirosen
Copy link
Member Author

sirosen commented Jan 19, 2024

This makes sense to me. It's just at a weird boundary where the contracts between components are not as clean as I would like -- I'd really like it if we could verify that a schema won't "break referencing" (raise internal errors) before passing it in.

The contract with referencing is, correct me if I'm wrong, "pass in valid/good data and it will work, invalid data may fail with internal errors and that's not a bug."
Which I think is a fine position for this library. It avoids unnecessary checks, stays fast and simple, etc.

But it makes it hard to decide how to handle this in an application which takes a schema as input. Sticking with check-jsonschema, what should the user-facing behavior be? Probably, "your schema is bad, exit(1)". Which is... sort of what happens. With a stacktrace! 😂
If validator.validate(instance) fails (looking at it from the jsonschema level) with an error due to a malformed schema, I feel like that should be some kind of custom error type.


Perhaps another way of looking at this is that retrieve=... lets me pass in a callable which can do arbitrary work, including arbitrary validation.

Should referencing support a hook which runs whenever a $ref is resolved, and then the caller can apply validation (and transformations?!) there? Separate question: if so, should jsonschema use that hook by default?

@Julian
Copy link
Member

Julian commented Jan 19, 2024

The contract with referencing is, correct me if I'm wrong, "pass in valid/good data and it will work, invalid data may fail with internal errors and that's not a bug."

Yes (which maybe is a funny contract, but happens to be the same one as jsonschema.Validator(schema) -- and its jsonschema.Validator.check_schema's job to help if you're unsure essentially whether you have good input.

Probably, "your schema is bad, exit(1)". Which is... sort of what happens. With a stacktrace! 😂

@Julian
Copy link
Member

Julian commented Jan 19, 2024

.. no idea what just sent that comment, I was mid-typing and hit nothing else -- anyways ...

Probably, "your schema is bad, exit(1)". Which is... sort of what happens. With a stacktrace! 😂

I think until jsonschema.Validator.check_schema is souped up I'd do some adhoc handling of this case in check-jsonschema -- or perhaps we could build up some private function that essentially wraps check_schema but which handles this too. The thing is it's not actually easy to catch without doing some validation work...

Perhaps another way of looking at this is that retrieve=... lets me pass in a callable which can do arbitrary work, including arbitrary validation.

I think maybe this is related to #68 -- i.e. having a registry where when you put resources in it, calls out to a library to ensure those resources are valid. That's the sort of thing such a callable woud do right?

Like I say in that issue, I think we'd still want to stay "agnostic" to validator libraries, but then yes I think having one in jsonschema that calls check_schema would be super intended!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants