Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recursive model with discriminated union has exponential time and space complexity for validation #8709

Closed
1 task done
palle-k opened this issue Feb 2, 2024 · 20 comments · Fixed by #8904
Closed
1 task done
Assignees
Labels
bug V2 Bug related to Pydantic V2 pending Awaiting a response / confirmation
Milestone

Comments

@palle-k
Copy link

palle-k commented Feb 2, 2024

Initial Checks

  • I confirm that I'm using Pydantic V2

Description

When validating the recursive model given in the example, Pydantic has exponential time and space complexity, meaning that validation time and memory usage may grow indefinitely and rapidly depending on model depth, even when using a discriminator field.

Given the presence of a discriminator field, this should not be the case.

A model like this can potentially be abused in a denial-of-service attack, if the data passed to the validator can be controlled by the user.

Example Code

from __future__ import annotations
from typing import Literal, Annotated
from pydantic import Field, TypeAdapter, BaseModel

class NestedState(BaseModel):
    state_type: Literal["nested"]
    substate: AnyState

# If this type is left out, the model behaves normally again
class LoopState(BaseModel):
    state_type: Literal["loop"]
    substate: AnyState

class LeafState(BaseModel):
    state_type: Literal["leaf"]

AnyState = Annotated[NestedState | LoopState | LeafState, Field(..., discriminator="state_type")]

def build_nested_state(n):
    if n <= 0:
        return {"state_type": "leaf"}
    else:
        return {"state_type": "loop", "substate": {"state_type": "nested", "substate": build_nested_state(n-1)}}
        
adapter = TypeAdapter(AnyState)

# the next statement takes around 0.8s
adapter.validate_python(build_nested_state(9))

# the next statement takes around 3.5s
adapter.validate_python(build_nested_state(10))

# the next statement takes around 12.5s
adapter.validate_python(build_nested_state(11))

Python, Pydantic & OS Version

pydantic version: 2.6.0
        pydantic-core version: 2.16.1
          pydantic-core build: profile=release pgo=true
                 install path: /<redacted>/venv-3.11/lib/python3.11/site-packages/pydantic
               python version: 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)]
                     platform: macOS-14.3-arm64-arm-64bit
             related packages: fastapi-0.105.0 typing_extensions-4.8.0 pydantic-settings-2.1.0
                       commit: unknown
@palle-k palle-k added bug V2 Bug related to Pydantic V2 pending Awaiting a response / confirmation labels Feb 2, 2024
@palle-k palle-k changed the title Recursive model with discriminated union requires exponential time and space complexity for validation Recursive model with discriminated union has exponential time and space complexity for validation Feb 5, 2024
@sydney-runkle
Copy link
Member

@palle-k,

Thanks for reporting this. @davidhewitt, could you please take a look at this performance issue when you have a chance? Thank you!

@davidhewitt
Copy link
Contributor

It looks like there's a potential schema-building bug here.

The core schema looks like this:

{'definitions': [{'cls': <class '__main__.NestedState'>,
                  'config': {'title': 'NestedState'},
                  'custom_init': False,
                  'metadata': {'pydantic.internal.needs_apply_discriminated_union': False,
                               'pydantic_js_annotation_functions': [],
                               'pydantic_js_functions': [functools.partial(<function modify_model_json_schema at 0x7fba5440fce0>, cls=<class '__main__.NestedState'>),
                                                         <bound method BaseModel.__get_pydantic_json_schema__ of <class '__main__.NestedState'>>]},
                  'ref': '__main__.NestedState:94515929304032',
                  'root_model': False,
                  'schema': {'computed_fields': [],
                             'fields': {'state_type': {'metadata': {'pydantic_js_annotation_functions': [<function get_json_schema_update_func.<locals>.json_schema_update_func at 0x7fba54486fc0>],
                                                                    'pydantic_js_functions': []},
                                                       'schema': {'expected': ['nested'],
                                                                  'metadata': {'pydantic.internal.needs_apply_discriminated_union': False},
                                                                  'type': 'literal'},
                                                       'type': 'model-field'},
                                        'substate': {'metadata': {'pydantic_js_annotation_functions': [<function get_json_schema_update_func.<locals>.json_schema_update_func at 0x7fba54487100>],
                                                                  'pydantic_js_functions': []},
                                                     'schema': {'default': Ellipsis,
                                                                'schema': {'choices': [{'metadata': {'pydantic.internal.needs_apply_discriminated_union': False},
                                                                                        'schema_ref': '__main__.NestedState:94515929304032',
                                                                                        'type': 'definition-ref'},
                                                                                       {'metadata': {'pydantic.internal.needs_apply_discriminated_union': False},
                                                                                        'schema_ref': '__main__.LoopState:94515929369184',
                                                                                        'type': 'definition-ref'},
                                                                                       {'metadata': {'pydantic.internal.needs_apply_discriminated_union': False},
                                                                                        'schema_ref': '__main__.LeafState:94515929326848',
                                                                                        'type': 'definition-ref'}],
                                                                           'metadata': {'pydantic.internal.needs_apply_discriminated_union': True,
                                                                                        'pydantic.internal.union_discriminator': 'state_type'},
                                                                           'type': 'union'},
                                                                'type': 'default'},
                                                     'type': 'model-field'}},
                             'model_name': 'NestedState',
                             'type': 'model-fields'},
                  'type': 'model'},
                 {'cls': <class '__main__.LoopState'>,
                  'config': {'title': 'LoopState'},
                  'custom_init': False,
                  'metadata': {'pydantic.internal.needs_apply_discriminated_union': False,
                               'pydantic_js_annotation_functions': [],
                               'pydantic_js_functions': [functools.partial(<function modify_model_json_schema at 0x7fba5440fce0>, cls=<class '__main__.LoopState'>),
                                                         <bound method BaseModel.__get_pydantic_json_schema__ of <class '__main__.LoopState'>>]},
                  'ref': '__main__.LoopState:94515929369184',
                  'root_model': False,
                  'schema': {'computed_fields': [],
                             'fields': {'state_type': {'metadata': {'pydantic_js_annotation_functions': [<function get_json_schema_update_func.<locals>.json_schema_update_func at 0x7fba544871a0>],
                                                                    'pydantic_js_functions': []},
                                                       'schema': {'expected': ['loop'],
                                                                  'metadata': {'pydantic.internal.needs_apply_discriminated_union': False},
                                                                  'type': 'literal'},
                                                       'type': 'model-field'},
                                        'substate': {'metadata': {'pydantic_js_annotation_functions': [<function get_json_schema_update_func.<locals>.json_schema_update_func at 0x7fba54487380>],
                                                                  'pydantic_js_functions': []},
                                                     'schema': {'default': Ellipsis,
                                                                'schema': {'choices': [{'metadata': {'pydantic.internal.needs_apply_discriminated_union': False},
                                                                                        'schema_ref': '__main__.NestedState:94515929304032',
                                                                                        'type': 'definition-ref'},
                                                                                       {'metadata': {'pydantic.internal.needs_apply_discriminated_union': False},
                                                                                        'schema_ref': '__main__.LoopState:94515929369184',
                                                                                        'type': 'definition-ref'},
                                                                                       {'metadata': {'pydantic.internal.needs_apply_discriminated_union': False},
                                                                                        'schema_ref': '__main__.LeafState:94515929326848',
                                                                                        'type': 'definition-ref'}],
                                                                           'metadata': {'pydantic.internal.needs_apply_discriminated_union': True,
                                                                                        'pydantic.internal.union_discriminator': 'state_type'},
                                                                           'type': 'union'},
                                                                'type': 'default'},
                                                     'type': 'model-field'}},
                             'model_name': 'LoopState',
                             'type': 'model-fields'},
                  'type': 'model'},
                 {'cls': <class '__main__.LeafState'>,
                  'config': {'title': 'LeafState'},
                  'custom_init': False,
                  'metadata': {'pydantic.internal.needs_apply_discriminated_union': False,
                               'pydantic_js_annotation_functions': [],
                               'pydantic_js_functions': [functools.partial(<function modify_model_json_schema at 0x7fba5440fce0>, cls=<class '__main__.LeafState'>),
                                                         <bound method BaseModel.__get_pydantic_json_schema__ of <class '__main__.LeafState'>>]},
                  'ref': '__main__.LeafState:94515929326848',
                  'root_model': False,
                  'schema': {'computed_fields': [],
                             'fields': {'state_type': {'metadata': {'pydantic_js_annotation_functions': [<function get_json_schema_update_func.<locals>.json_schema_update_func at 0x7fba54484fe0>],
                                                                    'pydantic_js_functions': []},
                                                       'schema': {'expected': ['leaf'],
                                                                  'metadata': {'pydantic.internal.needs_apply_discriminated_union': False},
                                                                  'type': 'literal'},
                                                       'type': 'model-field'}},
                             'model_name': 'LeafState',
                             'type': 'model-fields'},
                  'type': 'model'}],
 'schema': {'choices': {'leaf': {'schema_ref': '__main__.LeafState:94515929326848',
                                 'type': 'definition-ref'},
                        'loop': {'schema_ref': '__main__.LoopState:94515929369184',
                                 'type': 'definition-ref'},
                        'nested': {'schema_ref': '__main__.NestedState:94515929304032',
                                   'type': 'definition-ref'}},
            'discriminator': 'state_type',
            'from_attributes': True,
            'metadata': {'pydantic.internal.needs_apply_discriminated_union': True,
                         'pydantic.internal.union_discriminator': 'state_type'},
            'strict': False,
            'type': 'tagged-union'},
 'type': 'definitions'}

In particular the interesting bit to me is that we have 3 definitions instead of 4; the AnyState union is inlined in each model. But in the inlined model copies it is 'type': 'union' instead of 'type': 'tagged-union', which would explain where there's linear search rather than tagged dispatch at each level.

Hence the exponential blowup.

cc @adriangb as the schema building expert ;)

@adriangb
Copy link
Member

adriangb commented Feb 8, 2024

The schema simplifies to:

{
    "type": "definitions",
    "schema": {
        "type": "tagged-union",
        "choices": {
            "nested": {
                "type": "definition-ref",
                "schema_ref": "__main__.NestedState:4587657680"
            },
            "loop": {
                "type": "definition-ref",
                "schema_ref": "__main__.LoopState:4587670848"
            },
            "leaf": {
                "type": "definition-ref",
                "schema_ref": "__main__.LeafState:5124597648"
            }
        },
        "discriminator": "state_type"
    },
    "definitions": [
        {
            "type": "model",
            "cls": "__main__.LeafState",
            "schema": {
                "type": "model-fields",
                "fields": {
                    "state_type": {
                        "type": "model-field",
                        "schema": {
                            "type": "literal",
                            "expected": [
                                "leaf"
                            ]
                        }
                    }
                },
                "model_name": "LeafState"
            },
            "ref": "__main__.LeafState:5124597648"
        },
        {
            "type": "model",
            "cls": "__main__.LoopState",
            "schema": {
                "type": "model-fields",
                "fields": {
                    "state_type": {
                        "type": "model-field",
                        "schema": {
                            "type": "literal",
                            "expected": [
                                "loop"
                            ]
                        }
                    },
                    "substate": {
                        "type": "model-field",
                        "schema": {
                            "type": "default",
                            "schema": {
                                "type": "union",
                                "choices": [
                                    {
                                        "type": "definition-ref",
                                        "schema_ref": "__main__.NestedState:4587657680"
                                    },
                                    {
                                        "type": "definition-ref",
                                        "schema_ref": "__main__.LoopState:4587670848"
                                    },
                                    {
                                        "type": "definition-ref",
                                        "schema_ref": "__main__.LeafState:5124597648"
                                    }
                                ]
                            }
                        }
                    }
                }
            },
            "ref": "__main__.LoopState:4587670848"
        },
        {
            "type": "model",
            "cls": "__main__.NestedState",
            "schema": {
                "type": "model-fields",
                "fields": {
                    "state_type": {
                        "type": "model-field",
                        "schema": {
                            "type": "literal",
                            "expected": [
                                "nested"
                            ]
                        }
                    },
                    "substate": {
                        "type": "model-field",
                        "schema": {
                            "type": "default",
                            "schema": {
                                "type": "union",
                                "choices": [
                                    {
                                        "type": "definition-ref",
                                        "schema_ref": "__main__.NestedState:4587657680"
                                    },
                                    {
                                        "type": "definition-ref",
                                        "schema_ref": "__main__.LoopState:4587670848"
                                    },
                                    {
                                        "type": "definition-ref",
                                        "schema_ref": "__main__.LeafState:5124597648"
                                    }
                                ]
                            }
                        }
                    }
                }
            },
            "ref": "__main__.NestedState:4587657680"
        }
    ]
}

The issue is that the tagged union is not carried along so it's loss after the first union and thus performance degrades.
So yes it's a schema generation bug.

@sydney-runkle is that enough to go off of? I'm guessing we're not doing the apply_discrimnators stuff correctly.

@sydney-runkle
Copy link
Member

Hmm, this looks similar to this issue: #8688, which we fixed here #8702 for the 2.6.1 release. @palle-k, could you please confirm whether or not this issue has been resolved with 2.6.1?

@davidhewitt
Copy link
Contributor

Afraid I'm reproducing this on main 😢

@palle-k
Copy link
Author

palle-k commented Feb 8, 2024

@sydney-runkle unfortunately still reproducible with 2.6.1, as David already noticed.

@sydney-runkle
Copy link
Member

@davidhewitt, ah rats, ok 🐀.

@rossmacarthur
Copy link

+1 this seems pretty critical 😬, is there any workaround?

@rossmacarthur
Copy link

rossmacarthur commented Feb 13, 2024

This is a regression because I don't see this issue with

pydantic==2.4.2
pydantic_core==2.10.1

Unfortunately I can't downgrade because my dependencies have updated to >2.5.0

@sydney-runkle
Copy link
Member

@rossmacarthur,

Thanks for the +1. I've added to the list of issues we'll attempt to fix for 2.6.2. Hopefully we can get that release out next week. I'll keep you updated as we progress. I don't currently know of a workaround for this performance issue.

@davidhewitt
Copy link
Contributor

Here's a somewhat clumsy workaround:

from __future__ import annotations
from typing import Any, Literal, Annotated
from pydantic import AfterValidator, Field, TypeAdapter, BaseModel
import time

def validate_any_state(value: Any):
    return _ADAPTER.validate_python(value)

AnyState = Annotated[Any, AfterValidator(validate_any_state)]

class NestedState(BaseModel):
    state_type: Literal["nested"]
    substate: AnyState

# If this type is left out, the model behaves normally again
class LoopState(BaseModel):
    state_type: Literal["loop"]
    substate: AnyState

class LeafState(BaseModel):
    state_type: Literal["leaf"]

_ADAPTER = TypeAdapter(Annotated[NestedState | LoopState | LeafState, Field(..., discriminator="state_type")])

def build_nested_state(n):
    if n <= 0:
        return {"state_type": "leaf"}
    else:
        return {"state_type": "loop", "substate": {"state_type": "nested", "substate": build_nested_state(n-1)}}
        
for i in range(13):
    start = time.time()
    _ADAPTER.validate_python(build_nested_state(i))
    print(f"Time for {i}: {time.time() - start}")

It works by cutting the recursive model apart a bit crudely with the AfterValidator. This has downsides; it's opaque to pydantic-core so will affect the way that things like model_config are propagated and usually an AfterValidator is not as efficient as builtin design, if we can support it. In this case while we've got the performance bug the AfterValidator performs significantly better 😂

Disclaimer: I haven't checked if doing this creates other unexpected effects. YMMV.

@rossmacarthur
Copy link

I git bisected pydantic and pydantic-core and the bug was introduced in these commits

pydantic: bump pydantic-core to 2.14.0: fb8a594
pydantic-core: replace ultra_strict with new union implementation: pydantic/pydantic-core@f409e00

@davidhewitt
Copy link
Contributor

Thanks for checking that 👍

Given that those commits changed the way a non-tagged union is evaluated, that's not a huge surprise. I suspect that the truth is the schema building bug illustrated above existed before those commits, but the newer smart union implementation (which tries to be more accurate in which type gets produced) exacerbated the existing bug.

@sydney-runkle
Copy link
Member

Ah I see. I don't think we necessarily need to include this fix in 2.6.2 then, though the sooner we get a fix the better. I'll add this to the 2.7 milestone.

@sydney-runkle sydney-runkle added this to the v2.7.0 milestone Feb 15, 2024
@rossmacarthur
Copy link

If anyone else needs a workaround until this fixed I'm currently using a Discriminator with a custom function and annotating each type with Tag. The built schema always has a tagged-union in this case:

def get_state_type(v: Any) -> Hashable:
    if isinstance(v, dict):
        return v.get("state_type")
    return getattr(v, "state_type", None)

AnyState = Annotated[
    Annotated[NestedState, Tag("nested")]
    | Annotated[LoopState, Tag("loop")]
    | Annotated[LeafState, Tag("leaf")],
    Discriminator(get_state_type),
]

@palle-k
Copy link
Author

palle-k commented Feb 15, 2024

@sydney-runkle by the way, as I've mentioned previously, this issue can lead to a denial of service attack (and I did find apps which are vulnerable to this issue). Should we publish a security advisory somewhere?

@tarapt
Copy link

tarapt commented Feb 25, 2024

@shivenmian
Copy link

shivenmian commented Feb 25, 2024

Facing this issue as well at both the 01 and open-interpreter projects on my M2 Macbook Pro. Pydantic version 2.6.2, pydantic-core version 2.16.3.

Things were working fine for me yesterday, but since today morning, my FastAPI server does not start due to this Pydantic schema issue. This is also coming up in my OpenAI dependency, for which I've opened an issue here.

I did a memray memory profile to investigate further. Here is the report from open-interpreter, and here is the report from the 01 project. In both, the inflated memory seems to be due to this issue, since both FastAPI and OpenAI use Pydantic.

Could you give a timeline for this issue to be fixed? If this is not expected soon, could you give a patch I can apply in the meantime?

cc @KillianLucas

@sydney-runkle
Copy link
Member

Hi all,

Given the significance of this bug, we're working on a fix now, and will plan to do a 2.6.3 release with said fix, once we have it! We'll keep this thread updated as we progress.

@sydney-runkle
Copy link
Member

Hey all 😄,

After some extended debugging, we found a fix, and will release 2.6.3 today with the changes. Please ping me if you're experiencing any other issues. Happy coding! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug V2 Bug related to Pydantic V2 pending Awaiting a response / confirmation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants