Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic model without arguments leads to serialization data loss after V2 #6895

Closed
1 task done
MarkusSintonen opened this issue Jul 27, 2023 · 8 comments
Closed
1 task done
Assignees
Labels
bug V2 Bug related to Pydantic V2

Comments

@MarkusSintonen
Copy link
Contributor

MarkusSintonen commented Jul 27, 2023

Initial Checks

  • I confirm that I'm using Pydantic V2

Description

On our codebase we had accidental piece of code where a Pydantic generic model was missing arguments which leads to silent data loss on serialization. See below example how the data loss may happen. Gladly our test suite was able to catch this but opening this issue so you are aware of this (maybe a major) breaking change. Could we consider in V2 that initializing a generic model without arguments gives a validation failure or disallowing it to happen via mypy-plugin? Could also be a model-config parameter to disallow usage of generics with missing args.

Below example passes with V1.

Example Code

import json
from typing import Generic, TypeVar

import pytest
from pydantic import BaseModel


@pytest.mark.parametrize("dump_json", [False, True])
def test_data_loss(dump_json: bool):
    class InnerBase(BaseModel):
        a: str

    class Inner(InnerBase):
        b: str

    T = TypeVar("T", bound=InnerBase)

    class MyModel(BaseModel, Generic[T]):
        val: T

    # Type annotation for "obj" for mypy. But its forgotten in model creation, oooops!
    obj: MyModel[Inner] = MyModel(
        val=Inner(a="foo", b="bar")
    )
    assert obj.val.a == "foo"
    assert obj.val.b == "bar"  # Its there

    # Field 'b' disappears in serialization, oh my!
    # This works with proper MyModel[Inner]
    if dump_json:
        # AssertionError: assert {'val': {'a': 'foo'}} == {'val': {'a': 'foo', 'b': 'bar'}}
        assert obj.model_dump() == {"val": {"a": "foo", "b": "bar"}}
    else:
        #  AssertionError: assert {'val': {'a': 'foo'}} == {'val': {'a': 'foo', 'b': 'bar'}}
        assert json.loads(obj.model_dump_json()) == {"val": {"a": "foo", "b": "bar"}}

Python, Pydantic & OS Version

pydantic version: 2.1.1
        pydantic-core version: 2.4.0
          pydantic-core build: profile=release pgo=false mimalloc=true
                 install path: /Users/markussintonen/Library/Caches/pypoetry/virtualenvs/leak-test-VvUvdXxD-py3.11/lib/python3.11/site-packages/pydantic
               python version: 3.11.4 (main, Jul 17 2023, 14:44:40) [Clang 14.0.3 (clang-1403.0.22.14.1)]
                     platform: macOS-13.4.1-x86_64-i386-64bit
     optional deps. installed: ['typing-extensions']

Selected Assignee: @samuelcolvin

@MarkusSintonen MarkusSintonen added bug V2 Bug related to Pydantic V2 unconfirmed Bug not yet confirmed as valid/applicable labels Jul 27, 2023
@MarkusSintonen
Copy link
Contributor Author

MarkusSintonen commented Jul 27, 2023

This might be kind of related to #6882 This is just a different variation of the issue. I understand that in V2 its desirable to not "expose" fields of a derived model-instance when the field-model uses a super class of the derived one.

You are free to close the issue. But it would be nice to somehow capture the issue statically (mypy) or at creation time.

@dmontagu
Copy link
Contributor

dmontagu commented Jul 27, 2023

@MarkusSintonen thanks for another clear and thoughtful issue report.

So, I'll just point out at least two things you can do to help resolve this (but spoiler alert: I don't think they address your concern fully):

Option 1: Use SerializeAsAny (though I understand in this case that's not exactly what you want, you just want it to do a better job of 'inferring' the generic parameter). While you can do val: SerializeAsAny[T], and that resolves your issue, it does suffer from the fact that you can no longer enforce following a schema by parametrizing. And so, I actually think for your use case it may be more applicable to do: T = TypeVar("T", bound=SerializeAsAny[InnerBase]) — this will cause it to "ignore" the schema when serializing if the parameter wasn't explicitly specified on the type:

import json
from typing import Generic, TypeVar

from pydantic import BaseModel, SerializeAsAny

dump_json = False


class InnerBase(BaseModel):
    a: str


class Inner(InnerBase):
    b: str


T = TypeVar("T", bound=SerializeAsAny[InnerBase])


class MyModel(BaseModel, Generic[T]):
    val: T


print(MyModel(val=Inner(a="foo", b="bar")).model_dump())
#> {'val': {'a': 'foo', 'b': 'bar'}}

print(MyModel[InnerBase](val=Inner(a="foo", b="bar")).model_dump())
#> {'val': {'a': 'foo'}}

Option 2: just explicitly specify the generic parameters when instantiating the model (which I did in the final line above) — this would address the issue in both of your cases above (i.e., using obj: Model[Inner] = ModelInner`).

Sadly, I haven't found a great way to get the behavior you really want, which is that:

obj: MyModel[Inner] = MyModel(val=Inner(a="foo", b="bar"))

isn't actually a lie to mypy that can't be caught (since MyModel[Inner] is not a subclass of MyModel[InnerBase]).

(Knowing your skill with both python and pydantic, and your comments about wanting to capture the issue statically, I expect you've already considered both of the above approaches and find them wanting. But I figured I'd still include it explicitly for anyone else reading this who might not have considered them.)


Regarding improving the situation with type checkers:

I think this would be very hard to get handled better by mypy because of the sorts of assumptions that mypy's typing makes about how generic parameters work.

But, on the other hand, making pydantic generics behave more like standard typing generics would also be hard because it would be hard to get separate validators etc. producing the same class. (The idea would be to return a custom subclass of typing._GenericAlias that has a type-parameter-specific schema but ultimately creates instances of the same underlying class.)

I think not impossible, and @adriangb and I investigated it pretty deeply a few months ago before giving up, I think we both still think it's possible with enough elbow grease, though may introduce some (subtle) breaking changes for advanced users of pydantic generics — for example, at least without additional effort and API implementations, instances of the (generic model) class won't know what type parameters were used during their validation (in contrast with the current behavior, where a subclass specific to the specific set of type parameters is created). So at any rate such changes may be a better target for v3.

At any rate, definitely open to other suggestions related to how this might get improved.

@MarkusSintonen
Copy link
Contributor Author

MarkusSintonen commented Jul 28, 2023

Option 1

I feel its better to fix the issue we have in code where its accidentally using the generics without params. Would always remember to use SerializeAsAny in all the TypeVars which is quite special for Pydantic. So its prone to getting forgotten by many others. On the other hand its also prone to break when its possible to forget the generic params as it only shows in serialization 😕

Option 2

I think this is better option so using Pydantic generics "correctly". But also this is prone to suddenly fail as both options require developer to remember to use generics carefully.

@dmontagu What do you think about adding atleast deprecation warnings when user tries to call model initializers when it has free generic params? (I feel it shouldnt allow it to get initialized in V2). If the warning even has a dedicated class it would allow asserting in the unit tests that no such warning ever gets raised (assuming full code coverage).

FYI I had to temporarily use my fork to verify our large codebase didnt have any lingering issues: https://github.com/pydantic/pydantic/compare/main...MarkusSintonen:pydantic:check-no-generics-without-args?expand=1 This allowed to spot couple of additional risky places which were not captured by tests

@dmontagu
Copy link
Contributor

@MarkusSintonen My gut reaction is I don't love the idea of doing a deprecation warning about this because I could imagine a future where we actually change things up and decide this is the intended usage pattern, and it would be unfortunate to un-deprecate it. (And I wish it were possible for it to behave as desired like this, since this is the pattern that standard library generics use. But obviously there are reasons this is probably not a great choice for pydantic, which we've discussed above.)

I could be convinced otherwise on this, but I also want to suggest another option — a config setting for what to do if you try to instantiate a generic model with free parameters (with choices 'ignore', 'warn', 'error'). I think @adriangb wanted to make it an error to instantiate a generic basemodel with free parameters anyway, and this might be a reasonable migration path to that that doesn't commit us as much if we find a better/preferable solution to this in the future. And unlike most config settings, I think this wouldn't be very hard to support and wouldn't have complex interactions with other config settings, so I'm less concerned about the maintenance burden it would add.

Thoughts on that?

@MarkusSintonen
Copy link
Contributor Author

Thoughts on that?

Config parameter would sound fine to me!

I think @adriangb wanted to make it an error to instantiate a generic basemodel with free parameters

I also think it would be pretty sensible to make it error by default. That would make migration to V2 much safer. But could be configurable if someone really wants to use it so. Then in future if there is some kind of smarter inference (probably very hard) then the default could be changed

@dmontagu
Copy link
Contributor

dmontagu commented Jul 28, 2023

@MarkusSintonen another thing we could do that I was hesitant to mention, but will put on the table, is that we could apply the SerializeAsAny behavior automatically when attributes are typevars that are falling back to non-parametrized default values. (Only relevant if that proposed config setting isn't set to 'error', of course.)

My main concern is that this might be a bit less intuitive once you understand how things work, but it would make it safer to migrate from v1, and I suspect in most cases this would probably be the preferred behavior anyway if you are purposely not specifying a parametrization and know that you could. @adriangb not sure if you'd disagree strongly enough to veto that idea though, haha.

@dmontagu dmontagu removed the unconfirmed Bug not yet confirmed as valid/applicable label Jul 28, 2023
@MarkusSintonen
Copy link
Contributor Author

SerializeAsAny behavior automatically when attributes are typevars that are falling back to non-parametrized default values

Does it incur then some "hidden" performance penalty btw?

@sydney-runkle
Copy link
Member

sydney-runkle commented Oct 13, 2023

I think this was fixed with #7606.

Happy to reopen if anyone is still running into this problem, but the initial MRE is now passing for me on the latest version of Pydantic 😄.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug V2 Bug related to Pydantic V2
Projects
None yet
Development

No branches or pull requests

4 participants