How to Avoid Pandera Doc Injection? #1564

Open
kernelpernel opened this issue Apr 11, 2024 · 5 comments
Labels
question Further information is requested

Comments

@kernelpernel

Question about pandera

We use pandera at work for our dataframe schemas, and we use Sphinx to generate docs for our Python libraries. Unfortunately, the documentation for pandera.api.pandas.container.DataFrameSchema keeps getting injected into our Sphinx-generated documentation. As a workaround, we have made most of our schema classes private to keep them out of the generated docs.

We have also tried writing decorators for our own classes to sanitize the docs, but this has been challenging as well. Looking at the entire attribute stack for a class that inherits from pa.DataFrameSchema, most of the doc attributes appear empty. When we try to scrub the docs that come from pandera modules, we lose all of our own documentation and are left with only the cat, dog, duck example from pa.DataFrameSchema.

Is this a pandera bug? If not, is there a way that we could suppress the doc injection without removing our own documentation?

TL;DR: pandera is injecting its documentation into our own documentation (especially from pandera.api.pandas.container.DataFrameSchema). Is there a way to prevent this from happening?

kernelpernel added the question (Further information is requested) label on Apr 11, 2024
@cosmicBboy
Collaborator

Thanks for bringing this up @kernelpernel. Would it be possible to provide some screenshots and a minimal reproducible example? I don't really understand what you mean by docs being injected.

@kernelpernel
Author

I can't share screenshots because of possible IP conflicts, but I put together a quick example.

If I write this class:

import pandera as pa
# `sc` is a helper module that is not defined in this post.

class ExampleSchema(pa.SchemaModel):
    """Schema to demonstrate doc injection."""

    Column1: sc.Integer = sc.IntegerF()
    Column2: sc.Str = sc.StrF()

I get this output in the Sphinx-generated docs:

class jane_dev.options.utils.doc_testing.ExampleSchema(*args, **kwargs)

   Bases: "pandera.api.pandas.model.DataFrameModel"

   Schema to demonstrate doc injection.

   Check if all columns in a dataframe have a column in the Schema.

   Parameters:
      * **check_obj** (*pd.DataFrame*) -- the dataframe to be
        validated.

      * **head** -- validate the first n rows. Rows overlapping with
        "tail" or "sample" are de-duplicated.

      * **tail** -- validate the last n rows. Rows overlapping with
        "head" or "sample" are de-duplicated.

      * **sample** -- validate a random sample of n rows. Rows
        overlapping with "head" or "tail" are de-duplicated.

      * **random_state** -- random seed for the "sample" argument.

      * **lazy** -- if True, lazily evaluates dataframe against all
        validation checks and raises a "SchemaErrors". Otherwise,
        raise "SchemaError" as soon as one occurs.

      * **inplace** -- if True, applies coercion to the object of
        validation, otherwise creates a copy of the data.

   Returns:
      validated "DataFrame"

   Raises:
      **SchemaError** -- when "DataFrame" violates built-in or custom
      checks.

   Example:
   Calling "schema.validate" returns the dataframe.

   >>> import pandas as pd
   >>> import pandera as pa
   >>>
   >>> df = pd.DataFrame({
   ...     "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
   ...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"]
   ... })
   >>>
   >>> schema_withchecks = pa.DataFrameSchema({
   ...     "probability": pa.Column(
   ...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
   ...
   ...     # check that the "category" column contains a few discrete
   ...     # values, and the majority of the entries are dogs.
   ...     "category": pa.Column(
   ...         str, [
   ...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
   ...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
   ...         ]),
   ... })
   >>>
   >>> schema_withchecks.validate(df)[["probability", "category"]]
      probability category
   0         0.10      dog
   1         0.40      dog
   2         0.52      cat
   3         0.23     duck
   4         0.80      dog
   5         0.76      dog

   Column1: pandera.typing.pandas.Series[pandas.core.arrays.integer.Int64Dtype] = 'Column1'

   Column2: pandera.typing.pandas.Series[str] = 'Column2'

   class Config

      Bases: "pandera.api.pandas.model_config.BaseConfig"

      name: str | None = 'ExampleSchema'

         name of schema

whereas I would expect to see only this:

class jane_dev.options.utils.doc_testing.ExampleSchema(*args, **kwargs)

   Bases: "pandera.api.pandas.model.DataFrameModel"

   Schema to demonstrate doc injection.

   Column1: pandera.typing.pandas.Series[pandas.core.arrays.integer.Int64Dtype] = 'Column1'

   Column2: pandera.typing.pandas.Series[str] = 'Column2'

The injected text appears to be the same as what is shown in the Pandera docs.
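
One Sphinx setting that may be worth checking here (an assumption, since the project's conf.py is not shown in this thread): autodoc's autoclass_content option controls whether a class entry uses only the class docstring ("class"), only the constructor docstring ("init"), or both ("both"). If it is set to "both" or "init", constructor documentation is added to the class entry, which would be consistent with the first output above. A minimal sketch of a conf.py that keeps only the class docstring:

```python
# conf.py -- sketch of a Sphinx configuration. The project's other settings
# are not shown in this thread and are assumed to stay unchanged.
extensions = ["sphinx.ext.autodoc"]

# "class": use only the class docstring.
# "init":  use only the constructor docstring.
# "both":  concatenate the two.
autoclass_content = "class"
```

This has not been confirmed as the cause in this issue; it only narrows down where the extra text could be coming from.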

@kernelpernel
Author

Thanks for the quick response, @cosmicBboy!

@cosmicBboy
Collaborator

It's probably because of the __new__ method: https://github.com/unionai-oss/pandera/blob/main/pandera/api/dataframe/model.py#L127-L132

Can you try overriding that method and see whether it still happens?
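
To illustrate the idea in plain Python (a stand-in sketch, not pandera's actual code; the class names Base, Inherits, and Overrides are hypothetical): a subclass that does not define __new__ exposes the parent's __new__, docstring included, while a subclass that overrides it carries its own docstring instead.

```python
class Base:
    """Base class docstring."""

    def __new__(cls, *args, **kwargs):
        """Long validation docs (stand-in for the inherited text)."""
        return super().__new__(cls)


class Inherits(Base):
    """My schema docs."""


class Overrides(Base):
    """My schema docs."""

    def __new__(cls, *args, **kwargs):
        """Create the schema."""
        return super().__new__(cls, *args, **kwargs)


# Without an override, the subclass exposes the very same function object,
# so any tool that reads __new__.__doc__ sees the parent's text.
print(Inherits.__new__ is Base.__new__)
print(Inherits.__new__.__doc__)

# With an override, the subclass carries its own short docstring.
print(Overrides.__new__.__doc__)
```

Whether this changes the rendered page depends on how autodoc is configured, for example on whether it is set up to include constructor docstrings in class entries, so treat it as something to verify against your own Sphinx setup.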

@cosmicBboy
Collaborator

@kernelpernel any updates on this issue?
