Support for variable referencing #49

Open
alexanderswerdlow opened this issue May 4, 2023 · 6 comments


@alexanderswerdlow

I've been looking for an alternative to Hydra for config management, specifically one that allows for defining configs in Python, and I stumbled across Tyro which seems like a great library for my use case after some experimentation.

However, one thing that doesn't appear to be possible is referencing a single variable from multiple places in a nested config. As for why this might be needed, it is very common in an ML codebase to require the same parameter in many different places. For example, the number of classification classes might be used in the model construction, visualization, etc.

We might want this value to be dependent on a config group such as the dataset (i.e. each dataset might have a different number of classes). Instead of manually defining each combination of model + dataset, it would be a lot easier to have the model parameters simply reference the dataset parameter, or have them both reference some top-level variable. With Hydra, there is value interpolation that does this.

Since we can define Tyro configs directly in Python, it seems like this could be made much more powerful with support for arbitrary expressions allowing small pieces of logic to be defined in a configuration (e.g., for a specific top-level config we can have a model parameter be 4 * num_classes). Clearly, we could simply make the 4 into a new parameter but there are good reasons we might want it in the config instead.

From what I can tell, this type of variable referencing, even without any expressions, is not currently possible with Tyro.

@brentyi
Owner

brentyi commented May 4, 2023

Thanks @alexanderswerdlow!

Just to concretize things a bit: could you suggest some syntax for what you're describing here?

My weak prior on this is that these sorts of "driven" config parameters result in unnecessary complexity, and there are often simple workarounds like defining an @property that computes and returns 4 * num_classes, or conditioning your model instantiation on the dataset config. But with a concrete example I could be convinced.
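
As a rough sketch of the @property workaround, using only the standard library: the field name hidden_dim is made up for illustration and is not part of Tyro or of any code in this thread.

```python
import dataclasses


@dataclasses.dataclass
class ModelConfig:
    num_classes: int

    @property
    def hidden_dim(self) -> int:
        # Hypothetical derived value: recomputed on every access, so it
        # always tracks the current num_classes.
        return 4 * self.num_classes


config = ModelConfig(num_classes=10)
print(config.hidden_dim)  # 40
```

Because hidden_dim is a property rather than a dataclass field, it does not show up in dataclasses.fields(ModelConfig), so it would not be exposed as a separate setting.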

@alexanderswerdlow
Author

Of course! I think I should clarify that my initial comment mentions two related but distinct features. The first is just plain variable referencing.

For example, I've recently been working with an architecture that can turn some input image into a set of latent vectors or "slots." This is a hyperparameter (num_slots) to the model (e.g., nn.Module) but I also need this parameter to be passed to some separate visualization code that visualizes the effect of each slot. This hyperparameter is dataset dependent, so not only would it be silly to have to define this multiple times (for each module that uses it), but it makes configuring an experiment more difficult. Now, instead of having model configurations and dataset configurations that I can mix-and-match, I have to create each combination manually.

Conditioning the model init on the dataset config does work, although there are a couple of issues with that:

  1. It's cumbersome to do if you instantiate most or all of your objects directly from your config. You could get around this by passing a config object in a later function call and initializing there, but then you're defeating the purpose of instantiating from config.

Hydra's interpolation works for objects as well as primitives so in the example below num_slots could be an entire object (e.g., dataset obj/config) that gets passed to the model. A lot of my code currently relies on this, with a single config dataclass that gets passed around.

  2. Even with this strategy, there are limitations. Namely, it encourages consolidating to a single large config object. If two places in code reference a single variable that you later want to separate, you now have to refactor this. With variable referencing, you just replace the reference with a different default value.

Furthermore, if you do go with the alternative and pass several individual objects, things become difficult when the dependency graph is messy. The model/viz is often dependent on the dataset, the viz is dependent on the model, and some parts of the model are dependent on others. Keeping this modular for experimentation requires these pieces to reference each other.

As for syntax, I'd likely need to spend more time thinking about it but my current hydra config looks something like this, using the interpolation syntax:

num_slots: 10
dataset:
    num_slots: ${num_slots}
model:
    encoder_stage:
        output_dim: 256
    decoder_stage:
        input_dim: ${model.encoder_stage.output_dim}
        num_slots: ${num_slots}
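
For readers unfamiliar with interpolation, here is a toy sketch of how `${...}` references like these could be resolved against a single top-level namespace. It is not Hydra's or OmegaConf's implementation: lookup and resolve are hypothetical helpers, only values consisting entirely of one absolute reference are handled, and cyclic references are not detected.

```python
import re
from typing import Any, Dict

# Matches a value that is exactly one reference, e.g. "${model.encoder_stage.output_dim}".
_REF = re.compile(r"\$\{([^}]+)\}")


def lookup(root: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a dotted key such as 'model.encoder_stage.output_dim' from the root."""
    node: Any = root
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(node: Any, root: Dict[str, Any]) -> Any:
    """Return a copy of node with every whole-value reference replaced by its target."""
    if isinstance(node, dict):
        return {key: resolve(value, root) for key, value in node.items()}
    if isinstance(node, str):
        match = _REF.fullmatch(node)
        if match:
            # Resolve the target as well, so chained references work.
            return resolve(lookup(root, match.group(1)), root)
    return node


raw = {
    "num_slots": 10,
    "dataset": {"num_slots": "${num_slots}"},
    "model": {
        "encoder_stage": {"output_dim": 256},
        "decoder_stage": {
            "input_dim": "${model.encoder_stage.output_dim}",
            "num_slots": "${num_slots}",
        },
    },
}
resolved = resolve(raw, raw)
print(resolved["model"]["decoder_stage"])  # {'input_dim': 256, 'num_slots': 10}
```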

Here is an example in Tyro (for brevity, not one-to-one with the example above):

@dataclasses.dataclass
class DatasetConfig:
    dataset: Dataset
    num_slots: int
    
@dataclasses.dataclass
class ExperimentConfig:
    model: nn.Module
    dataset: DatasetConfig

main_config = tyro.extras.subcommand_type_from_defaults({
    "small": ExperimentConfig(
            dataset=ExampleDatasetConfig,
            model=ExampleModelConfig,
        ),
})

ExampleDatasetConfig = Annotated[
    DatasetConfig,
    tyro.conf.subcommand(
        name="dataset_b",
        default=DatasetConfig(
            num_slots=8,
            dataset=DatasetB(),
        ),
    ),
]

ExampleModelConfig = Annotated[
    ClassificationModule, # name of an nn.Module
    tyro.conf.subcommand(
        name="dataset_a",
        default=ClassificationModule(
            num_slots=ExperimentConfig.dataset.num_slots, # Somehow reference the encapsulating container
        ),
    ),
]

Now obviously this wouldn't work exactly as-is because there is a circular dependency in definitions here. In my Tyro example above, I go up (so to speak) to experiment config and then back down to dataset, but a single overarching namespace could be simpler to implement (e.g., referencing only works for a set of pre-defined keys, not from any arbitrary container).

The second feature is much smaller in both impact and difficulty: the ability to apply expressions to referenced variables. You are absolutely right that declaring an @property could work, and if Tyro supports passing functions (e.g., a lambda defined in code), you could achieve the same thing.

However, say you have an input image that is downscaled by n (e.g., n=4) and then a separate module (e.g., visualization code) needs to know that downscaled size during initialization. In this case, it'd be a lot cleaner to have as input image_size / n as opposed to passing both of those into the visualization code. The desire for these sorts of expressions comes up naturally in a lot of data pipelines.
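
To make the downscaling case concrete, here is a minimal standard-library sketch in which the visualization config only ever sees the derived size. PipelineConfig, VisualizerConfig, and their field names are hypothetical, and this only shows the plain dataclass mechanics, not how Tyro would expose these fields on the command line.

```python
import dataclasses


@dataclasses.dataclass
class VisualizerConfig:
    # The visualizer only needs the already-downscaled size.
    feature_map_size: int


@dataclasses.dataclass
class PipelineConfig:
    image_size: int = 224
    downscale_factor: int = 4
    # Not an __init__ argument: always derived from the two fields above.
    visualizer: VisualizerConfig = dataclasses.field(init=False)

    def __post_init__(self) -> None:
        self.visualizer = VisualizerConfig(
            feature_map_size=self.image_size // self.downscale_factor
        )


config = PipelineConfig(image_size=128, downscale_factor=4)
print(config.visualizer.feature_map_size)  # 32
```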

Hope that makes sense and I'm happy to explain further! Also totally understand if this is out of scope.

@brentyi
Owner

brentyi commented May 5, 2023

Yes, that makes sense!

For variable references, I'm curious about your thoughts on a few options.

One is adapting __post_init__:

import dataclasses

import tyro


@dataclasses.dataclass
class ModelConfig:
    num_slots: int = -1


@dataclasses.dataclass
class TrainConfig:
    num_slots: int
    model: ModelConfig

    def __post_init__(self) -> None:
        if self.model.num_slots == -1:
            self.model.num_slots = self.num_slots


print(tyro.cli(TrainConfig))

In this case --num-slots 3 would set num_slots for both the parent TrainConfig and the inner ModelConfig.

Of course you can add more complex logic in your __post_init__, so this might fulfill your second feature request too.
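
A sketch of what that more complex logic might look like, built on plain dataclasses without calling tyro.cli. Two things here are assumptions rather than anything from the snippet above: None is swapped in for the -1 sentinel, and hidden_dim is a hypothetical field added to show a derived expression.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class ModelConfig:
    # None means "not set explicitly; fill in from the parent config".
    num_slots: Optional[int] = None
    hidden_dim: Optional[int] = None


@dataclasses.dataclass
class TrainConfig:
    num_slots: int
    model: ModelConfig = dataclasses.field(default_factory=ModelConfig)

    def __post_init__(self) -> None:
        # Plain reference: inherit the parent's value unless overridden.
        if self.model.num_slots is None:
            self.model.num_slots = self.num_slots
        # Expression on a reference: derived unless overridden.
        if self.model.hidden_dim is None:
            self.model.hidden_dim = 4 * self.model.num_slots


inherited = TrainConfig(num_slots=3)
overridden = TrainConfig(num_slots=3, model=ModelConfig(num_slots=5))
print(inherited.model)   # ModelConfig(num_slots=3, hidden_dim=12)
print(overridden.model)  # ModelConfig(num_slots=5, hidden_dim=20)
```

An explicit value on the inner config always wins, so the reference behaves like a default rather than a hard constraint.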

A potential downside of this is that you won't be able to use an nn.Module directly in your config object as your snippet hints at; config objects need to be mutated after they're instantiated. IMO this is an OK tradeoff since directly dropping in the module has its own drawbacks, like difficulty of serialization.

An alternative option that might be used to circumvent this downside — it's a bit hacky and I wouldn't recommend it, but should work and is in the tests for the 0.5.0 release — is to map both the model's num_slots and the train config's num_slots to --num-slots. To do this we can just omit the prefix from --model.num-slots:

import dataclasses

from typing import Annotated
import tyro


@dataclasses.dataclass
class ModelConfig:
    num_slots: Annotated[int, tyro.conf.arg(prefix_name=False)]


@dataclasses.dataclass
class TrainConfig:
    num_slots: int
    model: ModelConfig

    # edit: next few lines were unintentionally included
    # def __post_init__(self) -> None:
    #     if self.model.num_slots == -1:
    #         self.model.num_slots = self.num_slots


print(tyro.cli(TrainConfig))

Again, --num-slots 3 would set num_slots for both the parent TrainConfig and the inner ModelConfig.

@alexanderswerdlow
Author

Sorry for the delay, and clever idea!

Before I go on, I assume for the 2nd example, you didn't intend to include the __post_init__. I ran it myself and it works just with the annotation which makes sense.

Some thoughts:

  1. The two options seem to trade off customizability for convenience.

The first option allows arbitrary configuration with expressions but the syntax is a little unwieldy. The second option on the other hand (from what I can tell) essentially gives you a single global namespace (simply without a prefix) to perform referencing.

Most use cases are probably fine with a global namespace but I think a core issue remains (for my use case at least).

  2. The bigger issue I see here (for my use case at least) is that either approach couples the configuration interface with the config itself, making hierarchical and modular configuration difficult.

In other words, say I have an MLP class (dataclass config or actual class); I might want different experiments to use that same MLP in different ways (likely multiple times within the same experiment). That rules out the 2nd approach, but even the 1st approach is difficult. From what I can tell, the user would need to make two distinct higher-level configs (to allow for a different __post_init__).

Now I certainly see that this might not be an issue for many and this approach might make a lot of sense for them! I happen to need things particularly modular for experimentation, which is also why I gravitate towards instantiating things directly. Doing so removes an intermediate step that needs to be constantly updated.

@brentyi
Owner

brentyi commented May 9, 2023

Thanks for clarifying! Two followup questions:

> The bigger issue I see here (for my use case at least) is that either approach couples the configuration interface with the config itself, making hierarchical and modular configuration difficult.

(1) So to re-state: it would be nice to be able to define a config schema via dataclasses, and then define relationships between values in it when you instantiate configs?

> I happen to need things particularly modular for experimentation, which is also why I gravitate towards instantiating things directly. Doing so removes an intermediate step that needs to be constantly updated.

(2) I'm not totally following what "instantiating things directly" is referring to. Is this referencing the __post_init__() as an intermediate step?


To try and resolve (1), what about creating some subcommands? When you instantiate each subcommand the default values for each field can be computed from whatever logic you want.

import dataclasses
from typing import Dict

import tyro


@dataclasses.dataclass
class ModelConfig:
    num_slots: int


@dataclasses.dataclass
class TrainConfig:
    num_slots: int
    model: ModelConfig


subcommands: Dict[str, TrainConfig] = {}

# First experiment.
subcommands["exp1"] = TrainConfig(
    num_slots=2,
    model=ModelConfig(num_slots=2),
)

# Second experiment.
num_slots = 4
subcommands["exp2"] = TrainConfig(
    num_slots=num_slots,
    model=ModelConfig(num_slots=num_slots * 2),
)

config = tyro.cli(
    tyro.extras.subcommand_type_from_defaults(subcommands)
)
print(config)

Of course, since everything is Python, you can also generate this dictionary programmatically. Perhaps the downside here is that python example.py exp2 --num-slots N now won't also update model.num_slots? Is that a dealbreaker?
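
A minimal sketch of generating such a dictionary in a loop, using only the standard library. The make_config helper, its model_multiplier parameter, and the subcommand naming scheme are hypothetical; the resulting dictionary has the same shape as the hand-written one that is passed to tyro.extras.subcommand_type_from_defaults above.

```python
import dataclasses
from typing import Dict


@dataclasses.dataclass
class ModelConfig:
    num_slots: int


@dataclasses.dataclass
class TrainConfig:
    num_slots: int
    model: ModelConfig


def make_config(num_slots: int, model_multiplier: int = 1) -> TrainConfig:
    # The relationship between the two fields lives here, at instantiation
    # time, rather than in the dataclass definitions.
    return TrainConfig(
        num_slots=num_slots,
        model=ModelConfig(num_slots=num_slots * model_multiplier),
    )


subcommands: Dict[str, TrainConfig] = {
    f"slots{n}_x{m}": make_config(n, m)
    for n in (2, 4)
    for m in (1, 2)
}
print(sorted(subcommands))
# ['slots2_x1', 'slots2_x2', 'slots4_x1', 'slots4_x2']
```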

In general I think there's still a disconnect where I don't fully follow what limitation makes modularity/hierarchy harder than in Hydra. When I read the specializing configs docs in Hydra nothing stands out to me — both the ${dataset}_${model} pattern and the CIFAR vs ImageNet num_layers default seem easy enough to replicate in Python. If you have any links to examples in-the-wild (your own or from others) that you'd want to replicate it might be helpful for my understanding.
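
For reference, this is roughly what replicating that pattern in plain Python could look like. All names and numbers below (the dataset and model entries, the layer counts, the OVERRIDES table, the build helper) are illustrative assumptions, not values taken from the Hydra docs or from Tyro.

```python
import dataclasses
import itertools
from typing import Dict, Tuple


@dataclasses.dataclass(frozen=True)
class DatasetConfig:
    name: str
    num_classes: int


@dataclasses.dataclass(frozen=True)
class ModelConfig:
    name: str
    num_layers: int


@dataclasses.dataclass(frozen=True)
class ExperimentConfig:
    dataset: DatasetConfig
    model: ModelConfig


DATASETS = {
    "cifar10": DatasetConfig("cifar10", num_classes=10),
    "imagenet": DatasetConfig("imagenet", num_classes=1000),
}
MODELS = {
    "alexnet": ModelConfig("alexnet", num_layers=7),
    "resnet": ModelConfig("resnet", num_layers=50),
}

# Specializations keyed by (dataset, model); anything not listed keeps the
# model's own defaults.
OVERRIDES: Dict[Tuple[str, str], Dict[str, int]] = {
    ("cifar10", "resnet"): {"num_layers": 5},
}


def build(dataset_name: str, model_name: str) -> ExperimentConfig:
    overrides = OVERRIDES.get((dataset_name, model_name), {})
    model = dataclasses.replace(MODELS[model_name], **overrides)
    return ExperimentConfig(dataset=DATASETS[dataset_name], model=model)


experiments = {
    f"{d}_{m}": build(d, m) for d, m in itertools.product(DATASETS, MODELS)
}
print(experiments["cifar10_resnet"].model.num_layers)   # 5
print(experiments["imagenet_resnet"].model.num_layers)  # 50
```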

@brentyi
Owner

brentyi commented May 9, 2023

As an FYI, I'm also going to raise an error in this case:

import dataclasses

from typing import Annotated
import tyro


@dataclasses.dataclass
class ModelConfig:
    num_slots: Annotated[int, tyro.conf.arg(prefix_name=False)]


@dataclasses.dataclass
class TrainConfig:
    num_slots: int
    model: ModelConfig

(just feels too hacky)
