Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify the collector configuration in an external ConfigMap #2871

Open
matthagenbuch opened this issue Apr 17, 2024 · 8 comments · May be fixed by #2946
Open

Specify the collector configuration in an external ConfigMap #2871

matthagenbuch opened this issue Apr 17, 2024 · 8 comments · May be fixed by #2946
Labels
area:collector Issues for deploying collector enhancement New feature or request needs triage

Comments

@matthagenbuch
Copy link

matthagenbuch commented Apr 17, 2024

Component(s)

collector

Is your feature request related to a problem? Please describe.

When changing the configuration of a collector, I have no option for a controlled rollout / rollback process in the case of an error. Once the new configuration is applied to my OpenTelemetryCollector resource, the old configuration is overwritten and all running collector pods now see the new configuration.

This makes it difficult to control the rollout process, as any pod that restarts for any reason will load with the new configuration. It also makes it impossible to roll back to the previous configuration without again updating the OpenTelemetryCollector resource.

Describe the solution you'd like

For other services, we "version" the configuration in ConfigMap resources. Each time a configuration is updated, a new ConfigMap with a unique name is created and the deployment/statefulset is updated to reference it. This ensures that running pods see no change in configuration, and rolling back can be done by an external tool such as Argo Rollouts since the old configuration is still on the cluster.

Ideally, I'd like to be able to specify the OpenTelemetryCollector configuration by passing in a ConfigMap, which I manage myself. This would allow me to use the same versioning / rollback system as for other services I own.

Describe alternatives you've considered

The OpenTelemetry Operator could automatically create and manage versioned ConfigMaps for the collector configuration. This might be a useful OOTB feature for some users, but in my case I would prefer to manage ConfigMaps external to the operator in order to guarantee consistency with my other services.

Additional context

There is a similar open issue for specifying multiple configurations to be merged. The solutions proposed there which involve specifying an external ConfigMap would also resolve this feature request.

Currently, the operator does have a configmaps field which could be used to mount the collector configuration. However, the operator explicitly doesn't support overriding the --config arg, so we can't use these ConfigMaps for collector configuration.

@matthagenbuch matthagenbuch added enhancement New feature or request needs triage labels Apr 17, 2024
@swiatekm-sumo
Copy link
Contributor

I don't see any obvious reason not to do this. Part of me does want to go the versioned configmap route, as our plans for the operator's future go more in the direction of providing higher-level abstractions on top of the raw collector config. Letting users use an external ConfigMap solves this specific problem, but will also potentially lock them out of distributed configuration and other similar features. But these solutions aren't mutually exclusive. If you'd like to submit a PR to add external ConfigMap support, I'd be willing to accept it. WDYT @jaronoff97 @pavolloffay ?

@jaronoff97
Copy link
Contributor

I think adding support for allowing the overriding of the config makes sense. I also think this probably gets easier in a world with distributed configuration, though I don't think we need to wait for that world.

@CarlosLanderas
Copy link

CarlosLanderas commented May 17, 2024

@matthagenbuch Thanks for this contribution, I think it's a total must.

Using a unique configmap can have serious consequences on previous working versions once a faulty config is deployed when mutating the Otel Col.

We found out, using the default maxSurge of maxUnavailable of 25%, the amount of pods replaced with failed configuration causes the working pods to have increated pressure, eventually going out of resources and restarting adopting the new faulty configuration.

This can cause a nice fallout and wipe all the working collectors.

I was specifically looking for this and found this issue.

@pavolloffay pavolloffay added the area:collector Issues for deploying collector label May 17, 2024
@matthagenbuch
Copy link
Author

We found out, using the default maxSurge of maxUnavailable of 25%, the amount of pods replaced with failed configuration causes the working pods to have increated pressure, eventually going out of resources and restarting adopting the new faulty configuration.

@CarlosLanderas Yes this is a particularly nasty failure mode in the current implementation. I'm working on a PR to add support for multiple versioned ConfigMaps, which should prevent such failures from propagating. When investigating, I found that specifying an external ConfigMap (as mentioned in this issue) would break other operator functionality which relies on the collector config at the time of provisioning.

@CarlosLanderas
Copy link

CarlosLanderas commented May 20, 2024

Yeah @matthagenbuch, saw the PR that's why I left the comment :).

Thanks for the contribution!

PD: Is there any chance to overcome the limitation you mention?

When running the collector in prod at scale this is very problematic.

For now, we are gonna rely on k8s integration tests before deploying a new version config.

@jaronoff97
Copy link
Contributor

@CarlosLanderas would something like this issue be helpful?

@CarlosLanderas
Copy link

CarlosLanderas commented May 21, 2024

@CarlosLanderas would something like this issue be helpful?

@jaronoff97, does this mean the operator will run config validations before patching the configmap and deployment/daemonset?

In that case, it will definitely help :)

@pavolloffay
Copy link
Member

Unfortunately the operator does not validate collector config.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:collector Issues for deploying collector enhancement New feature or request needs triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants