Restart delay not working when agent process exits unexpectedly #27891

Closed
andykellr opened this issue Oct 20, 2023 · 3 comments · Fixed by #32150
Labels
bug (Something isn't working), cmd/opampsupervisor

Comments

andykellr commented Oct 20, 2023

Component(s)

cmd/opampsupervisor

What happened?

Description

When the supervisor fails to start the collector, it logs that it will restart the collector in a bit, but it does not actually wait 5s before restarting. Instead, it logs the same message over and over.

Steps to Reproduce

Send a bad config from an OpAMP server to the supervisor. The collector will fail to start and exit. The supervisor will then be caught in a tight loop, repeatedly logging the "Will restart in a bit..." message.

Expected Result

The supervisor will wait 5s and then attempt to restart the collector.

Actual Result

The supervisor logs this message over and over.

2023-10-18T21:51:28.581-0400	DEBUG	supervisor/supervisor.go:619	Agent process exited unexpectedly. Will restart in a bit...	{"pid": 46996, "exit_code": 1}
2023-10-18T21:51:28.581-0400	DEBUG	supervisor/supervisor.go:619	Agent process exited unexpectedly. Will restart in a bit...	{"pid": 46996, "exit_code": 1}
2023-10-18T21:51:28.581-0400	DEBUG	supervisor/supervisor.go:619	Agent process exited unexpectedly. Will restart in a bit...	{"pid": 46996, "exit_code": 1}
2023-10-18T21:51:28.581-0400	DEBUG	supervisor/supervisor.go:619	Agent process exited unexpectedly. Will restart in a bit...	{"pid": 46996, "exit_code": 1}
2023-10-18T21:51:28.581-0400	DEBUG	supervisor/supervisor.go:619	Agent process exited unexpectedly. Will restart in a bit...	{"pid": 46996, "exit_code": 1}
2023-10-18T21:51:28.581-0400	DEBUG	supervisor/supervisor.go:619	Agent process exited unexpectedly. Will restart in a bit...	{"pid": 46996, "exit_code": 1}
2023-10-18T21:51:28.581-0400	DEBUG	supervisor/supervisor.go:619	Agent process exited unexpectedly. Will restart in a bit...	{"pid": 46996, "exit_code": 1}
2023-10-18T21:51:28.581-0400	DEBUG	supervisor/supervisor.go:619	Agent process exited unexpectedly. Will restart in a bit...	{"pid": 46996, "exit_code": 1}
2023-10-18T21:51:28.581-0400	DEBUG	supervisor/supervisor.go:619	Agent process exited unexpectedly. Will restart in a bit...	{"pid": 46996, "exit_code": 1}
2023-10-18T21:51:28.581-0400	DEBUG	supervisor/supervisor.go:619	Agent process exited unexpectedly. Will restart in a bit...	{"pid": 46996, "exit_code": 1}

Collector version

05ec3a2

Environment information

Environment

OS: m1 Mac
Compiler (if manually compiled): go1.21.3 darwin/arm64

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

Based on a quick review of the code, when the Commander watch() method closes doneCh, the Supervisor gets stuck in case <-s.commander.Done(), because a receive from a closed channel returns immediately every time, so the case keeps firing.

Tested using the OpAMP Agent Extension PR
#16594

andykellr added the bug and needs triage labels Oct 20, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

andykellr changed the title from "Fix restart delay when agent process exits unexpectedly" to "Restart delay not working when agent process exits unexpectedly" Oct 20, 2023
evan-bradley removed the needs triage label Oct 24, 2023
Contributor

github-actions bot commented Mar 4, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label Mar 4, 2024
evan-bradley added a commit that referenced this issue Apr 16, 2024
…pectedly (#32150)

**Description:**

Reset should be called only on stopped or expired timers with drained
channels. If the timer already expired (and the channel was not cleared)
it reads from the timer's channel to clear it.



**Link to tracking Issue:** Fixes
#27891

**Testing:** <Describe what testing was performed and which tests were
added.>

**Documentation:** <Describe the documentation added.>

---------

Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024