
[Fleet]: On agent upgrade failure for first time, review error badge is not displayed #183243

Open
harshitgupta-qasource opened this issue May 13, 2024 · 10 comments
Labels
bug (Fixes for quality problems that affect the customer experience) · impact:medium (Addressing this issue will have a medium level of impact on the quality/strength of our product) · Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

@harshitgupta-qasource

Kibana Build details:

VERSION: 8.14.0 BC4
BUILD: 73836
COMMIT: 23ed1207772b3ae958cb05bc4cdbe39b83507707

Preconditions:

  1. 8.14.0-BC4 Kibana cloud environment should be available.
  2. An 8.13.4 agent should be deployed.
  3. A wrong agent binary should be added.

Steps to reproduce:

  1. Navigate to the Fleet > Agents tab.
  2. Select 2-3 agents by clicking their checkboxes.
  3. Click the Actions button and select the upgrade agents action.
  4. Enter the latest agent version and start the upgrade.
  5. Wait for 10-20 minutes.
  6. Observe that when the agent upgrade fails for the first time, the "Review errors" badge is not displayed.

Expected Result:
When an agent upgrade fails for the first time, the "Review errors" badge should be displayed.

Screenshot:
[image attached to the original issue]

harshitgupta-qasource added the bug, impact:medium, and Team:Fleet labels on May 13, 2024
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@harshitgupta-qasource
Author

@amolnater-qasource Kindly review

@amolnater-qasource

Secondary review for this ticket is Done.

@kpollich
Member

@jillguyonnet - Could you weigh in on this? AFAIU, the "review errors" badge should appear when the polling request detects an error in this case, right?

@jillguyonnet
Contributor

@kpollich That's correct, with the caveat that the polling request only queries the last 35 seconds (this comment details the logic). It would be good to clarify a few details in order to understand this scenario.

  1. The first thing to check should be whether there is an actual error in the agent activity flyout. While I was testing this, I noticed that there wasn't one in all scenarios. In the example below, I produced failed upgrades by manually entering an invalid version; the horde agent's failed upgrade resulted in an action status item with "status":"FAILED" and associated errors, while the agents on Multipass got an action status item with "status":"COMPLETE". Consequently, the "Review errors" badge only showed up for the horde agent. (As a side note, I would like to clarify where this difference is coming from; I'm not sure whether it's expected.)

Action status after failed upgrade for horde agent:

{"actionId":"1345158b-e460-462c-b480-48f691147bce","nbAgentsActionCreated":1,"nbAgentsAck":0,"version":"8.11.22","type":"UPGRADE","nbAgentsActioned":1,"status":"FAILED","expiration":"2024-06-13T16:39:46.846Z","creationTime":"2024-05-14T16:39:46.846Z","nbAgentsFailed":1,"hasRolloutPeriod":false,"completionTime":"0001-01-01T00:00:00.000Z","latestErrors":[{"agentId":"3d485f27-db35-41fb-af80-f4b122a254cc","error":"HTTP Fail","timestamp":"0001-01-01T00:00:00Z","hostname":"eh-Snakerowan-5Nbx"}]}

Action status after failed upgrade for 2 agents on Multipass:

{"actionId":"4bc043d8-026a-4e86-8907-8b4beb9f329a","nbAgentsActionCreated":2,"nbAgentsAck":2,"version":"8.12.9","startTime":"2024-05-14T16:24:16.988Z","type":"UPGRADE","nbAgentsActioned":2,"status":"COMPLETE","expiration":"2024-06-13T16:24:16.988Z","creationTime":"2024-05-14T16:24:30.324Z","nbAgentsFailed":0,"hasRolloutPeriod":false,"completionTime":"2024-05-14T16:39:24.952Z","latestErrors":[]}
Screenshot 2024-05-14 at 18 50 02
  2. If there is an actual error, the next thing to investigate is whether the polling request actually catches it (which would cause the badge to render). As noted above, the polling request fetches the most recent actions from the last 35 seconds; in theory, if the Agents page stays open and is not refreshed, that should happen at some point. It would be great to confirm that the badge doesn't render and then disappear (which could unfortunately be tedious). Otherwise, if the badge never renders, I'm wondering if this might be a case of the action being "older" (i.e. created before the upgrade failed) and only later updated to failed status, which would cause the polling to never catch it (see the sketch below). If that latter scenario is confirmed, then it's definitely a bug.
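
For illustration, here is a minimal TypeScript sketch of the timing concern in point 2, under the assumption (hypothetical, not the actual Fleet implementation) that the poll filters actions by their creation timestamp; the interface, constant, and function names are made up for this sketch.

// Hypothetical illustration of the polling concern described above;
// not the real Fleet code.
interface ActionStatus {
  actionId: string;
  status: 'IN_PROGRESS' | 'COMPLETE' | 'FAILED' | 'CANCELLED';
  creationTime: string; // ISO timestamp
}

const POLL_WINDOW_MS = 35_000; // the ~35 second window mentioned above

// The actions a single poll would "see", if it filters on creation time.
function actionsInPollWindow(actions: ActionStatus[], now: Date): ActionStatus[] {
  return actions.filter(
    (a) => now.getTime() - new Date(a.creationTime).getTime() <= POLL_WINDOW_MS
  );
}

// If an UPGRADE action was created well before it failed and was only later
// updated to FAILED, its creation time is already outside the window by the
// time the failure happens, so a creation-time filter would never surface it
// and the "Review errors" badge would never render.
function shouldShowReviewErrorsBadge(actions: ActionStatus[], now: Date): boolean {
  return actionsInPollWindow(actions, now).some((a) => a.status === 'FAILED');
}

Under that assumption, a failure written to an "old" action more than 35 seconds after its creation would never trigger the badge, which is the scenario described above.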

@cmacknz
Member

cmacknz commented May 14, 2024

In the example below, I made failed upgrades by manually entering an invalid version; the horde agent failed upgrade resulted in an action status item with "status":"FAILED" and associated errors, while the agents on Multipass got an action status item with "status":"COMPLETE". Consequently, the "Review errors" badge only showed up for the horde agent. (As a side note, I would like to clarify where this difference is coming from, I'm not sure whether it's expected.)

The horde implementation has diverged from the agent somehow, but it's not clear just from reading this what the difference might be.

What version did you use when you tested this? Depending on the exact format, it might hit different parts of the agent code. For example, if it looked valid but didn't exist, I'd have expected the agent to attempt to download it and report recurring failures while doing so.
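
To illustrate the distinction being drawn here, a rough TypeScript sketch (conceptual only; the real validation lives in the Go agent code, and the regex and function name below are hypothetical): a version string that looks well-formed would reach the download step and then fail repeatedly with 404s, whereas a malformed one could be rejected before any download is attempted.

// Conceptual sketch only, not the actual elastic-agent logic.
const SEMVER_RE = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$/;

function classifyUpgradeVersion(version: string): 'rejected' | 'download-attempted' {
  // A malformed version string could be rejected before any download starts...
  if (!SEMVER_RE.test(version)) {
    return 'rejected';
  }
  // ...while a well-formed but nonexistent version passes this check, so the
  // agent would go on to attempt the artifact download and report recurring
  // 404 failures, as described above.
  return 'download-attempted';
}

console.log(classifyUpgradeVersion('8.12.x')); // "rejected"
console.log(classifyUpgradeVersion('8.12.9')); // "download-attempted"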

@jillguyonnet
Contributor

The horde implementation has diverged from the agent somehow, but it's not clear just from reading this what the difference might be.

What version did you use when you tested this? Depending on the exact format, it might hit different parts of the agent code. For example, if it looked valid but didn't exist, I'd have expected the agent to attempt to download it and report recurring failures while doing so.

I agree it's not clear from this testing. The version difference is a good point, so I redid a quick test with the following 3 agents. The TL;DR is that horde agents fail fast with a failed request error, probably because they are trying to fetch a nonexistent resource. In contrast, the agent I enrolled manually on a VM did attempt the upgrade.

  1. An agent on a Multipass VM, enrolled on version 8.12.0. I tried an upgrade to 8.12.9: fairly quickly, while the agent's upgrade details had status UPG_DOWNLOADING, there was an error message as expected in the upgrade details metadata. The agent stayed in that state for a few minutes before the upgrade became stuck in a failed state.

Shortly after starting the upgrade:
Screenshot 2024-05-15 at 10 11 38

Agent details page:
Screenshot 2024-05-15 at 10 11 54

After a few minutes, Fleet status is back to healthy:
Screenshot 2024-05-15 at 10 25 50

After a few more minutes, the upgrade stops retrying and a warning message is shown:
Screenshot 2024-05-15 at 10 31 12

Agent details page:
Screenshot 2024-05-15 at 10 31 24

Agent JSON:

{
"id": "d4666fb3-eb14-43ac-bd23-689b06ae0b60",
"type": "PERMANENT",
"active": true,
"enrolled_at": "2024-05-15T08:08:37Z",
"upgraded_at": "2024-05-15T08:25:26Z",
"upgrade_started_at": null,
"upgrade_details": {
 "metadata": {
   "retry_error_msg": "unable to download package: 2 errors occurred:\n\t* package '/opt/Elastic/Agent/data/elastic-agent-5cbf2e/downloads/elastic-agent-8.12.9-linux-arm64.tar.gz' not found: open /opt/Elastic/Agent/data/elastic-agent-5cbf2e/downloads/elastic-agent-8.12.9-linux-arm64.tar.gz: no such file or directory\n\t* call to 'https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.12.9-linux-arm64.tar.gz' returned unsuccessful status code: 404\n\n",
   "retry_until": "2024-05-15T12:10:48.014713177+02:00",
   "error_msg": "failed download of agent binary: unable to download package: 2 errors occurred:\n\t* package '/opt/Elastic/Agent/data/elastic-agent-5cbf2e/downloads/elastic-agent-8.12.9-linux-arm64.tar.gz' not found: open /opt/Elastic/Agent/data/elastic-agent-5cbf2e/downloads/elastic-agent-8.12.9-linux-arm64.tar.gz: no such file or directory\n\t* call to 'https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.12.9-linux-arm64.tar.gz' returned unsuccessful status code: 404\n\n",
   "failed_state": "UPG_DOWNLOADING"
 },
 "action_id": "6fd4c70e-0fe7-40e5-995c-66b89877b5bd",
 "state": "UPG_FAILED",
 "target_version": "8.12.9"
},
"access_api_key_id": "znRLe48BFQaBON2J0pQx",
"policy_id": "e9b7752e-2527-4759-a11d-01220a89fcec",
"last_checkin": "2024-05-15T08:38:58Z",
"last_checkin_status": "online",
"last_checkin_message": "Running",
"policy_revision": 1,
"packages": [],
"sort": [
 1715760517000
],
"outputs": {
 "default": {
   "api_key_id": "0HRLe48BFQaBON2J3ZSu",
   "type": "elasticsearch"
 }
},
"components": [
 {
   "id": "log-default",
   "type": "log",
   "status": "HEALTHY",
   "message": "Healthy: communicating with pid '2232'",
   "units": [
     {
       "id": "log-default-logfile-system-723bd4a9-11af-4eef-bb5e-06d03c84f17b",
       "type": "input",
       "status": "HEALTHY",
       "message": "Healthy"
     },
     {
       "id": "log-default",
       "type": "output",
       "status": "HEALTHY",
       "message": "Healthy"
     }
   ]
 },
 {
   "id": "system/metrics-default",
   "type": "system/metrics",
   "status": "HEALTHY",
   "message": "Healthy: communicating with pid '2237'",
   "units": [
     {
       "id": "system/metrics-default-system/metrics-system-723bd4a9-11af-4eef-bb5e-06d03c84f17b",
       "type": "input",
       "status": "HEALTHY",
       "message": "Healthy"
     },
     {
       "id": "system/metrics-default",
       "type": "output",
       "status": "HEALTHY",
       "message": "Healthy"
     }
   ]
 },
 {
   "id": "filestream-monitoring",
   "type": "filestream",
   "status": "HEALTHY",
   "message": "Healthy: communicating with pid '2242'",
   "units": [
     {
       "id": "filestream-monitoring-filestream-monitoring-agent",
       "type": "input",
       "status": "HEALTHY",
       "message": "Healthy"
     },
     {
       "id": "filestream-monitoring",
       "type": "output",
       "status": "HEALTHY",
       "message": "Healthy"
     }
   ]
 },
 {
   "id": "beat/metrics-monitoring",
   "type": "beat/metrics",
   "status": "HEALTHY",
   "message": "Healthy: communicating with pid '2249'",
   "units": [
     {
       "id": "beat/metrics-monitoring-metrics-monitoring-beats",
       "type": "input",
       "status": "HEALTHY",
       "message": "Healthy"
     },
     {
       "id": "beat/metrics-monitoring",
       "type": "output",
       "status": "HEALTHY",
       "message": "Healthy"
     }
   ]
 },
 {
   "id": "http/metrics-monitoring",
   "type": "http/metrics",
   "status": "HEALTHY",
   "message": "Healthy: communicating with pid '2256'",
   "units": [
     {
       "id": "http/metrics-monitoring-metrics-monitoring-agent",
       "type": "input",
       "status": "HEALTHY",
       "message": "Healthy"
     },
     {
       "id": "http/metrics-monitoring",
       "type": "output",
       "status": "HEALTHY",
       "message": "Healthy"
     }
   ]
 }
],
"agent": {
 "id": "d4666fb3-eb14-43ac-bd23-689b06ae0b60",
 "version": "8.12.0"
},
"local_metadata": {
 "elastic": {
   "agent": {
     "build.original": "8.12.0 (build: 5cbf2e403c761f91d11eca6b9cb5385e0f07f2ce at 2024-01-11 13:25:49 +0000 UTC)",
     "complete": false,
     "id": "d4666fb3-eb14-43ac-bd23-689b06ae0b60",
     "log_level": "info",
     "snapshot": false,
     "upgradeable": true,
     "version": "8.12.0"
   }
 },
 "host": {
   "architecture": "aarch64",
   "hostname": "agent1",
   "id": "1b252dddb2544378813a2756173ad9ab",
   "ip": [
     "127.0.0.1/8",
     "::1/128",
     "192.168.82.10/24",
     "fdf3:d299:5a7d:9ea6:5054:ff:fe1a:b5b9/64",
     "fe80::5054:ff:fe1a:b5b9/64"
   ],
   "mac": [
     "52:54:00:1a:b5:b9"
   ],
   "name": "agent1"
 },
 "os": {
   "family": "debian",
   "full": "Ubuntu noble(24.04 LTS (Noble Numbat))",
   "kernel": "6.8.0-31-generic",
   "name": "Ubuntu",
   "platform": "ubuntu",
   "version": "24.04 LTS (Noble Numbat)"
 }
},
"unhealthy_reason": null,
"status": "online",
"metrics": {
 "cpu_avg": 0.01083,
 "memory_size_byte_avg": 139992856
}
}
  2. A horde agent enrolled on version 8.6.0 (the default), as in my previous test. The upgrade to a nonexistent version quickly failed with "HTTP Fail". I did not see the agent go to the Updating status.

Immediately after trying to upgrade to 8.6.9:
Screenshot 2024-05-15 at 10 12 48

Agent details page:
Screenshot 2024-05-15 at 10 13 00

Agent activity with error:
Screenshot 2024-05-15 at 10 13 35

  3. Another horde agent enrolled on version 8.12.0. The upgrade to a nonexistent version failed in the same way as the 8.6.0 horde agent.

@cmacknz
Member

cmacknz commented May 15, 2024

For the real agent, that is what I expected to see. It will retry the download until the download timeout expires; by default this is two hours. After that it should report the upgrade as failed.
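
As a rough model of that behaviour (illustrative TypeScript only; the agent is written in Go, and the names below are hypothetical, not the real implementation):

// Illustrative model of "retry the download until the timeout expires, then
// report the upgrade as failed"; not the actual elastic-agent code.
const DOWNLOAD_TIMEOUT_MS = 2 * 60 * 60 * 1000; // default timeout of two hours

async function downloadWithRetry(
  tryDownload: () => Promise<void>, // hypothetical single download attempt
  retryIntervalMs = 60_000
): Promise<'succeeded' | 'failed'> {
  const deadline = Date.now() + DOWNLOAD_TIMEOUT_MS;
  while (Date.now() < deadline) {
    try {
      await tryDownload();
      return 'succeeded';
    } catch {
      // e.g. the 404 retry_error_msg shown in the agent JSON above;
      // keep retrying until the deadline passes.
      await new Promise((resolve) => setTimeout(resolve, retryIntervalMs));
    }
  }
  // Once the timeout expires, the upgrade is reported as failed (UPG_FAILED).
  return 'failed';
}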

@jillguyonnet
Contributor

@cmacknz Can we configure the download timeout? It would make testing this a lot easier.

@cmacknz
Member

cmacknz commented May 17, 2024

I think the agent didn't respect it when it was sent via the Fleet override API, but it's been a while since I tested this: elastic/elastic-agent#4580
