Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep observation data valid for 60 min and retry with no data for nws #117109

Merged
merged 9 commits into from
May 22, 2024

Conversation

MatthewFlamm
Copy link
Contributor

@MatthewFlamm MatthewFlamm commented May 8, 2024

Proposed change

Retry when no data is returned during an update. Use a custom DataUpdateCoordinator to mark data as unavailable after 60 minutes for observations.

Unlike recent PRs for nws, this one should go in a new minor release.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes #
  • This PR is related to issue:
  • Link to documentation pull request:

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Ruff (ruff format homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • Untested files have been added to .coveragerc.

To help with the load of incoming pull requests:

@home-assistant
Copy link

home-assistant bot commented May 8, 2024

Hey there @kamiyo, mind taking a look at this pull request as it has been labeled with an integration (nws) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of nws can trigger bot actions by commenting:

  • @home-assistant close Closes the pull request.
  • @home-assistant rename Awesome new title Renames the pull request.
  • @home-assistant reopen Reopen the pull request.
  • @home-assistant unassign nws Removes the current integration label and assignees on the pull request, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the pull request.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the pull request.

@MatthewFlamm MatthewFlamm changed the title NWS: retry when no data is returned in update Retry when no data is returned in update for nws May 8, 2024
@bdraco
Copy link
Member

bdraco commented May 8, 2024

Just to be certain, the intent is to NOT tag this for a patch release?

@MatthewFlamm
Copy link
Contributor Author

MatthewFlamm commented May 8, 2024

I mean it could, but this one is quality of life upgrade only. Which is why I intended it to do in a new monthly release. If you feel it can go in a patch, Im okay with that too.

Edit: the others were definitely fixes to existing issues and were more critical. This one just adds some nice to have feature.

@bdraco
Copy link
Member

bdraco commented May 8, 2024

I'm seeing unavailable after updating, but it may be because I've restarted so many times and I'm now rate limted

Screenshot 2024-05-08 at 3 08 49 PM Screenshot 2024-05-08 at 3 09 23 PM

Will check it again later tonight

@MatthewFlamm
Copy link
Contributor Author

Getting these myself, which shows the usefulness of this addition. I think I need to make this draft for now anyway as the errors could be handled better in the coordinator:

[traceback shortened]
 return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/python/lib/python3.12/concurrent/futures/_base.py", li
ne 401, in __get_result
    raise self._exception
  File "/mnt/data/home-assistant/venv/lib/python3.12/site-packages/
tenacity/_asyncio.py", line 50, in __call__
    result = await fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/home-assistant/venv/lib/python3.12/site-packages/
pynws/simple_nws.py", line 262, in update_observation
    raise NwsNoDataError("Observation received with no data.")
pynws.nws.NwsNoDataError: Observation received with no data.

@MatthewFlamm MatthewFlamm marked this pull request as draft May 8, 2024 20:23
@bdraco
Copy link
Member

bdraco commented May 8, 2024

It did recover after a few minutes. The missing data for PHOG is not a regression.

Screenshot 2024-05-08 at 3 38 14 PM Screenshot 2024-05-08 at 3 38 19 PM

@MatthewFlamm
Copy link
Contributor Author

I see that is common to catch custom errors and wrap with UpdateFailed. Ironically, the most straightforward way to put this into nws integration is to get rid of the functools.partial usage.

@MatthewFlamm
Copy link
Contributor Author

It did recover after a few minutes. The missing data for PHOG is not a regression.

Now that I have a potential addition for catching the error better, the server goes back up!

@MatthewFlamm
Copy link
Contributor Author

Error in log is much better now:

(MainThread) [homeassistant.component
s.nws] Error fetching NWS observation station PAFM data: No data re
turned.

@MatthewFlamm MatthewFlamm marked this pull request as ready for review May 8, 2024 23:48
@bdraco
Copy link
Member

bdraco commented May 9, 2024

After restart. Hourly is there, but all other data is unavailable

Will check it again in a bit
Screenshot 2024-05-09 at 7 43 18 AM
Screenshot 2024-05-09 at 7 43 24 AM

@MatthewFlamm
Copy link
Contributor Author

MatthewFlamm commented May 9, 2024

Looks okay to me on my side...

@bdraco
Copy link
Member

bdraco commented May 11, 2024

I've been running it for a few days now and getting drop outs. Checkout the logbook on the right of the below screenshots. Let me know if any logging I can enable would be helpful

Screenshot 2024-05-12 at 8 25 56 AM

@MatthewFlamm
Copy link
Contributor Author

Hmmm. I did find a bug in pynws that would interact with this PR, but it is only for the forecasts and not the observations that you are showing. The bug only means that this PR would cause the caching of the forecast data to get overwritten when no data is returned by NWS. This would result in behavior similar to dev branch. I will fix this and bump here as we are already bumping it here. Will mark as draft.

This looks like NWS server flakiness to me, which tends to fluctuate over time. Some time periods it is solid, some time periods there are frequent drop outs. The only way to find out with the current logging is to monitor the API calls and data. Set pynws logger to debug and you will get all API calls and data returned. It is a lot of data you will log. I would look for 500 errors in the observation coordinator refresh, usually 502, and when you get [] returned (this PR catches this for restarts).

If these drop outs are for no data, then in the current dev branch, you would have had unknown instead of unavailable.

@MatthewFlamm
Copy link
Contributor Author

pynws version bumped again to fix forecasts caching particularly with retries. Added PR release notes to description. This won't affect the lack of data in observations as noted above.

@bdraco
Copy link
Member

bdraco commented May 16, 2024

I noticed the API worked in Chrome, but not via the lib.

The no data problem seems to have gone away if I change the headers
MatthewFlamm/pynws#201

Maybe I'm just getting lucky though...

@bdraco
Copy link
Member

bdraco commented May 16, 2024

Will report back tomorrow and see if it drops out overnight as its going to be very obvious if there is a difference

Screenshot 2024-05-16 at 6 09 44 PM

@MatthewFlamm
Copy link
Contributor Author

MatthewFlamm commented May 17, 2024

The no data problem seems to have gone away if I change the headers
MatthewFlamm/pynws#201

This will likely be a temporary fix until the server starts messing up the cache for this new user agent again. This has happened in the past on another end point on this server. See weather-gov/api#492. They explicitly request developers not to use cache-busting techniques however (I'm not suggesting that you are proposing this).

My plan is to gather some key raw data and open a ticket.

I have confirmed in the past that our request interval would be nowhere near frequent enough to cause a problem (in regards to use terms, but practically I'm not sure?).

@MatthewFlamm
Copy link
Contributor Author

One thought for feedback... We keep track of the last successful update times (including whether data is returned) in the entry.runtime_data. Then we only raise UpdateFailed in the coordinator call when this time is >X minutes ago (say 60 minutes for now). This would prevent the shorter drop outs and still be faithful to the data source. This essentially returns us to what NWS did prior to the changes in the coordinator logic that we discussed over discord. That is, treat data as valid for X minutes even over multiple failed updates.

@MatthewFlamm
Copy link
Contributor Author

Most significant changes in 8b99d0e include using a custom DataUpdateCoordinator for the observations. This enables us to internally raise an UpdateFailed only after a certain amount of time elapses from the last successful update. In my limited testing, it is much more robust to outages in NWS.

In #117533, it was observed that getting no data in HA correlated with getting 2+ hr old data from a simpler curl fetch. We are filtering out old data in NWS integration by using the start_time parameter, which is causing no data to be returned.

I'm opening this for review now. We could consider wrapping the forecast coordinators into this change too. Pending feedback, I opted to not include this since it would make this PR even larger and there is already caching done in weather.get_forecasts logic. We need to avoid double dipping. This might become important for #117254 and would be more appropriate to add to that PR IMO.

@MatthewFlamm MatthewFlamm marked this pull request as ready for review May 20, 2024 19:07
@MatthewFlamm MatthewFlamm changed the title Retry when no data is returned in update for nws Retry when no data is returned in update and keep observation data valid for 60 min for nws May 20, 2024
@MatthewFlamm MatthewFlamm changed the title Retry when no data is returned in update and keep observation data valid for 60 min for nws Keep observation data valid for 60 min and retry with no data for nws May 20, 2024
@bdraco
Copy link
Member

bdraco commented May 20, 2024

If there are no breaking changes in the library bump, please do that in a separate PR first. Thanks!

@MatthewFlamm MatthewFlamm mentioned this pull request May 20, 2024
20 tasks
@MatthewFlamm
Copy link
Contributor Author

If there are no breaking changes in the library bump, please do that in a separate PR first. Thanks!

#117820

@MatthewFlamm
Copy link
Contributor Author

One unresolved pain point in this implementation is that if the first update fails, it will wait a full 10 minutes to make another update request. To keep the PR less unwieldy as it is already getting very lengthy, I would propose this for a future PR.

@bdraco
Copy link
Member

bdraco commented May 20, 2024

I'll put this on my production system in a bit

@bdraco
Copy link
Member

bdraco commented May 20, 2024

Screenshot 2024-05-20 at 1 10 33 PM

Looks like its better but still seems to be as long as 90 minutes before new data is returned (maybe longer)

@MatthewFlamm
Copy link
Contributor Author

Looks like its better but still seems to be as long as 90 minutes before new data is returned (maybe longer)

We can't fix the server not returning valid data for 90 minutes without cache busting, which is against the terms of use. This PR brings extra robustness in the HA integration while still marking the observations as invalid when applicable. The current state in the dev branch will continue to show data > 1 hr old, and users will not be able to know this.

@kamiyo
Copy link
Contributor

kamiyo commented May 21, 2024

This might be a separate PR, but should we add a sensor for the data being stale?

@MatthewFlamm
Copy link
Contributor Author

This might be a separate PR, but should we add a sensor for the data being stale?

The observation data will obviously be stale in this PR when the data is marked as "unavailable" in the entity. A sensor that holds the time of the observation sounds fine to me in a separate PR.

Unfortunately the forecast data model isn't the best in HA in regards to this and I'm planning on opening an issue about it.

@bdraco
Copy link
Member

bdraco commented May 22, 2024

Screenshot 2024-05-21 at 9 01 14 PM

It seems to work pretty well now as long as I don't restart. Once I restart I have to wait a bit before the data is valid again

Comment on lines -191 to +306
with patch("homeassistant.components.nws.utcnow") as mock_utc:
with patch("homeassistant.components.nws.coordinator.utcnow") as mock_utc:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we use FrozenDateTimeFactory instead of patching utcnow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to keep PR semi-manageable. This was changed to accommodate the change in location, not a change in functionality. I agree that these tests needs to be updated in the future.

@bdraco bdraco added the smash Indicator this PR is close to finish for merging or closing label May 22, 2024
@bdraco bdraco merged commit 52bb02b into home-assistant:dev May 22, 2024
24 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators May 23, 2024
@MatthewFlamm MatthewFlamm deleted the nws-retry-no-data branch May 24, 2024 15:23
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
by-code-owner cla-signed dependency integration: nws new-feature Quality Scale: platinum small-pr PRs with less than 30 lines. smash Indicator this PR is close to finish for merging or closing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants