Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Sensorname] become unavailable #140

Open
gessl opened this issue Feb 15, 2024 · 23 comments
Open

[Sensorname] become unavailable #140

gessl opened this issue Feb 15, 2024 · 23 comments
Assignees
Labels
bugfix Inconsistencies or issues which will cause a problem for users or implementers. no-stale This issue or PR is exempted from the stable bot.

Comments

@gessl
Copy link

gessl commented Feb 15, 2024

Roughly every 3 minutes, the sensor will become unavailable with the following entries in the log file:

[Sensorname] became unavailable
4:05:27 AM - In 1 second
[Sensorname] Online became unavailable
4:05:27 AM - In 1 second
[Sensorname] Heating became unavailable
4:05:27 AM - In 1 second

It resumes after about a minute.

@robbinjanssen
Copy link
Owner

I'm guessing this is UWG4 related? Since all of my sensors have been available since I've updated this morning.
@adamjernst are you experiencing similar issues?

@robbinjanssen robbinjanssen added the bugfix Inconsistencies or issues which will cause a problem for users or implementers. label Feb 15, 2024
@adamjernst
Copy link
Contributor

Yeah I've seen this too. Need to investigate and add some logging. Not sure if it's transient API issues or what.

@adamjernst
Copy link
Contributor

I see it much more intermittently—only every few hours. I do suspect it might be related to transient timeouts, as mentioned in #139. I am adding more logging to my local instance to see what I can determine.

@adamjernst
Copy link
Contributor

@gessl A logging patch I am using locally:

diff --git a/custom_components/ojmicroline_thermostat/coordinator.py b/custom_components/ojmicroline_thermostat/coordinator.py
index 0c4b877..f9b0b6e 100644
--- a/custom_components/ojmicroline_thermostat/coordinator.py
+++ b/custom_components/ojmicroline_thermostat/coordinator.py
@@ -54,13 +54,24 @@ class OJMicrolineDataUpdateCoordinator(DataUpdateCoordinator):
             UpdateFailed: An error occurred when updating the data.
 
         """
+        _LOGGER.debug("Updating data")
         try:
             async with async_timeout.timeout(API_TIMEOUT):
                 thermostats = await self.api.get_thermostats()
+                _LOGGER.info(
+                    "Fetched thermostats: %s",
+                    ", ".join(t.serial_number for t in thermostats),
+                )
                 return {resource.serial_number: resource for resource in thermostats}
 
         except OJMicrolineAuthError as error:
+            _LOGGER.warning('Auth error updating data')
             raise ConfigEntryAuthFailed from error
 
         except OJMicrolineError as error:
+            _LOGGER.warning('Error updating data: %s', error)
             raise UpdateFailed(error) from error
+
+        except Exception as e:
+            _LOGGER.warning('Unknown error updating data: %s', e)
+            raise

You will need to bump up the log level for this component using e.g. the Developer Tools → Services pane:

service: logger.set_level
data:
  custom_components.ojmicroline_thermostat: debug

I am waiting for the issue to reproduce, but it's pretty rare for me. If you can reproduce and get logs please share them here!

@gessl
Copy link
Author

gessl commented Feb 15, 2024

@adamjernst I did that and suddenly it is happening scarcely. I have seen one unavailable but I hadn't set the log levels correctly. I wonder if adding logging provides enough delay to make this problem rarer? I'll post a log once I catch something useful.

@gessl
Copy link
Author

gessl commented Feb 15, 2024

@adamjernst Probably doing something wrong with the logging. Had an unavailable event but am not seeing the custom log entries.

image

@adamjernst
Copy link
Contributor

Hmm odd. Let's wait and see if it reproduces for me.

@adamjernst
Copy link
Contributor

I'm seeing the same behavior where a thermostat goes unavailable without anything obviously wrong in the logs. Time for… more logging! Stay tuned...

@adamjernst
Copy link
Contributor

My new theory is that the API is simply returning Thermostat models with online=False, which nukes all the data and results in the symptoms you're seeing. If I verify that assumption, we can discuss what to do about it.

@gessl
Copy link
Author

gessl commented Feb 16, 2024

I have to admit I'm fairly fresh to HA but as such I have observed that other entities go unresponsive too, such as nest cameras and there is no clear reason why that would be. Maybe this is what you are getting at with Thermostat returning online=false, i.e. it's some HA behavior that is actually contributing to this?

@robbinjanssen
Copy link
Owner

@gessl unfortunately we cannot blame HA here 😉 The online=False actually is returned by the python-ojmicroline package (which both Adam and I did a bunch of work in).

@adamjernst Maybe the API used is unstable/unreachable at times? Maybe add a ping sensor to the host used by your Thermostat, just to make sure the problem is the package, not the server?

@adamjernst
Copy link
Contributor

adamjernst commented Feb 16, 2024

@robbinjanssen I still haven't reproduced the issue, but my guess is that both the API & python package are perfectly fine but the thermostat itself has dodgy wifi (your original guess!) — and thus the API just returns a Thermostat with online=False and no data.

If I verify that this is the case, the question is what to do about it. I lean towards modifying the coordinator class to serve stale data for a few minutes; if the thermostat doesn't come back online within ~5 minutes, it should revert to serving what the API is actually giving. I'm curious for your thoughts (but no need to start debating until we verify that's the problem here).

edit: I guess we are really saying the same thing, anyway — yes I can add a ping sensor to verify this assumption… working on it now

This problem would also be solved if we switch from cloud-polling to using the notifications-based API. So I might just do that and be done with the whole thing.

@gessl
Copy link
Author

gessl commented Feb 16, 2024

@adamjernst For what it's worth I have not seen it go unavailable for more than a minute, so the idea to serve stale information might work well in practice just from what I have seen. If there is a clean solution that would of course be best.

@robbinjanssen
Copy link
Owner

@adamjernst

This problem would also be solved if we switch from cloud-polling to using the notifications-based API. So I might just do that and be done with the whole thing.

Do you have more information about the notification based API?

@adamjernst
Copy link
Contributor

adamjernst commented Feb 18, 2024

@robbinjanssen Sure! It's just HTTP long poll on an endpoint /api/notification, e.g.:

https://mythermostat.info/api/notification?sessionid=REDACTED&sequencenr=0

If something changes, a payload like the following is delivered:

{
    "SequenceNr": 3,
    "Action": 2,
    "Thermostat": {
        "SerialNumber": "1414444",
...

If nothing changes, the request times out after 60 seconds and is immediately retried.

A few notes:

  • Interestingly the sequencenr parameter is always 0, although the payload does include a sequence number. Seems like they never quite finished the job, and I would expect it's possible to drop notifications if your timing is bad; normally a sequence number would ensure you're getting every notification, but I guess they didn't finish it…
  • Using this doesn't reduce the total number of requests we make to the API, since we're still making a request every 60 seconds. But presumably this notification request is cheaper than requesting all thermostat data every 60 seconds.
  • The other principal advantage to this notification approach is that we get faster updates (near-instant instead of waiting for the next poll interval) and we no longer have to worry about the weird propagation/out-of-order issues that happen with polling.
  • There is also a separate long-polling endpoint that listens for new thermostats, but I feel like that's rare enough to make it not worth worrying about.

Does the WD5 API have a similar endpoint?

@adamjernst
Copy link
Contributor

adamjernst commented Feb 18, 2024

OK, I think reproduced this issue:

2024-02-17 10:54:48.797 INFO (MainThread) [custom_components.ojmicroline_thermostat.coordinator] Fetched thermostats: XXXXXX1, XXXXXX2, XXXXXX3
2024-02-17 10:54:48.797 DEBUG (MainThread) [custom_components.ojmicroline_thermostat.coordinator] Thermostat XXXXXX1 has online True vacation 2024-02-12 00:00:00-06:00 thru 2024-02-16 00:00:00-06:00
2024-02-17 10:54:48.797 DEBUG (MainThread) [custom_components.ojmicroline_thermostat.coordinator] Thermostat XXXXXX2 has online False vacation 2024-02-12 00:00:00-06:00 thru 2024-02-16 00:00:00-06:00
2024-02-17 10:54:48.798 DEBUG (MainThread) [custom_components.ojmicroline_thermostat.coordinator] Thermostat XXXXXX3 has online True vacation 1970-01-01 00:00:00-06:00 thru 1970-01-01 00:00:00-06:00
2024-02-17 10:54:48.798 DEBUG (MainThread) [custom_components.ojmicroline_thermostat.coordinator] Finished fetching ojmicroline_thermostat data in 0.399 seconds (success: True)

This results in the following logbook entries:

Bunkroom Bathroom Floor Online  was connected
10:55:48 AM - 12 hours ago
Bunkroom Bathroom Floor Online  was disconnected
10:54:48 AM - 12 hours ago

That's subtly different from what you describe, @gessl; you say that your sensors are marked unavailable, not disconnected...

I think "unavailable" might happen when the data fetch routine returns False or throws, instead of merely returning a disconnected thermostat. So back to the original idea: add some http retries...

@gessl
Copy link
Author

gessl commented Feb 18, 2024

@adamjernst I'm not sure. As you see above when it comes back it says floor (my thermostat's name) Online is connected. But it drops as "unavailable". But notice that in #139 I noted both an Error and a timeout, and have seen both. In short, I'm not sure but this could well be the same phenomenon just reporting slightly differently. Interestingly I had ping running and pings are not disrupted.

@robbinjanssen
Copy link
Owner

Does the WD5 API have a similar endpoint?

I'd have to reverse engineer the app for that again, as there is no documentation about the WD5 API. I'll try to figure it out asap

@robbinjanssen
Copy link
Owner

After looking back on some older posts on the HA community forum, I might have found something.

https://community.home-assistant.io/t/mwd5-wifi-thermostat-oj-electronics-microtemp/445601


GET https://owd5-mh015-app.ojelectronics.com/ocd5notification/negotiate?clientProtocol=1.3

Response:
{
    "ConnectionId": "7fe6ba31-be25-44cb-ac72-6b62e3ce5ecb",
    "ConnectionTimeout": 110.0,
    "ConnectionToken": "3H5PCkeD1lLHiLvVS0xklpe5apgHYIudkycXcHbUwFv2Qzy/4EFQF8SXwBT7fssqUb4UpnwIGoNsnH63nl1+M/Av2/L3fo/3UinL42Yqphf9Br3fQyeGStSXJ4kirwpc",
    "DisconnectTimeout": 30.0,
    "KeepAliveTimeout": 20.0,
    "LongPollDelay": 0.0,
    "ProtocolVersion": "1.3",
    "TransportConnectTimeout": 5.0,
    "TryWebSockets": true,
    "Url": "/ocd5notification"
}


----

GET https://owd5-mh015-app.ojelectronics.com/ocd5notification/connect?transport=serverSentEvents&connectionToken=3H5PCkeD1lLHiLvVS0xklpe5apgHYIudkycXcHbUwFv2Qzy%2F4EFQF8SXwBT7fssqUb4UpnwIGoNsnH63nl1%2BM%2FAv2%2FL3fo%2F3UinL42Yqphf9Br3fQyeGStSXJ4kirwpc&connectionId=7fe6ba31-be25-44cb-ac72-6b62e3ce5ecb
No response

This is a call to some sort of ocd5notification endpoint, is yours similar?

@robbinjanssen
Copy link
Owner

robbinjanssen commented Feb 20, 2024

Seems like this is it, there is some negotiation and connect going on here:
Screenshot 2024-02-20 at 09 06 45

But then one stream gets stuff like:

{
  "C": "xxx|hmu,5",
  "M": [
    {
      "SequenceNr": 2,
      "Groups": [],
      "Thermostats": [],
      "ThermostatRealTimes": [
        {
          "Action": 2,
          "Id": 123,
          "SerialNumber": "xx",
          "Online": true,
          "RoomTemperature": 2077,
          "FloorTemperature": 2008,
          "Heating": false,
          "SensorAppl": 3
        }
      ]
    }
  ]
}

and:

{
  "C": "xxx|hmu,4",
  "M": [
    {
      "SequenceNr": 1,
      "Groups": [
        {
          "Action": 2,
          "GroupId": 123,
          "GroupName": "Woonkamer",
          "Thermostats": [],
          "RegulationMode": 1,
          "Schedule": {
            "I_REMOVED_THIS":"FOR_READING_PURPOSES"
          },
          "ComfortSetpoint": 2350,
          "ComfortEndTime": "2024-02-20T12:02:55",
          "ManualModeSetpoint": 2500,
          "VacationEnabled": false,
          "VacationBeginDay": "2022-11-15T00:00:00",
          "VacationEndDay": "2022-11-16T00:00:00",
          "VacationTemperature": 500,
          "LastPrimaryModeIsAuto": true,
          "BoostEndTime": "2023-11-30T10:06:00",
          "FrostProtectionTemperature": 500
        }
      ],
      "Thermostats": [],
      "ThermostatRealTimes": []
    }
  ]
}

@robbinjanssen
Copy link
Owner

So I was able to reproduce this ping-pong in python, all 3 calls are needed to get the notification.

The first one is to /negotiate, it gives me a connectionId and connectionToken.
The second one can be fired together with the third one. The /send endpoint is a POST with the connection ID and Token and the body contains the session ID. The /connect is the keep-alive stream that receives all the data.

You only require one call, right @adamjernst ?

@adamjernst
Copy link
Contributor

@robbinjanssen correct! If you build notification support for WD5 thermostats and just leave it unimplemented for WG4, I will fill it in and test it...

@robbinjanssen
Copy link
Owner

robbinjanssen commented Feb 22, 2024

I'll open up an issue in the python-ojmicroline-thermostat repo :-)

Edit: robbinjanssen/python-ojmicroline-thermostat#226

@robbinjanssen robbinjanssen added the no-stale This issue or PR is exempted from the stable bot. label Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix Inconsistencies or issues which will cause a problem for users or implementers. no-stale This issue or PR is exempted from the stable bot.
Projects
None yet
Development

No branches or pull requests

3 participants