How to Generate a Bullet Proof Backup #153

Closed
jonnylangefeld opened this issue Jun 27, 2022 · 6 comments

Comments

@jonnylangefeld

jonnylangefeld commented Jun 27, 2022

TL;DR: I have three questions regarding backup:

  1. what could be the reason that not all my zigbee devices show up in the network backup when following these steps?
  2. Is there a way to back up my Zigbee dongle without disabling the ZHA integration in HA? If not, what is a good way to automate that backup in a cron job?
  3. Can I fully restore my zigbee network with the backup I get after following these steps?

I spent my weekend re-adding 30 zigbee devices via ZHA after my SONOFF ZigBee 3.0 gave out. That was especially annoying for my SONOFF ZBMINI devices because they are in the walls behind switches and can only be put in pairing mode by pressing the physical button on them. It all happened when I wanted to pair a new device I had just bought. First I noticed that the device didn't show up when opening ZHA in inclusion mode, and then I noticed that none of my existing devices responded anymore. I only got errors like this in the log, which were also reported in #124

2022-06-25 07:10:04 WARNING (MainThread) [homeassistant.components.zha.core.channels.base] [0xAD5A:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Request failed after 5 attempts: <Status.APS_NO_ACK: 183>'), DeliveryError('Request failed after 5 attempts: <Status.APS_NO_ACK: 183>'), DeliveryError('Request failed after 5 attempts: <Status.APS_NO_ACK: 183>'), DeliveryError('Request failed after 5 attempts: <Status.APS_NO_ACK: 183>')]

2022-06-25 07:09:45 WARNING (MainThread) [homeassistant.components.zha.core.channels.base] [0x03EE:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Request failed after 5 attempts: <Status.NWK_NO_ROUTE: 205>'), DeliveryError('Request failed after 5 attempts: <Status.NWK_NO_ROUTE: 205>'), DeliveryError('Request failed after 5 attempts: <Status.NWK_NO_ROUTE: 205>'), DeliveryError('Request failed after 5 attempts: <Status.NWK_NO_ROUTE: 205>')]

I even had a backup of my home assistant /config directory, but I learned during this incident that this didn't help because my Zigbee stick has its own local memory. Since then I have dug into how I can back up my Zigbee dongle and followed the steps described in TOOLS.md.

However I did notice that not all my devices made it into the network backup. I have 30 devices (excluding the dongle) that show in my ZHA integration in the home assistant UI. But the network backup only contains 27:

$ /home/umbrel/home-assistant# cat data/sonoff-backup/sonoff-network-1656303167.json | jq '.devices | length'
27

What could be the reason that not all devices in the ZHA integration show up in the network backup? In the nvram backup under the ADDRMGR key I also see only 27 addresses filled.
The DEVICE_LIST looks even more weird, but this might just be my lack of understanding of it.

    "DEVICE_LIST": {
        "0x0000": "3b3e000004080000ff000000feffffff",
        "0x0001": "ffffffffffffffffff000000ffffffff",
        "0x0002": "ffffffffffffffffff310020ffffffff",
        "0x0003": "e06a110004080000ff310020feffffff",
        "0x0004": "c1a4140001040000ff310020003c0000",
        "0x0005": "ffffffffffffffffff310020ffffffff",
        "0x0006": "ffffffffffffffffff310020ffffffff",

The rest of the keys are all ffff.

Can I still rely on the backup to restore everything?

Also is there a way to get the backup without disabling the ZHA integration in HA? So far I always got an error that the serial port is already in use when I tried to run the backup while the ZHA integration is running. If not, what is a good way to still automate the backup?

@puddly
Collaborator

puddly commented Jun 27, 2022

  1. The coordinator doesn't need to keep track of every device on the network so not every device will show up in the backup. This is normal. You could even clear that section entirely and the restore would still work, though it may temporarily affect end devices joined directly to the coordinator without an intermediate router.

  2. This isn't currently possible since you can only connect to the serial port with one program at a time. Automated backups will be done by ZHA in the near future.

  3. Network state is distributed among all of the devices on your network, while the backup is just for the coordinator's state. It isn't a complete backup but it's the best you can do.

Realistically, no critical portion of the backup changes other than the frame counters:

$ diff 20220401_002152.json 20220411_005802.json
7c7
<             "creation_time": "2022-04-01T00:21:52+00:00",
---
>             "creation_time": "2022-04-11T00:58:02+00:00",
25c25
<         "frame_counter": 54933583
---
>         "frame_counter": 55054471

so you can get away with doing a single initial backup, another one a week later, and then see how much your frame counters increment every day. Pass double or triple the expected counter value to the restore command with --counter-increment and then restore.
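The estimate described above can be sketched in a few lines of Python. This is only an illustration and not part of zigpy-znp: the function names `daily_counter_rate` and `suggested_increment` are made up for this example, the inputs are the `creation_time` and `frame_counter` values shown in the diff, and the default safety factor of 3 simply follows the "double or triple" advice.

```python
import math
from datetime import datetime


def daily_counter_rate(t0: str, c0: int, t1: str, c1: int) -> float:
    """Average frame counter growth per day between two backups.

    t0/t1 are ISO 8601 timestamps (the creation_time values),
    c0/c1 are the matching frame_counter values.
    """
    elapsed = datetime.fromisoformat(t1) - datetime.fromisoformat(t0)
    days = elapsed.total_seconds() / 86400
    if days <= 0:
        raise ValueError("the second backup must be newer than the first")
    return (c1 - c0) / days


def suggested_increment(rate_per_day: float, days_since_backup: float,
                        safety: float = 3.0) -> int:
    """Expected counter growth since the backup, padded by a safety factor."""
    return math.ceil(rate_per_day * days_since_backup * safety)


if __name__ == "__main__":
    # Values taken from the two backups compared in the diff above.
    rate = daily_counter_rate(
        "2022-04-01T00:21:52+00:00", 54933583,
        "2022-04-11T00:58:02+00:00", 55054471,
    )
    print(f"about {rate:.0f} frames per day")
    # Hypothetical scenario: restoring a backup that is 30 days old.
    print("value for --counter-increment:", suggested_increment(rate, 30))
```

For the two backups in the diff this works out to roughly 12,000 frames per day, so the older the backup, the larger the value that should be passed with --counter-increment.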

> I spent my weekend re-adding 30 zigbee devices via ZHA after my SONOFF ZigBee 3.0 gave out.

Did you do a backup/restore after it stopped working? Do you happen to have a backup?

> The DEVICE_LIST looks even more weird, but this might just be my lack of understanding of it.

The NVRAM backup is completely opaque: there's very little human-readable information in it since it's a raw dump of the internal structures of the stick.

@jonnylangefeld
Author

Thank you so much for your swift reply! Good to know that not all devices need to be in the backup. Regarding no. 2, since posting I found this, which seems to allow calling the backup from within Home Assistant. The blueprint in that repository has an example of a nightly backup. They don't mention disabling the ZHA integration first. Maybe the toolkit works because it runs inside Home Assistant and is therefore the same user of the serial port? I will try that out.

Thanks for the hint regarding the frame counter!

> Did you do a backup/restore after it stopped working? Do you happen to have a backup?

After it stopped working I only had a backup of my home assistant /config, including zigbee.db (which I restored, but that didn't help), and no backup of my SONOFF ZigBee 3.0 dongle. I didn't know the dongle had local storage that needed to be backed up, and I learned that the painful way.
I did back up the broken state (because that's when I started reading up on dongle backups), at a time when nothing was working anymore, including all the devices that had been working reliably for months. Just out of curiosity I analyzed that broken-state backup and noticed something interesting: there were 72 devices in that backup, but 52 of them did not have the link_key object:

╰─ cat sonoff.json| jq '.devices | length'
72
╰─ cat sonoff.json| jq '[.devices[] | select(.link_key == null)] | length'
52
╰─ cat sonoff.json| jq '[.devices[] | select(.link_key != null)] | length'
20

At the time of that backup I must have had ~24 actual physical devices (I have 30 now, as said in my original post, but the whole corruption of the network happened as I was adding some newly bought devices). Definitely nowhere near 72. Do you think the corruption could have to do with that?

If you're curious I can send you the backup during the corrupted state on some private channel?

@jonnylangefeld
Author

I just re-read your question

> Did you do a backup/restore after it stopped working? Do you happen to have a backup?

and I think you might be asking whether I actually backed up after it stopped working and then restored. Not sure if I'm interpreting this right 😄

  • When it stopped working I did not have a backup from before it stopped working, because I didn't know about the controller backups
  • After it stopped working and before I recreated a whole new network to fix it all, I created a network and nvram backup of that broken state. Since everything was broken and I felt like I couldn't break any more, I even removed the 52 devices that had no link_key in the backup via
    cat sonoff.json | jq '.devices |= map(select(.link_key != null))' > sonoff-changed.json
    
    and then loaded that network backup with the 20 remaining devices, but that did not help anything. I even did a zigpy_znp.tools.nvram_read and a zigpy_znp.tools.nvram_write of that broken state backup, but that did not help either. I still have those backup files if you're curious.
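For anyone who wants to script this step, the jq count and filter above can be sketched in Python. It assumes only what the jq commands already rely on: a top-level `devices` list whose entries have a `link_key` that may be missing or null. The helper names `split_by_link_key` and `keep_keyed_devices` and the sample data are made up for the example.

```python
import json


def split_by_link_key(backup):
    """Split the backup's devices into (with link_key, without link_key)."""
    with_key, without_key = [], []
    for device in backup.get("devices", []):
        # A missing key and an explicit null are treated the same way,
        # mirroring jq's `.link_key == null`.
        if device.get("link_key") is not None:
            with_key.append(device)
        else:
            without_key.append(device)
    return with_key, without_key


def keep_keyed_devices(backup):
    """Return a copy of the backup with only devices that have a link_key.

    Roughly equivalent to: jq '.devices |= map(select(.link_key != null))'
    The input dictionary is left unmodified.
    """
    with_key, _ = split_by_link_key(backup)
    return {**backup, "devices": with_key}


if __name__ == "__main__":
    # Made-up sample; a real backup would be loaded with json.load().
    sample = {"devices": [{"name": "a", "link_key": {"key": "00"}},
                          {"name": "b"},
                          {"name": "c", "link_key": None}]}
    with_key, without_key = split_by_link_key(sample)
    print(len(with_key), "with link_key,", len(without_key), "without")
    print(json.dumps(keep_keyed_devices(sample), indent=4))
```

Writing the filtered copy to a new file, rather than overwriting the original, keeps the untouched backup around for comparison.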

Now that I recreated the network everything is actually working nicer than before, because I did the energy scan and found that channel 25 is way less busy than channel 15 for me. So while it was really annoying to re-pair everything, I actually have hope that I improved everything in the process. Now that everything is working again and I have 30 devices, I see this on the backup:

╰─ cat data/sonoff-backup/sonoff-network-1656303167.json| jq '[.devices[] | select(.link_key != null)] | length'
26
╰─ cat data/sonoff-backup/sonoff-network-1656303167.json | jq '.devices | length'
27

@puddly
Collaborator

puddly commented Jun 27, 2022

Glad it worked out for you in the end.

Can you email me all of the broken and working NVRAM and network backups? I'm curious to see what broke, and if it's something that I can potentially correct in software.

@jonnylangefeld
Author

just sent via email!

@puddly
Collaborator

puddly commented Jun 27, 2022

Got it, thanks.

There are many child devices in the backup with bogus IEEE addresses so I suspect this is some bug with the firmware. It may have been possible to fix by deleting them from the backup and then restoring but since you're running a new network it's a moot point.
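As a rough illustration of how such entries could be spotted before a restore, here is a small Python sketch. Everything in it is an assumption rather than something stated in this thread: the field name `ieee_address`, the helper name `suspicious_devices`, and above all the heuristics (an address that is not 16 hex digits, is all zeros or all ones, or appears more than once). The actual bogus entries in the backup may look different, so treat the result as a list of candidates to inspect by hand.

```python
import re
from collections import Counter

# An IEEE (EUI-64) address is 8 bytes, i.e. 16 hex digits.
HEX16 = re.compile(r"^[0-9a-f]{16}$")


def suspicious_devices(devices, field="ieee_address"):
    """Return (device, reasons) pairs for entries that look implausible.

    ASSUMPTION: `field` is the key holding the address, and the checks
    below are guesses at what "bogus" could mean.
    """
    def normalize(device):
        # Accept both "00124b..." and "00:12:4b:..." spellings.
        return str(device.get(field, "")).replace(":", "").lower()

    counts = Counter(normalize(d) for d in devices)
    flagged = []
    for device in devices:
        address = normalize(device)
        reasons = []
        if not HEX16.match(address):
            reasons.append("not 16 hex digits")
        elif address in ("0" * 16, "f" * 16):
            reasons.append("all zeros or all ones")
        if counts[address] > 1:
            reasons.append("duplicate")
        if reasons:
            flagged.append((device, reasons))
    return flagged


if __name__ == "__main__":
    # Made-up sample data for demonstration only.
    sample = [{"ieee_address": "00124b0012345678"},
              {"ieee_address": "ffffffffffffffff"}]
    for device, reasons in suspicious_devices(sample):
        print(device, "->", ", ".join(reasons))
```

Any device flagged this way should be reviewed manually before it is deleted from a copy of the backup.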

In the future, the zigpy_znp.tools.network_backup tools will be deprecated in favor of the zigpy CLI so be aware that the latter format is the more up-to-date one.

puddly closed this as completed Jun 27, 2022