
Rsnapshot appears to be sporadically recopying, rather than hardlinking, files which have not changed. #339

Open · kupietools opened this issue May 2, 2024 · 3 comments

kupietools commented May 2, 2024

Checklist

  • I'm reporting a bug
  • I've searched the bugtracker for similar issues, including closed ones.

What did you do?

Just using rsnapshot as directed; it seems to be working in every other way besides this.

What happened?

I got a message on my server admin panel that my backup drive is out of inodes. Using rsnapshot-diff to compare backups, I've found backups where instead of hardlinking unchanged files, it recopies them... the diff says the files were removed and added again, even though they never changed on disk.

For instance, if I do rsnapshot-diff -v rsnapshot/hourly.160 rsnapshot/hourly.84, I get lines like

+ rsnapshot/hourly.84/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
- rsnapshot/hourly.160/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
+ rsnapshot/hourly.84/localhost/home/wserver/moodledata/filedir/41/cf/41cfeee5884a43a4650a851f4f85e7b28316fcc9
- rsnapshot/hourly.160/localhost/home/wserver/moodledata/filedir/41/cf/41cfeee5884a43a4650a851f4f85e7b28316fcc9
+ rsnapshot/hourly.84/localhost/home/wserver/moodledata/localcache/.lastpurged
- rsnapshot/hourly.160/localhost/home/wserver/moodledata/localcache/.lastpurged
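
A quick way to count how many paths get listed as re-added, relying on the + prefix in that output:

rsnapshot-diff -v rsnapshot/hourly.160 rsnapshot/hourly.84 | grep -c '^+'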

That moodledata folder is an archive of a project from almost 10 years ago; I haven't touched it in ages and don't have anything running that uses it. I took down Moodle in probably 2016, two complete server migrations ago.

If I do diff -bur rsnapshot/hourly.84/localhost/home/wserver/moodledata/ rsnapshot/hourly.160/localhost/home/wserver/moodledata/ I get no changes reported at all; all files are identical:

adminguy@wserver:/mnt/backups# diff -bur rsnapshot/hourly.84/localhost/home/wserver/moodledata/ rsnapshot/hourly.160/localhost/home/wserver/moodledata/
adminguy@wserver:/mnt/backups# 

Using ls, I can see that one of the listed files has indeed been removed and recopied as a new file with a different inode; the link counts are different, but the size and 9-year-old mod date are identical:

adminguy@wserver:/mnt/backups# ls -ial rsnapshot/hourly.160/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
141346 -rw-rw-rw- 46 wserver wserver 1230 Mar  6  2015 rsnapshot/hourly.160/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
adminguy@wserver:/mnt/backups# ls -ial rsnapshot/hourly.84/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
1050837 -rw-rw-rw- 93 wserver wserver 1230 Mar  6  2015 rsnapshot/hourly.84/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30

Just to confirm those files are in fact identical:

adminguy@wserver:/mnt/backups# diff rsnapshot/hourly.84/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30 rsnapshot/hourly.160/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
adminguy@wserver:/mnt/backups# 

and md5sum says so too:

adminguy@wserver:/mnt/backups# md5sum rsnapshot/hourly.84/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
ce358f7b3f99d8df31677ef86e988ded  rsnapshot/hourly.84/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
adminguy@wserver:/mnt/backups# md5sum rsnapshot/hourly.160/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
ce358f7b3f99d8df31677ef86e988ded  rsnapshot/hourly.160/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30

I thought maybe some extended attributes might have changed, but it doesn't appear there are any:

adminguy@wserver:/mnt/backups# getfattr -d -m ^ -R -- rsnapshot/hourly.160/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
adminguy@wserver:/mnt/backups# getfattr -d -m ^ -R -- rsnapshot/hourly.84/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
adminguy@wserver:/mnt/backups#
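
For anyone else checking this kind of thing, here's a quick way to line up the remaining file metadata (permissions, ownership) alongside size, mtime, inode, and link count; this assumes GNU stat:

# fields: inode, link count, mode, owner:group, size, mtime, name
stat -c '%i %h %a %U:%G %s %y %n' rsnapshot/hourly.84/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30 rsnapshot/hourly.160/localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30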

So these are the same file and nothing has changed, but somewhere in there rsnapshot didn't hardlink them; it definitely considered the file new even though it isn't, and made a new, separate copy of it.

Every time it does this, it does it with about 5GB of files at once (ordinary hourly.nnn backups have 250MB-500MB changed). I can see from sudden 10x jumps in the backup folder sizes in rsnapshot du that it does this sporadically, once every few weeks, consuming extra disk space and a huge number of inodes over time. It's almost like something fundamental is occasionally telling it "this is now a new drive; these are all new files, back up new copies of all of them instead of hardlinking to the last backup's versions," even though the files haven't changed.

Here's some excerpts from the output of rsnapshot du, showing where it happened:

adminguy@wserver:/mnt/backups/rsnapshot# rsnapshot du
481M	/mnt/backups/rsnapshot/hourly.0/
5.6G	/mnt/backups/rsnapshot/hourly.5/ <---- note 5.6G
5.9G	/mnt/backups/rsnapshot/hourly.6/ <---- note 5.9G
543M	/mnt/backups/rsnapshot/hourly.7/
462M	/mnt/backups/rsnapshot/hourly.8/
522M	/mnt/backups/rsnapshot/hourly.9/
384M	/mnt/backups/rsnapshot/hourly.10/
270M	/mnt/backups/rsnapshot/hourly.11/

The two above happened within the last few days, and I haven't worked on the server at all in well over a week, so it was nothing I did. hourly.5 appears to be what maxed out the drive's inodes, so backups couldn't continue until I went in and deleted a ton of stuff.

Kind of scary: I was running without any backups being made for a few days and didn't know it.

Here's another one from weeks ago:

199M	/mnt/backups/rsnapshot/hourly.92/
281M	/mnt/backups/rsnapshot/hourly.93/
226M	/mnt/backups/rsnapshot/hourly.94/
201M	/mnt/backups/rsnapshot/hourly.95/
222M	/mnt/backups/rsnapshot/hourly.96/
245M	/mnt/backups/rsnapshot/hourly.97/
343M	/mnt/backups/rsnapshot/hourly.118/
4.8G	/mnt/backups/rsnapshot/hourly.119/ <--------- note sudden 4.8G size, almost 20x average
94M	/mnt/backups/rsnapshot/hourly.120/
924M	/mnt/backups/rsnapshot/hourly.121/
199M	/mnt/backups/rsnapshot/hourly.122/
224M	/mnt/backups/rsnapshot/hourly.123/
214M	/mnt/backups/rsnapshot/hourly.124/
313M	/mnt/backups/rsnapshot/hourly.125/
216M	/mnt/backups/rsnapshot/hourly.126/
231M	/mnt/backups/rsnapshot/hourly.127/

Thanks!

What did you expect to happen

All copies of any unchanged file, across all hourly.nnn folders, should be hardlinks to the same file.

My configuration

config_version  1.2
snapshot_root   /mnt/backups/rsnapshot/
no_create_root  1
cmd_cp          /bin/cp
cmd_rm          /bin/rm
cmd_rsync       /usr/bin/rsync
cmd_logger      /usr/bin/logger
cmd_du          /usr/bin/du
cmd_rsnapshot_diff      /usr/bin/rsnapshot-diff
cmd_postexec    /usr/bin/journalctl --vacuum-time=2d
retain  hourly  1440
verbose         2
loglevel        3
lockfile        /var/run/rsnapshot.pid
exclude 'wp-content/updraft'
exclude '/var/lib/mysql'
exclude '*log.*.gz'
exclude '*log.gz'
exclude 'home/*/tmp'
exclude 'home/*/.cache'
exclude 'wp-content/wflogs'
exclude 'wp-content/cache/docket-cache'
link_dest       1
backup  /home/          localhost/
backup  /etc/           localhost/
backup  /usr/local/     localhost/
backup  /opt/   localhost/
backup  /srv/   localhost/
backup  /root/  localhost/
backup  /var/   localhost/

Environment

OS: Debian 12 running on a VMware Fusion VM (I haven't reconfigured or changed the VM or its drives at any time during this)
Filesystem: ext4

Other information

No response

kupietools added the bug label May 2, 2024

kupietools commented May 2, 2024

Hold it. This is really weird. I just noticed something. Look at the list above. After hourly.119 above, hourly.118 ran... and then suddenly, nothing for about 10 days, until 245M /mnt/backups/rsnapshot/hourly.97/. That's an odd coincidence, isn't it?

I can see from ls -al:

drwxr-xr-x   3 root root 4096 Mar 13 23:12 hourly.96
drwxr-xr-x   3 root root 4096 Mar 13 11:26 hourly.97
drwxr-xr-x   3 root root 4096 Mar  2 23:11 hourly.118
drwxr-xr-x   3 root root 4096 Mar  2 11:13 hourly.119

I didn't have an inode error then, or any indication backups weren't running... and hourly.97 isn't a very big backup according to rsnapshot du, either, so it seems like maybe it's still hardlinking most of its files to hourly.118 despite the gap in numbers. But that's still a weird coincidence, that backups hourly.119 and hourly.6 were both suddenly huge, and in both cases, the next backup ran, but then after that the backups silently failed for a while.

Wait, yep, I can see with ls -i that the inode numbers do change at the gaps... I spot-checked a bunch of files with ls -i rsnapshot/hourly.*/localhost/[path to unique file] | sort -n, and the inode numbers always change after the gaps at hourly.118 and hourly.5. It doesn't look like anything after the gaps is hardlinked to anything before; it's always a whole new set of fresh files.
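
Here's a rough way to check that more systematically, printing the snapshot at which a given file's inode changes (bash and GNU stat assumed; substitute any path that exists in every snapshot):

# run from /mnt/backups; FILE can be any path that exists in every snapshot
FILE=localhost/home/wserver/moodledata/filedir/d6/13/d613d55f37bb76d38d4ffb4b7b83e6c694778c30
prev=
for d in $(ls -d rsnapshot/hourly.* | sort -t. -k2 -n); do
    ino=$(stat -c %i "$d/$FILE" 2>/dev/null) || continue
    [ -n "$prev" ] && [ "$ino" != "$prev" ] && echo "inode changes at $d"
    prev=$ino
done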

So, this raises a few questions:
1.) How did this happen? I didn't delete any backup folders, and I have no idea why they stopped and started again between 118 and 97. The fact that none of the files I checked are hardlinked between those two suggests it wasn't backing up: it was incrementing the folder numbers, but the backups themselves were failing for 10 days, until they mysteriously started again. I unfortunately didn't have logging enabled. (I've shut the barn door behind the horse by turning logging on now.)

2.) Is it possible to make it hardlink against the previous folder, whatever that folder's number is, even if there is a number missing in between? If, as I suspect, it starts a whole new backup with a whole new set of files every time the previous backup failed, that seems inefficient... at least for as long as backups are failing without me knowing why.

3.) Alternatively, if the last backup failed, why not skip advancing the numbers on the next run? Don't advance hourly.1 to hourly.2 if hourly.0 doesn't exist (meaning the last backup failed); just try to create hourly.0 again without incrementing first. This would completely solve the problem. I.e., if folder 0 doesn't exist and you have folders 1 2 3, DON'T increment them to 2 3 4 before creating 0 (which leaves you with folders numbered 0 2 3 4, where 0 is a completely new set of un-hardlinked file copies). Instead, leave them as 1 2 3 so the new 0 can still hardlink to 1 if need be. Then you'll have 0 1 2 3, and NEXT time you can increment them to 1 2 3 4 before creating a new 0. (Or am I misunderstanding how all this works? The main point is that, one way or another, it seems to me rsnapshot should never create a completely fresh set of un-hardlinked backup files just because one or more backups didn't run successfully. Obviously it's not unusual for a backup to fail for some reason, and without a notification system to let us know when it's happened, this is a recipe for backup drives silently filling, as evidenced by it happening to me.)

4.) Is it possible to somehow 'stitch together' the gap, so that files in hourly.97 and forwards are replaced with hardlinks to the identical files in hourly.118 and previous (and ditto for the new hourly.0 and the identical-but-not-hardlinked files in hourly.5)? Or is deleting everything but the latest backup the only way I'm going to be able to reclaim all that space? That would make me sad. (EDIT: I found https://github.com/pauldreik/rdfind but it doesn't quite work as hoped... if you have two separate groups of hardlinks to identical files, you can't just combine them; you have to run it once for each separate duplicate you want to link, which is tough with over 140 backup sets to reattach to each other.)

5.) Why are some backups suddenly 10x-20x the average size of other backups from around the same time? Can this be avoided somehow? I don't know whether this is the cause or the effect of the gaps in the backup sequence, but it happens right before the gaps, so it's suspect.

6.) I know this has been asked about before, but I couldn't find a concrete answer that wasn't ancient: email notifications for failed backups? Pretty-please?

Thanks.

DrHyde commented May 2, 2024

On your last point, you should get email notifications from cron, showing all the console output. If you're not getting anything, try turning the verbosity up a tick. Also make sure emails from cron are getting through; temporarily create a job something like * * * * * echo I am alive to test this.
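
Something along these lines in root's crontab is all it takes (MAILTO works with Debian's stock cron; the address is a placeholder, and adjust the rsnapshot line to however you currently invoke it):

MAILTO=you@example.com
# test job to confirm cron mail gets through at all
* * * * * echo I am alive
# real job; cron will mail whatever output it produces
0 * * * * /usr/bin/rsnapshot hourly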

It is very odd that you've got 20 backups just missing.

At this point it would be really handy to see the relevant bits of your logfile. Did logger log anything to syslog? Unfortunately you don't have an rsnapshot-specific logfile configured. For now, I think all I can suggest is that you configure one and then get back to us when it happens again.

kupietools (Author) commented

Thanks for your response! Re email notification: ah, yeah, I turned off emails from cron wholesale. I was getting deluged by notices of successful cron job runs, so anything I actually needed to know about was getting buried, and besides, most of the scripts I need to know the status of already email me separately. I'll see if I can set up something more granular. But it would be nice if I could just tick a setting in rsnapshot, like I can in some of my other utilities that send email notifications of problems.
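
A wrapper along these lines would keep cron quiet unless rsnapshot actually exits non-zero (just a sketch; it assumes a working mail command, and the address is a placeholder):

#!/bin/sh
# run the backup, capture all output, and mail it only if rsnapshot failed
OUT=$(/usr/bin/rsnapshot hourly 2>&1)
if [ $? -ne 0 ]; then
    echo "$OUT" | mail -s "rsnapshot failed on $(hostname)" admin@example.com
fi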

As to the missing backups... I'll set up a logfile, but right now the most important thing to me in my torrent of verbiage above is point # 3: not why there were failed backups, but the effects of them. I'm less concerned right now with why the occasional backup problems happened (I can troubleshoot that) than with the fact that when rsnapshot misses a backup, for whatever reason, the subsequent run still seems to advance hourly.1 to hourly.2, leaving a gap and causing it to make a whole new backup, since it no longer sees an hourly.1 to hardlink unchanged files from; that next backup then once again consumes as much space and as many inodes on the backup drive as the entire set of files being backed up.

This means an occasional backup problem, for whatever reason (network connectivity, you name it), can cause your backup drive to fill all available space or run out of inodes far more quickly than expected; running out of inodes is exactly what happened to me. So I'd sure like to see a solution to that. If that were solved, then occasionally missing a backup, for reasons yet to be troubleshot, wouldn't be as much of a worry for me.

I suppose as a kludge I could have a script run before every rsnapshot backup that makes sure hourly.0 exists and, if it doesn't, renames hourly.1 back to hourly.0, so rsnapshot then rotates it back to hourly.1 and hardlinks against it rather than rotating it to hourly.2 and ignoring it for hardlinking purposes. But that seems sort of hack-y and fragile, and I'd rather see a feature within rsnapshot proper that handles this in a project-developer-approved way, if possible.
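
Roughly this, just to illustrate (untested sketch, paths hard-coded; I realize it may fight with rsnapshot's own rotation in ways I haven't thought through):

#!/bin/sh
# kludge: if the last run died before creating a new hourly.0, put hourly.1 back
# so the next run rotates it to hourly.1 again and can hardlink against it
ROOT=/mnt/backups/rsnapshot
if [ ! -d "$ROOT/hourly.0" ] && [ -d "$ROOT/hourly.1" ]; then
    mv "$ROOT/hourly.1" "$ROOT/hourly.0"
fi
exec /usr/bin/rsnapshot hourly   # adjust path/interval to match your setup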

I'm now positive, by the way, that what I listed as point # 5 was just the effect of a whole new, fresh backup set being made rather than an incremental one... I was unclear because I misunderstood the order in which rsnapshot du measures file sizes when reporting only changed files. I eventually fixed it with jdupes and recovered a bunch of drive space and inodes from my backup sets, but, with hundreds of large backup sets, I'd rather avoid ever thrashing my drive doing that again if possible.
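
(For anyone needing to do the same, the jdupes approach is roughly the following; -L replaces duplicate copies with hardlinks, so double-check the flags against your jdupes version and try it on a couple of snapshots first:)

# hardlink identical files to each other across all snapshots
jdupes -r -L /mnt/backups/rsnapshot/hourly.*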

Ok! Thanks for your work on this project, btw; despite this one issue, overall I adore it. It's a really well-made utility and a real lifesaver for me.
