showing full command line when lockfile exists #237

Open
kberry opened this issue Aug 19, 2019 · 10 comments

Comments
kberry commented Aug 19, 2019

For the reasons I wrote in #236, it would be nice if rsnapshot showed the full command line of the already-running process when it aborts because the lock file exists. That would make it easier to tell which backup level is taking longer than expected, so the cron times can be adjusted.

I hacked this in using /proc/$pid/cmdline (converting the NUL separators that interface uses to spaces), but clearly it would be more portable to put the command-line arguments into the lockfile and then cat the file, or any of a myriad other approaches. Sorry, but this is all I've got to show.

Thanks for considering.

@@ -2452,2 +2452,4 @@
-                       print_err("Lockfile $lockfile exists and so does its process, can not continue");
-                       syslog_err("Lockfile $lockfile exists and so does its process, can not continue");
+                       chomp(my $cmdline = `cat /proc/$pid/cmdline`);
+                       $cmdline =~ s/\0/ /g;
+                       print_err("Lockfile $lockfile exists and so does its process $pid ($cmdline), can not continue");
+                       syslog_err("Lockfile $lockfile exists and so does its process $pid ($cmdline), can not continue");
sgpinkus (Member) commented Aug 19, 2019

Why do you need to see the full command line? What extra information is available from the full command line? It's just going to be more or less "rsnapshot -c /my/config/file some-level", isn't it? And you'll see that in your log files if you're running from cron (grep "CRON" /var/log/syslog), and the same goes for systemd anyway?

Also, I'm not sure how portable "/proc/$pid/" is.

kberry (Author) commented Aug 19, 2019

It is precisely the "some-level" part which is interesting to see. When the lock file exists, the whole question is: what level created it (i.e., which level is taking longer to run than expected)? Yes, the information is in the various log files, but in my experience it is not simple to correlate the pid from the "lock file exists" mail with the process that created the lock file and the process that aborted. Painful enough that I went to the trouble of hacking the code.

Right, as I said, I also doubt /proc/$pid/cmdline is universally portable. It was just the simplest method so I could figure out my cron timing problems. Clearly it would be simple enough to write the cmdline to the lock file and read it back, to avoid going through /proc.
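Something along these lines, perhaps (a minimal sketch only, not rsnapshot's actual lockfile code, which as far as I know stores only the pid; the path and structure here are assumptions for illustration):

#!/usr/bin/perl
# Sketch only: store the pid and full command line in the lock file at
# startup, and read them back when another run finds the lock.
use strict;
use warnings;

my $lockfile = '/var/run/rsnapshot.pid';    # hypothetical path

sub write_lock {
    open(my $fh, '>', $lockfile) or die "can't create $lockfile: $!";
    print $fh "$$\n$0 @ARGV\n";             # line 1: pid, line 2: command line
    close($fh);
}

sub report_existing_lock {
    open(my $fh, '<', $lockfile) or die "can't read $lockfile: $!";
    chomp(my $pid     = <$fh>);
    chomp(my $cmdline = <$fh>);
    close($fh);
    warn "Lockfile $lockfile exists; held by process $pid ($cmdline)\n";
}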

sgpinkus (Member) commented:

Fair enough. Yes to pid, no to full command line. It's overkill. Feel free to PR.

davidcantrelluk2 commented Aug 22, 2019

I think I'd be OK with showing the command line, but note that in the patch it would show the first process's pid but the second process's command line, which is definitely not what you want!

Something like this would be OK I think if verbosity is turned up:

Lockfile blahblah exists and so does its process, can not continue:
  existing process 12345: /usr/bin/rsnapshot daily
  current process 23456: /usr/bin/rsnapshot weekly

As for how you get the command line - you're right, using /proc like this is not portable; it's a Linuxism. procfs exists in most modern Unixes (OpenBSD being a notable exception) but doesn't necessarily expose all the information needed. You'd need to implement it using something like Proc::ProcessTable (don't look at the source: it's a horrid lump of platform-specific C which will make your eyes bleed). Obviously that would have to be an optional dependency, just as rsnapshot uses Lchown if available.
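A hedged sketch of what that lookup might look like (this is just an illustration, not how rsnapshot would actually wire in the optional dependency; the hard-coded pid is hypothetical):

#!/usr/bin/perl
# Sketch: fetch the command line of the lock-holding process via
# Proc::ProcessTable instead of reading /proc directly. The module is
# treated as an optional dependency.
use strict;
use warnings;

sub cmdline_for_pid {
    my ($pid) = @_;
    eval { require Proc::ProcessTable; 1 } or return undef;   # optional dep
    for my $p (@{ Proc::ProcessTable->new->table }) {
        # note the module's field is spelled "cmndline"
        return $p->cmndline if $p->pid == $pid;
    }
    return undef;
}

my $pid     = 12345;    # hypothetical pid read from the lockfile
my $cmdline = cmdline_for_pid($pid);
print defined $cmdline
    ? "existing process $pid: $cmdline\n"
    : "no command line available for process $pid\n";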

However, to solve @kberry's original problem - the rule of thumb is that you set up your cron jobs so that the least frequent runs first, then the next least frequent, and so on up to the most frequent. Except for the most frequent, they should all run very fast as all they're doing is renaming directories. Make sure that you have lazy deletes turned on so that deletion of expired backups is delayed to the end of the process and done after the lock file has been removed. My crontab looks like this ...

15    2    1   8   *   /usr/local/bin/rsnapshot yearly
30    2    1   *   *   /usr/local/bin/rsnapshot monthly
45    2    *   *   6   /usr/local/bin/rsnapshot weekly
0     3    *   *   *   /usr/local/bin/rsnapshot daily

So my yearly job shuffles things around at 0215, the monthly one at 0230, the weekly at 0245, and the daily does all the hard work at 0300. In practice just a one or two minute gap should be sufficient even on a heavily IO-loaded machine, provided you go in that order - basically enough time to spin up the disks if necessary, plus a few seconds.
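For reference, lazy deletes are a single directive in rsnapshot.conf (the config format requires tabs between parameter and value):

use_lazy_deletes	1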

kberry (Author) commented Aug 23, 2019

Thanks!! The lack of use_lazy_deletes was my whole problem; hence the less-frequent runs were potentially taking significant time to finish. I was using the rsnapshot packaged for CentOS 7, where it is not enabled by default :(. I should have looked harder. Thanks again.

davidcantrelluk2 commented:

Perhaps we should change that default. Thoughts, people?

kberry (Author) commented Aug 23, 2019

Is there a downside to changing the default?

In any case, perhaps it would be worth adding a caveat to the use_lazy_deletes description in the conf file, and/or the "USAGE" section in the man page, and/or the FAQ on the web site, that the upper levels can take longer. The man page now recommends increasing the time between backups when there are collisions ("If rsnapshot takes longer than 10 minutes ..."), which is what led me down this path.

kberry (Author) commented Aug 23, 2019

Also, I thought this idea was dead, but in any case I'm confused about the need for Proc::ProcessTable. The email from cron already shows the command line of the process that gets aborted. What I needed was the command line of the process that held the lock. Seems like this could easily be reported by writing the command line into the lock file, as in print join(" ", @ARGV), and then reading it back.

However, perhaps an even more useful, and simpler, idea is to have the abort message say "maybe you should set use_lazy_deletes" for backups above the first level when the directive isn't already set (a rough sketch follows below).

Just an idea. Thanks again.
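A rough sketch of that hint (hypothetical: %config_vars is a guess at rsnapshot's internals, and print_err() is assumed to work as in the patch above; this is not actual rsnapshot code):

# Hypothetical sketch: append a hint to the abort message when lazy
# deletes are off. %config_vars is assumed to hold parsed directives.
unless ($config_vars{'use_lazy_deletes'}) {
    print_err("Hint: enabling use_lazy_deletes may stop higher backup "
        . "levels from holding the lock while old snapshots are removed");
}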

StefanMoch (Contributor) commented Aug 23, 2019

> Perhaps we should change that default. Thoughts, people?

Having use_lazy_deletes as the default would make sense, as it decouples the real work (the backup) from the housekeeping (deleting old directories), so that a longer-running delete does not block the next backup.

Only one downside comes to mind at the moment: if the process is interrupted, the _delete.$$ directory has to be deleted manually.

sgpinkus (Member) commented:

👍 Agree it should be the default. Downsides: it uses a snapshot's worth of extra disk space, plus what @StefanMoch mentions. Maybe we should add a warning about unhandled "_delete.$$" dirs. I can't see much risk in adding code to detect stale "_delete.$$" dirs and restart the rm either (see the sketch below). But even without that I think the benefits outweigh the costs here. It would need a major version bump though.
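Something like this could handle the detection (a sketch under assumptions: $snapshot_root stands in for the configured snapshot_root, and _delete.<pid> is the only naming pattern to match; not rsnapshot's actual code):

#!/usr/bin/perl
# Sketch only: remove stale _delete.<pid> directories left behind by an
# interrupted lazy delete.
use strict;
use warnings;
use File::Path qw(remove_tree);

my $snapshot_root = '/.snapshots';    # hypothetical configured value

opendir(my $dh, $snapshot_root) or die "can't open $snapshot_root: $!";
for my $entry (readdir $dh) {
    next unless $entry =~ /^_delete\.(\d+)$/;
    next if kill(0, $1);    # best effort: skip if that pid still exists
    remove_tree("$snapshot_root/$entry");
}
closedir($dh);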
