showing full command line when lockfile exists #237

Open
kberry opened this issue Aug 19, 2019 · 10 comments

Comments
kberry commented Aug 19, 2019

For the reasons I wrote in #236, it would be nice if rsnapshot showed the full command line of the already-running process when it aborts because the lock file exists. That would make it easier to tell which backup level is taking longer than expected, so the cron times can be adjusted.

I hacked this in using /proc/$pid/cmdline (converting the NUL separators that interface uses to spaces), but clearly it would be more portable to put the command-line arguments into the lockfile and then cat the file, or any of a myriad other approaches. Sorry, but this is all I've got to show.

Thanks for considering.

@@ -2452,2 +2452,4 @@
-                       print_err("Lockfile $lockfile exists and so does its process, can not continue");
-                       syslog_err("Lockfile $lockfile exists and so does its process, can not continue");
+                       chomp(my $cmdline = `cat /proc/$pid/cmdline`);
+                       $cmdline =~ s/\0/ /g;
+                       print_err("Lockfile $lockfile exists and so does its process $pid ($cmdline), can not continue");
+                       syslog_err("Lockfile $lockfile exists and so does its process $pid ($cmdline), can not continue");
sgpinkus (Member) commented Aug 19, 2019

Why do you need to see the full command line? What extra information is available from the full command line? It's just going to be more or less "rsnapshot -c /my/config/file some-level", isn't it? And you'll see that in your log files if you're running from cron (grep "CRON" /var/log/syslog), and the same goes for systemd anyway?

Also, I'm not sure how portable "/proc/$pid/" is.

kberry (Author) commented Aug 19, 2019

It is precisely the "some-level" part which is interesting to see. When the lock file exists, the whole question is: what level created it (i.e., which level is taking longer to run than expected)? Yes, the information is in the various log files, but in my experience it is not simple to correlate the pid from the "lock file exists" mail with the process that created the lock file and the process that aborted. Painful enough that I went to the trouble of hacking the code.

Right, as I said, I also doubt /proc/$pid/cmdline is universally portable. It was just the simplest method so I could figure out my cron timing problems. Clearly it would be simple enough to write the cmdline to the lock file and read it back, to avoid going through /proc.
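Something along these lines, perhaps (a minimal sketch only, not rsnapshot's actual lockfile code, which as far as I know stores only the pid; the path and structure here are assumptions for illustration):

#!/usr/bin/perl
# Sketch only: store the pid and full command line in the lock file at
# startup, and read them back when another run finds the lock.
use strict;
use warnings;

my $lockfile = '/var/run/rsnapshot.pid';    # hypothetical path

sub write_lock {
    open(my $fh, '>', $lockfile) or die "can't create $lockfile: $!";
    print $fh "$$\n$0 @ARGV\n";             # line 1: pid, line 2: command line
    close($fh);
}

sub report_existing_lock {
    open(my $fh, '<', $lockfile) or die "can't read $lockfile: $!";
    chomp(my $pid     = <$fh>);
    chomp(my $cmdline = <$fh>);
    close($fh);
    warn "Lockfile $lockfile exists; held by process $pid ($cmdline)\n";
}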

sgpinkus (Member) commented:

Fair enough. Yes to pid, no to full command line. It's overkill. Feel free to PR.

davidcantrelluk2 commented Aug 22, 2019

I think I'd be OK with showing the command line, but note that in the patch it would show the first process's pid but the second process's command line, which is definitely not what you want!

Something like this would be OK I think if verbosity is turned up:

Lockfile blahblah exists and so does its process, can not continue:
  existing process 12345: /usr/bin/rsnapshot daily
  current process 23456: /usr/bin/rsnapshot weekly

As for how you get the command line - you're right, using /proc like this is not portable; it's a Linuxism. procfs exists in most modern Unixes (OpenBSD being a notable exception) but doesn't necessarily expose all the information needed. You'd need to implement it using something like Proc::ProcessTable (don't look at the source: it's a horrid lump of platform-specific C which will make your eyes bleed). Obviously that would have to be an optional dependency, just as rsnapshot uses Lchown if available.
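A hedged sketch of what that lookup might look like (this is just an illustration, not how rsnapshot would actually wire in the optional dependency; the hard-coded pid is hypothetical):

#!/usr/bin/perl
# Sketch: fetch the command line of the lock-holding process via
# Proc::ProcessTable instead of reading /proc directly. The module is
# treated as an optional dependency.
use strict;
use warnings;

sub cmdline_for_pid {
    my ($pid) = @_;
    eval { require Proc::ProcessTable; 1 } or return undef;   # optional dep
    for my $p (@{ Proc::ProcessTable->new->table }) {
        # note the module's field is spelled "cmndline"
        return $p->cmndline if $p->pid == $pid;
    }
    return undef;
}

my $pid     = 12345;    # hypothetical pid read from the lockfile
my $cmdline = cmdline_for_pid($pid);
print defined $cmdline
    ? "existing process $pid: $cmdline\n"
    : "no command line available for process $pid\n";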

However, to solve @kberry's original problem - the rule of thumb is that you set up your cron jobs so that the least frequent runs first, then the next least frequent, and so on up to the most frequent. Except for the most frequent, they should all run very fast as all they're doing is renaming directories. Make sure that you have lazy deletes turned on so that deletion of expired backups is delayed to the end of the process and done after the lock file has been removed. My crontab looks like this ...

15    2    1   8   *   /usr/local/bin/rsnapshot yearly
30    2    1   *   *   /usr/local/bin/rsnapshot monthly
45    2    *   *   6   /usr/local/bin/rsnapshot weekly
0     3    *   *   *   /usr/local/bin/rsnapshot daily

So my yearly job shuffles things around at 0215, the monthly one at 0230, the weekly at 0245, and the daily does all the hard work at 0300. In practice just a one or two minute gap should be sufficient even on a heavily IO-loaded machine, provided you go in that order - basically enough time to spin up the disks if necessary, plus a few seconds.
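For reference, lazy deletes are a single directive in rsnapshot.conf (the config format requires tabs between parameter and value):

use_lazy_deletes	1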

kberry (Author) commented Aug 23, 2019

Thanks!! The lack of use_lazy_deletes was my whole problem; hence the less-frequent runs were potentially taking significant time to finish. I was using the rsnapshot packaged for CentOS 7, where it is not enabled by default :(. I should have looked harder. Thanks again.

davidcantrelluk2 commented:

Perhaps we should change that default. Thoughts, people?

kberry (Author) commented Aug 23, 2019

Is there a downside to changing the default?

In any case, perhaps it would be worth adding a caveat to the use_lazy_deletes description in the conf file, and/or the "USAGE" section in the man page, and/or the FAQ on the web site, that the upper levels can take longer. The man page now recommends increasing the time between backups when there are collisions ("If rsnapshot takes longer than 10 minutes ..."), which is what led me down this path.

kberry (Author) commented Aug 23, 2019

Also, I thought this idea was dead, but in any case I'm confused about the need for Proc::ProcessTable. The email from cron already shows the command line of the process that gets aborted. What I needed was the command line of the process that held the lock. Seems like this could easily be reported by writing the command line into the lock file, as in print join(" ", @ARGV), and then reading it back.

However, perhaps an even more useful, and simpler, idea is to have the abort message say "maybe you should set use_lazy_deletes" for backups above the first level when the directive isn't already set (a rough sketch follows below).

Just an idea. Thanks again.
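A rough sketch of that hint (hypothetical: %config_vars is a guess at rsnapshot's internals, and print_err() is assumed to work as in the patch above; this is not actual rsnapshot code):

# Hypothetical sketch: append a hint to the abort message when lazy
# deletes are off. %config_vars is assumed to hold parsed directives.
unless ($config_vars{'use_lazy_deletes'}) {
    print_err("Hint: enabling use_lazy_deletes may stop higher backup "
        . "levels from holding the lock while old snapshots are removed");
}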

StefanMoch (Contributor) commented Aug 23, 2019

> Perhaps we should change that default. Thoughts, people?

Having use_lazy_deletes as the default would make sense, as it decouples the real work (the backup) from the housekeeping (deleting old directories), so that a longer-running delete does not block the next backup.

Only one downside comes to mind at the moment: if the process is interrupted, the _delete.$$ directory has to be deleted manually.

sgpinkus (Member) commented:

👍 Agree it should be the default. Downsides: it uses a snapshot's worth of extra disk space, plus what @StefanMoch mentions. Maybe we should add a warning about unhandled "_delete.$$" dirs. I can't see much risk in adding code to detect stale "_delete.$$" dirs and restart the rm either (see the sketch below). But even without that I think the benefits outweigh the costs here. It would need a major version bump though.
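Something like this could handle the detection (a sketch under assumptions: $snapshot_root stands in for the configured snapshot_root, and _delete.<pid> is the only naming pattern to match; not rsnapshot's actual code):

#!/usr/bin/perl
# Sketch only: remove stale _delete.<pid> directories left behind by an
# interrupted lazy delete.
use strict;
use warnings;
use File::Path qw(remove_tree);

my $snapshot_root = '/.snapshots';    # hypothetical configured value

opendir(my $dh, $snapshot_root) or die "can't open $snapshot_root: $!";
for my $entry (readdir $dh) {
    next unless $entry =~ /^_delete\.(\d+)$/;
    next if kill(0, $1);    # best effort: skip if that pid still exists
    remove_tree("$snapshot_root/$entry");
}
closedir($dh);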
