Unexpected rejection of items when calling timegaps periodically (with "accept youngest item in category" method) #3

Open
murphyke opened this issue May 23, 2014 · 18 comments


@murphyke

I'm generating backups every 5 minutes.

Using the following command:

timegaps -d recent12,hours24,days7,weeks4,months3 *.Fc

I am seeing a rolling set of 13 files. Timegaps doesn't seem to be keeping the hourly files. I was hoping to keep 12 files for the most recent hour, then 1 file per hour for the most recent day, etc, etc. Is my usage correct?

@jgehrcke
Owner

Hey Kevin,

your usage looks valid. From the information given, I cannot tell whether there is an error in the time categorization logic of timegaps (I consider this a possibility, but that logic is already quite heavily tested). It could also be that there is a mismatch between your expectations of what the program should do and the program specification. Or your file modification times may not behave the way you think.

I assume that you have read through the extended help, but let me explain some important bits here again:

In any case, the recent12 rule will accept up to 12 items if they are younger than one hour. So there probably is one item of age ~0 (just created), then one item every five minutes up to an item of age ~55 min, plus one item of age ~60 min. The latter is not covered by recent12, but it should be the youngest item accepted by the hours24 rule.

So, the 13 items you are seeing as accepted: are these the 12 recent ones plus one roughly one-hour-old item?

Obviously, if you have run your regular backup creation for longer than a couple of hours, the hours rule should provide you with one item per hour. I do not know why this does not happen for you.

The logic itself works; it is covered by various unit tests. For instance, we have a unit test where each time category (recent, hours, days, ...) is represented with at least 60 different time count values (1 hour old, 2 hours old, ... 60 hours old), and this set is then filtered by the rules recent5,hours48,days10,weeks6,months12,years4. There is one "reducing overlap" between hours and days, so in that case we expect 84 items to be accepted, and this works. There are more tests that partly cover your case, including direct command line tests, but of course there might still be some nasty constellation that I have not covered so far.
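For orientation, here is a quick arithmetic sketch of where the expected count of 84 comes from, assuming each rule's maximum is actually reached:

```python
# Per-rule maxima for recent5,hours48,days10,weeks6,months12,years4.
rule_counts = {"recent": 5, "hours": 48, "days": 10,
               "weeks": 6, "months": 12, "years": 4}

# 5 + 48 + 10 + 6 + 12 + 4 = 85; the one "reducing overlap" between
# hours and days means one item is counted by both rules, so 84 distinct
# items end up being accepted.
expected_accepted = sum(rule_counts.values()) - 1
print(expected_accepted)  # 84
```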

I strongly recommend leaving out --delete until we have clarified this, and for further debugging I definitely need to understand or even reproduce your situation as closely as possible. In a situation where timegaps rejects what you consider too many items -- can you provide me with as much useful information as possible?

The output of timegaps (possibly in debug mode with -vv), the file system contents including file modification times (an ls output or something comparable; the direct age of the items would also be great, see http://superuser.com/q/169858), and everything else that you think might be important. That would be super great!

@jgehrcke
Owner

I just want to let you know, before you invest more time into this: I have re-created your application scenario and will see if I observe the same issue.

@jgehrcke
Owner

Bad news: you have found a fundamental design flaw. The time categorization algorithm must be conceptually safe. It is something that needs to be designed quite carefully, and when I used pen and paper for the first drafts, I created concepts that turned out to be wrong in some way. I hoped that the current algorithm was conceptually safe, but it is not. The flaw is trivial to understand once you see it, but it is not easily caught by unit tests, since unit tests test the concept, and not necessarily beyond the concept.

This is the problem: all items are categorized into buckets, where one bucket represents a time category (e.g. hours) and a time count (e.g. 2). Let us name this example bucket hours:2; it contains all items with an age of 2 hours <= age < 3 hours. Only one item per bucket is to be accepted. The current algorithm accepts the youngest of the items in a bucket and rejects all others. When timegaps is invoked frequently enough, this means that items can be rejected (and removed, if you say so) before they ever move to the next older bucket (the one with a higher time count within the same category, e.g. from hours:1 to hours:2). If deleted, such an item is then lost.
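To illustrate, here is a minimal sketch of the scheme described above (not the actual timegaps code; the function names are made up for this example):

```python
from collections import defaultdict

HOUR = 3600  # seconds

def categorize_hours(ages, max_hours=24):
    """Sort item ages (in seconds) into hours:N buckets, N hours <= age < N+1 hours."""
    buckets = defaultdict(list)
    for age in ages:
        n = int(age // HOUR)
        if 1 <= n <= max_hours:
            buckets[("hours", n)].append(age)
    return buckets

def accept_youngest(buckets):
    """Current behavior: keep only the youngest item per bucket."""
    accepted = {key: min(items) for key, items in buckets.items()}
    rejected = [age for key, items in buckets.items()
                for age in items if age != accepted[key]]
    return accepted, rejected

# With 5-minute backups, hours:1 repeatedly contains two items, e.g. one
# 60 minutes old and one 65 minutes old. The 65-minute item is rejected,
# so no item ever survives long enough to reach hours:2.
accepted, rejected = accept_youngest(categorize_hours([60 * 60, 65 * 60]))
print(accepted)  # {('hours', 1): 3600}
print(rejected)  # [3900]
```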

In your case, after a while there are two items in the hours:1 bucket. The older one is rejected, i.e. it drops out, and you chose to delete/move it. Upon the next invocation, the oldest item in hours:1 is the one that was accepted before, but now it drops out and is replaced by one that is 5 minutes newer. No item ever ends up in the hours:2 bucket, let alone in "older" buckets.

One-time and infrequent invocations of timegaps produce the expected result. The problem you have observed is the result of quite frequent invocations. For the problem to appear, items must be created and timegaps invoked on a time scale shorter than the time span covered by a single time count in a given category. This is severe, because what you observe with 5-minute creation/invocation and hour-categorization can likewise happen with, say, weekly creation/invocation and year-categorization.

A way to fix this conceptual flaw would be to accept the oldest item in a bucket and to reject all younger ones. That way, the oldest item in a certain bucket is carried forward until it jumps into the next bucket, e.g. from hours:1 to hours:2. Sounds reasonable at first glance. The reason why I did not implement it that way in the first place is that I thought it would be best to always keep the newest backup in a certain bucket.
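As a sketch, the only change relative to the snippet above would be to keep the oldest item per bucket instead of the youngest (again hypothetical code, not a patch against the real implementation):

```python
def accept_oldest(buckets):
    """Proposed behavior: keep only the oldest item per bucket."""
    accepted = {key: max(items) for key, items in buckets.items()}
    rejected = [age for key, items in buckets.items()
                for age in items if age != accepted[key]]
    return accepted, rejected

# The 65-minute item now survives every run, keeps aging, and eventually
# crosses over into the hours:2 bucket instead of being deleted.
accepted, rejected = accept_oldest(categorize_hours([60 * 60, 65 * 60]))
print(accepted)  # {('hours', 1): 3900}
```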

I need to think about possible side effects of this approach in other application scenarios. If I conclude that this new approach is safe, I will adjust the implementation, the unit tests, and the documentation.

Primarily, I am a bit embarrassed that such a booboo made it into a release. However, this algorithm was not the result of a longer scientific study, so I had to expect that there could still be conceptual problems. And finding those requires tests. This is a very good example of how broad real-world tests are at least as important as unit tests. You are one of the first users of timegaps I am aware of, so you had to carry that pain. I hope that you did not suffer any important data loss, and big thanks for the feedback!

@murphyke
Author

I'm happy to help improve something that will be potentially useful to people. Good luck making this robust.

The 'accept oldest' approach seems reasonable in the application of managing backups, as long as it results in the first daily backup being 24 hours older than the oldest hourly backup.

For testing, perhaps you could come up with a way of abstracting the time periods, so they could be redefined as completely different (much shorter) intervals for the purposes of the tests.

@jgehrcke
Owner

Thanks. So far, the 'accept oldest' approach looks reasonable, also with respect to the criterion you mentioned. However, I am still watching out for possible race conditions and edge cases, and I have not yet found the time to make sure of these things.

While what you propose for testing would certainly work, I find "time simulation" more straightforward and easier to debug (this can easily be done by taking control of the item modification times as well as the reference time). Based on this, I am currently implementing more complex unit tests that can easily cover, for instance, the usage scenario you have reported here.
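Roughly, the idea looks like this (a hypothetical test sketch; make_item and the assertion described in the comments are invented for illustration, not the real timegaps test API):

```python
import datetime

# Freeze the "now" that the filter would normally take from the system clock.
REFERENCE_TIME = datetime.datetime(2014, 6, 1, 12, 0, 0)

def make_item(age_minutes):
    """Fabricate an item whose modification time lies age_minutes in the past."""
    return {"mtime": REFERENCE_TIME - datetime.timedelta(minutes=age_minutes)}

# Simulate the reported scenario: one backup every 5 minutes over 3 hours.
items = [make_item(m) for m in range(0, 3 * 60, 5)]

# A test would then run the categorization against REFERENCE_TIME and assert,
# for example, that exactly one item per elapsed hour is accepted -- without
# ever having to wait for real time to pass.
```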

Thanks for your willingness to help; I will get back to you when this becomes useful. Before that, however, I need to get the current code back onto a solid foundation. It will take some time!

@jgehrcke
Owner

Btw, abstracting the time periods to arbitrary ones was once a goal of mine as well. However, the casual user should not be forced to think in very complex terms, so the software must provide all required intelligence (and at least print warnings for obviously dangerous configurations). I soon concluded that all this is much easier with a fixed set of time categories. It would for sure be possible to come up with a generic approach, but making it work reliably is quite difficult, I think. Maybe something for the future, after all the "lessons learned" with a fixed set of categories.

@jgehrcke jgehrcke changed the title Question about usage Unexpected rejection of items when calling timegaps periodically Jun 2, 2014
@jgehrcke
Owner

jgehrcke commented Jun 2, 2014

Update: In my thought experiments I came across another constellation that might lead to unexpected results even with the accept-oldest approach. I need to come up with a more systematic approach.

@rschwietzke

@jgehrcke I really like timegaps because it addresses my need to remove older btrfs snapshots. Thank you very much. But I hit the same defect, basically removing the latest snapshot all the time. Are you still developing timegaps?

@jgehrcke
Owner

Dear René, I very much appreciate your comment here. It always was my plan, and still is, to consolidate the algorithmic core of timegaps so that it no longer suffers from this issue. With this bug report I realized that the initial concept for this core part was not thought through well enough. My plan then was not to fix the issues as they appear with obvious patches, but to revise the entire concept. Life got in the way, and this is, as always, a matter of priorities. This particular issue pops into my mind again and again, and with your comment in place I feel especially motivated to resolve it. However, I cannot make any promises about the timeline. Cheers!

@rschwietzke

@jgehrcke I would do it myself, but I am horrible at Python... just a Java guy... sorry. But I volunteer to test and review if you like. Thanks again for the code and thanks for making it open source.

@lapseofreason

It might be worthwhile to look at the retention policy of btrbk, which implements a similar retention policy for btrfs snapshots.

They seem to have come across the same problem and changed their retention policy to keep the first backup in each bucket as of version 0.23.

The project is quite active, so their policy should have been tested in practice.

Btw, thanks for this useful tool. I'm surprised that it is not more widely known, as I think this is quite a common use case.

@jgehrcke
Owner

jgehrcke commented Nov 8, 2017

@lapseofreason thank you so much for leaving this feedback and for connecting the dots. In fact, I have paper notes somewhere with a hopefully consolidated algorithm that I believe is not affected by the drift discussed in this issue. Right now I am not sure whether the simple change made in the btrbk project is equivalent to what I have in my notes. I have not been 100 % satisfied with my consolidated algorithm, because I still do not have a formal proof that it retains all data as expected.

I want to leave a quote here, from btrbk's upgrade notes, explaining this commit:

Preserve first instead of last snapshot/backup

btrbk used to always transfer the latest snapshot to the target location, while considering the last snapshot/backup of a day as a daily backup (and also the last weekly as a monthly). This made it very cumbersome when running btrbk in a cron job as well as manually, because the last manually created snapshot was immediately transferred on every run, and used as the daily backup (instead of the one created periodically by the cron job).

The new semantics are to consider the first (instead of last) snapshot of a hour/day/week/month as the one to be preserved, while only transferring the snapshots needed to satisfy the target retention policy.

@dvhirst

dvhirst commented Dec 11, 2017

@jgehrcke - Thanks for this very useful script. I'm running into the same issue with this command line:
/usr/local/bin/timegaps -m /mnt/ --time-from-basename daily-%Y%m%d-%H%M%S.tar.7z recent2,days7,weeks5,months12 *.tar.7z >> /var/log/sitebu.log
For now, I'll work around it by keeping the last 30 days of daily backups. I'd really like to have the more complete set of backups, in case an infrequently maintained system needs them for recovery. I'm eager to see your fix in place. Thanks.

@dvhirst

dvhirst commented Mar 14, 2018

@jgehrcke - Is there any chance you'll be fixing this issue at some point? I'm still using the script, but it's caused me problems on a couple of occasions. Thanks.

@jgehrcke
Owner

Is there any chance you'll be fixing this issue at some point?

There is a chance, yes :). Thanks for expressing your interest. That's probably more important than people sometimes think.

@thedaveCA

I’m definitely interested too; I just didn’t want to add the noise of a “me too”. I love the product and its functionality outside of this one hiccup.

@dvhirst

dvhirst commented Sep 27, 2018

@jgehrcke - Thanks for your response; I'm still eager to see this potentially very useful utility running properly. It would make my work much easier and more complete. Also, I suspect a properly working utility with the capability you describe and seek to provide could become decidedly more popular than it is now. Best wishes.

@dvhirst

dvhirst commented Jan 21, 2019

@jgehrcke - I remain interested in this utility and continue to hope that you will take it up again and provide a fully functional release. Thanks.

@jgehrcke jgehrcke changed the title Unexpected rejection of items when calling timegaps periodically Unexpected rejection of items when calling timegaps periodically (with "accept youngest item in category" method) Mar 6, 2019