Unexpected loss of items with "accept oldest item in category" approach #7
How come the total item count invariant is violated in run 288? What's special about run 288? Which dynamics led up to this loss of an item? For starters, that's absolutely non-obvious.

So, back then, in December 2017, I did some rudimentary visualization to help understand these dynamics. Sharing the plots now, with some brief discussion. Let's look at this plot:

This shows the distribution of items for the last few individual timegaps runs. The y axis shows an arbitrary unit. The distribution of items in time directly after run 288 is visualized by the topmost row. We kind of intuitively see that the second item from the left is missing. There seems to be a "hole" in there.

But we also see that another item went missing before, in run 276 (the 4th row from the bottom). It went missing without violating the total item count invariant. Which events led the item density distribution to change unexpectedly during run 276? Once we can answer this question, we understand the problem much better. One obvious goal from here is to find another, more strict invariant that fails the experiment already on run 276 (or earlier).
(This is a lovely example of how a really simple set of rules can result in totally unintuitive dynamics -- the physics of complex systems is full of stuff like that.)
What happened around run 275/276? Right after run 275 the time differences between three selected items are:
It's fair to say that we have at least one item per 24 hours among those three items shown (why is the exact difference 23 hours? No need to answer that here, but I want to note that this again is far from obvious and an interesting result of the dynamics). After run 276, the time differences are significantly shifted:
All individual item categories are still populated. But the time distance between particular neighboring items shifted so that it became larger than "what we want". Now we don't have one item per 24 hours anymore. From here we can seemingly formulate one more helpful invariant, for testing or even for the filtering algorithm itself.
One invariant that should be maintained across timegaps runs in this type of experiment: the filtered items, sorted by time from newest to oldest, must have a time difference between adjacent items that only grows towards the past and never shrinks (i.e., it is monotonic). In code (note: `itertools.pairwise` requires Python 3.10+):

```python
from itertools import pairwise

import numpy as np


def np_is_ascending(array):
    return np.all(array[:-1] <= array[1:])


def sort_items(items):
    # Sort items from newest (first) to oldest (last).
    return sorted(items, key=lambda i: i.modtime, reverse=True)


def inspect_item_deltas(items):
    # Sort items from newest (first) to oldest (last).
    # (that is so far not an API guarantee, but maybe it should be one)
    sorted_items = sort_items(items)
    # Calculate the (positive) time difference between adjacent items.
    deltas = [a.modtime - b.modtime for a, b in pairwise(sorted_items)]
    # Test for the invariant: the filtered items, sorted by time from newest
    # to oldest, must have a time difference between adjacent items that
    # only grows towards the past and never shrinks (is monotonically
    # non-decreasing).
    assert np_is_ascending(np.array(deltas))
    return deltas
```

Tried this with the example scenario described above; this indeed failed for run 276. That's a better, more strict invariant than the total item count invariant.
For my project (zfs_autobackup) I started experimenting with a progressive thinning algorithm (https://github.com/psy0rz/zfs_autobackup/blob/bf985998b38a23d3dd6263f89d177b2310493c54/zfs_autobackup#L832). Instead of unit tests I opted to test it using "fuzzy" timestamps at random intervals, to prevent exactly the problems you had. (This was before I found your project.)

One thing I realized: DON'T use relative time to define the buckets! Because then all the buckets "shift" all the time, and you get all kinds of weird and wonderful behaviour. Instead, use absolute epoch time to identify the buckets, e.g.: bucket_nr = int(snapshot_time_in_seconds_since_epoch / period). This way a snapshot always stays in the same bucket; nothing is shifting. This gave very good results.

The reason I went looking at other solutions is that I have this problem. Example: the user chooses to keep only one yearly snapshot. That snapshot will be removed as soon as it is 365 days old. At that time the "next oldest" will become the yearly snapshot. This one, however, can be quite young (if there are other rules in use; otherwise it will be the snapshot of right now). So I have some ideas how to solve it, but I was looking for a clever way, or a library that already solved these issues. I think looking at your approach helped me get a better idea for a solution.
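The absolute-epoch bucketing idea described above can be sketched in a few lines of Python. This is a hypothetical illustration; the function and variable names are mine, not from zfs_autobackup:

```python
def bucket_nr(snapshot_time, period):
    # A snapshot's bucket is derived from absolute epoch time, so it never
    # changes as 'now' advances. 'period' is the bucket width in seconds.
    return int(snapshot_time // period)


def keep_oldest_per_bucket(timestamps, period):
    # Keep the oldest snapshot in each absolute bucket.
    kept = {}
    for ts in sorted(timestamps):
        # setdefault: the first (oldest) timestamp claims the bucket.
        kept.setdefault(bucket_nr(ts, period), ts)
    return sorted(kept.values())


# Daily buckets (86400 s): timestamps 0 and 1000 share bucket 0, timestamps
# 90000 and 100000 share bucket 1; only the older one per bucket is kept.
print(keep_oldest_per_bucket([0, 1000, 90000, 100000], 86400))  # → [0, 90000]
```

Because the bucket index depends only on the timestamp and the period, re-running the thinning later can never move a snapshot into a different bucket.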
OK, I completed the thinning classes along with some test code. Maybe someone can use it: https://github.com/psy0rz/zfs_autobackup/blob/34d0c5d67bd483239ad625265226327b1d7a417c/zfs_autobackup#L82 (it's a permalink; also check the latest version, in the future :)
Is there a rule-set for the existing implementation that will safely thin out a set of backups over a multi-year period, resulting in stable use of storage, e.g., in the cloud? The rule-set should allow for keeping 1 backup that is a year old, 12 backups that are less than 1 year old (one per month), and the 30 most recent daily backups. I had tried in 2018 to set up a sequence like this and ran into a failure, hence my prior communications. I see no new release of the utility, so I presume that the flaw mentioned in this post still exists. I am still interested in making use of this almost highly useful utility. Thanks. Don Hirst
Hello, I'm still following this thread. I would be extremely gratified to see an update that fixes the problem. For now, I am using this to manage a simple 30-day thinning, but would really prefer to use my initial pattern, which doesn't work with the current implementation. Thanks. Don Hirst
2021 01 04; 21:04 PDT: Just checking in to see if there is any chance the error described above will be fixed with a new release. Would greatly appreciate such action! |
@dvhirst thank you for your endless interest and patience! :) I really don't know if I will ever get back to this in this current life, haha. I better do not make any promises. The passion is there -- I would love to work on this again! Let's see. |
A truly elegant solution, if it works properly! Good luck, and many thanks for thinking on the issue.
DVH
I’m trying to find a workaround for this in what I believe should be your base case: a set of daily backups which need thinning, but where deleting the most recent is unacceptable. In normal testing, since the daily backup is at 3 AM and I don’t get up that early, there doesn’t seem to be a way to do this. But if I schedule the cron job within an hour of the backup completing, a test with a bogus file suggests that the most recent might be preserved with a pattern containing

I’ve noticed that there are a few forks of this project; do any of them implement protection of the last file? Should I install a version of timegaps before the change? Could I just alter the source on our disk so that it never deletes the most recent item? (Which should be the default, in my opinion.)
I love the thought-provoking idea here, btw, @psy0rz. But I have to argue that thinning out items based on age makes sense, and age is by definition a duration relative to now.
Yeah, but you look at the age of a bucket. And a bucket should always be at the same point in time, using the method I described. I wrote quite a few regression tests for this, and many people have been using it successfully in zfs-autobackup for years now. So it works, without quirks :)
Link to the latest version: https://github.com/psy0rz/zfs_autobackup/blob/master/zfs_autobackup/Thinner.py
Thanks for the discussion! Hm. So. These buckets would not have a predictable relationship to the concept of 'age'. However, the user only thinks in terms of the age of items. With buckets defined absolutely in time, the user would need to provide rules based on absolute time, too (they would say "keep one item per year for the years 2000-2022" as opposed to "keep one item per year for the last 22 years"), and then their rules would become out-of-date as time progresses.

Looking at the code in the Thinner.py you linked, what I see is item age being calculated relative to now:

```python
age = int(now) - timestamp
```

And then this age is later compared to the per-rule "ttl", which seems to be the maximum age in seconds (that would be the per-rule bucket boundary at the 'past' end of the bucket):

```python
for rule in self.rules:
    if age <= rule.ttl:
        block_nr = int(timestamp / rule.period)
        ...
```

I only had this brief look, but it seems like the per-rule buckets have moving boundaries in time, relative to now (as makes sense to me :-)).
I don't mean this in a negative way, but do you have more tests than https://github.com/psy0rz/zfs_autobackup/blob/4873913fa8ff6464373e947a333087498d8b0d53/tests/test_thinner.py?
I forgot to mention this: https://github.com/psy0rz/zfs_autobackup/wiki/Thinner#technical-details-about-the-thinner -- maybe that clears it up? Those are indeed the tests, but I hand-verified those results.
The age is of course relative, but the buckets have an absolute timeframe.
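The two viewpoints above are both valid and can be made concrete with a small sketch (names are illustrative and not taken from Thinner.py): the bucket index of a given timestamp is fixed forever, while whether that timestamp still falls inside a rule's window depends on now.

```python
PERIOD = 86400       # bucket width: one day, in seconds
TTL = 7 * 86400      # rule window: consider items up to one week old


def block_nr(timestamp):
    # Absolute bucket index: depends only on the timestamp, never on 'now'.
    return int(timestamp / PERIOD)


def in_window(timestamp, now):
    # Eligibility is relative: it changes as 'now' advances.
    return (now - timestamp) <= TTL


ts = 1_000_000_000
print(block_nr(ts))                    # → 11574, no matter when we ask
print(in_window(ts, ts + 3 * 86400))   # → True (the item is 3 days old)
print(in_window(ts, ts + 10 * 86400))  # → False (10 days old, aged out)
```

So the item's bucket identity never shifts, but the set of buckets a rule currently covers does move along with now.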
My proposed workaround for the base case seemed to work last night, but it’s uncomfortably dependent on hard-coded assumptions about the timing. Coverity backs up its database locally at 3 AM; I rescheduled the

Here are the relevant crontab pieces:
Until I’m confident that we’re fortunate in the timing, I will be manually deleting the destination of the
@psy0rz this is ambiguous to me. Do you want to say that each bucket has a specific (time) width? (Of course it has :D, so I suppose you're trying to say something else that I don't understand yet.) If the algorithm you have implemented is "keep the newest item per category", then maybe it also suffers from the conceptual difficulty described in #3.
Thanks for linking this. Love the usage of visualization, as it always helps to clarify concepts. (as you can see above I am also a great fan of visualizing things). @FlashSheridan thanks for the thoughts. If you find a workaround that works for you -- great! For me what you write however only confirms that there's work left to do. As long as the retention 'success' depends on the timegaps invocation frequency and/or timegaps invocation time the approach is flawed (as in 'timegaps does not live up to its promise yet'). |
But it’s incredibly ugly, and requires bright red warning text in the internal documentation against others using the tool—which we agreed in a recent meeting (before I found
Not much work for a quick workaround for the base case — an option not to delete the most recent file. (Generalizing it to more complicated cases, #5, might be more difficult, but that’s irrelevant to the base case, which would be easy.)
In the meantime, I’ve switched to python-rotate-backups.
Okay, so a number of years ago we switched from "keep youngest item per category" (first release of timegaps in 2014) to "keep oldest item per category" (in the master branch since 2014, not released) as of the discussion in #3. We were pretty confident that changing the approach made sense, also supported by @lapseofreason, who wrote in 2017:
In 2014 I wrote
Sadly I didn't note down details, but I believed past Jan-Philip and was not confident making another release w/o much more exhaustive and systematic testing. Around Christmas 2017 I did more systematic testing based on certain invariants and indeed I found a conceptual problem with the "accept oldest item in category" approach, also leading to data loss. I never took the time to publicly document this because I wanted to further simplify the test that failed. Didn't happen in more than a year so now I simply note the details of the exact experiment that failed.
With the 'simple' keep-oldest-item approach the following experiment revealed a problem:
Initial condition:
Have 7779000 / 300 = 25931 items spread across a timeline of about 3 months, with a time delta of precisely 5 minutes between adjacent items.

Set the reference time arbitrarily, then generate items:
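The initial condition can be sketched as follows. This is a hypothetical reconstruction using plain datetimes; timegaps' own item objects are not used here, and the reference time is an arbitrary choice of mine:

```python
from datetime import datetime, timedelta

# Arbitrary reference time ('now' for the first run).
reftime = datetime(2017, 12, 1, 12, 0, 0)

# Items spread across a timeline of about 3 months (7779000 seconds), with
# a time delta of precisely 5 minutes (300 seconds) between adjacent items.
delta = timedelta(seconds=300)
items = [reftime - i * delta for i in range(25931)]

print(len(items))                              # → 25931
print((items[0] - items[-1]).total_seconds())  # → 7779000.0
```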
First timegaps run
With these rules (arbitrarily chosen):
Expected item count: 50
After the first run exactly 50 items remained.
Simulate constant influx of items and periodic invocation of timegaps
Simulate the following: in a loop, move reftime into the future and add a new item.

Check for the invariant: after every timegaps run, see if the number of accepted items is still 50.
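The simulation loop can be sketched like this. This is a hypothetical, heavily simplified stand-in: keep_oldest_per_bucket below buckets items by their age relative to reftime and keeps the oldest item per bucket, mimicking the "accept oldest item in category" idea without being timegaps' actual rule machinery; the 48 one-hour buckets and the 5-minute step are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def keep_oldest_per_bucket(items, reftime, bucket_width, n_buckets):
    # Simplified stand-in for "accept oldest item in category": bucket items
    # by their age relative to reftime, keep the oldest item per bucket.
    kept = {}
    for ts in sorted(items, reverse=True):  # newest first
        age = (reftime - ts).total_seconds()
        b = int(age // bucket_width.total_seconds())
        if 0 <= b < n_buckets:
            kept[b] = ts  # older items, seen later, overwrite newer ones
    return sorted(kept.values(), reverse=True)


step = timedelta(minutes=5)
bucket = timedelta(hours=1)
reftime = datetime(2017, 12, 1)

# Dense initial timeline: one item every 5 minutes, covering 3 days.
items = [reftime - i * step for i in range(3 * 24 * 12)]
items = keep_oldest_per_bucket(items, reftime, bucket, n_buckets=48)
initial_count = len(items)  # all 48 buckets are populated initially

# Constant influx plus periodic thinning: every run, move reftime into the
# future by one step, add one new item, thin again, and record every run in
# which the total-item-count invariant is violated.
violations = []
for run in range(1, 289):
    reftime += step
    items = list(items) + [reftime]
    items = keep_oldest_per_bucket(items, reftime, bucket, n_buckets=48)
    if len(items) != initial_count:
        violations.append(run)

print(initial_count)   # → 48
print(violations[:5])  # runs in which the count deviated, if any
```

Even this toy version exhibits the interesting dynamics: kept items slowly age across bucket boundaries, and whether the per-bucket population stays stable depends on how the cascade of boundary crossings interacts with the influx.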
Unexpected failure:
The check for the invariant failed for run 288, where only 49 items remained.