Unexpected loss of items with "accept oldest item in category" approach #7

Open
jgehrcke opened this issue Mar 6, 2019 · 23 comments

jgehrcke commented Mar 6, 2019

Okay, so a number of years ago we switched from "keep youngest item per category" (first release of timegaps in 2014) to "keep oldest item per category" (in the master branch since 2014, not released), following the discussion in #3.

We were pretty confident that changing the approach made sense, also supported by @lapseofreason who wrote in 2017:

“It might be worthwhile to look at how btrbk handles its retention policy. It implements a similar retention policy for btrfs snapshots. They seem to have come across the same problem and have changed the retention policy (digint/btrbk@bd34d9f) to keeping the first backup in each bucket in version 0.23.”

In 2014 I wrote

“Update: In my thought experiments I came across another constellation that might lead to unexpected results also with the accept-oldest-approach. I need to come up with a more systematic approach.”

Sadly, I didn't note down the details, but I believed past Jan-Philip and was not confident about making another release without much more exhaustive and systematic testing. Around Christmas 2017 I did more systematic testing based on certain invariants and indeed found a conceptual problem with the "accept oldest item in category" approach, also leading to data loss. I never took the time to publicly document this because I wanted to further simplify the test that failed. That didn't happen in more than a year, so now I am simply noting the details of the exact experiment that failed.

With the 'simple' keep-oldest-item approach the following experiment revealed a problem:

Initial condition:

Have 25931 items (7779000 / 300 + 1) spread across a timeline of about 3 months (roughly 90 days), with a time delta of precisely 5 minutes (300 seconds) between adjacent items.

Set reference time arbitrarily, generate items:

        # FilterItem is timegaps' item abstraction; reftime 1514000000.0
        # corresponds to 2017-12-23 03:33:20 UTC.
        reftime = 1514000000.0
        modtimes = (reftime - n * 300 for n in range(0, 25930 + 1))
        items = [FilterItem(modtime=t) for t in modtimes]

First timegaps run

With these rules (arbitrarily chosen):

    recent12,hours24,days7,weeks5,months2

Expected item count: 50 (12 + 24 + 7 + 5 + 2)
After the first run exactly 50 items remained.

Simulate constant influx of items and periodic invocation of timegaps

Simulate the following:

  • Every 5 minutes from reftime into the future add a new item.
  • At the same time, after adding the item, run timegaps.

Check for invariant: after every timegaps run, check whether the number of accepted items is still 50 (a minimal sketch of this loop follows below).
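
For reference, a minimal sketch of this simulation loop, reusing reftime, FilterItem and items from the snippet above. run_timegaps is a hypothetical placeholder for one timegaps filtering pass with the rules listed above; it is not the actual test code:

    def run_timegaps(items, reference_time):
        # Hypothetical stand-in for one timegaps filtering pass with the rules
        # recent12,hours24,days7,weeks5,months2 (the real test code is not
        # reproduced here).
        raise NotImplementedError

    EXPECTED_COUNT = 50  # 12 + 24 + 7 + 5 + 2

    accepted = run_timegaps(items, reference_time=reftime)  # first run: 50 items

    for run in range(1, 10000):
        now = reftime + run * 300                   # every 5 minutes ...
        accepted.append(FilterItem(modtime=now))    # ... a new item arrives ...
        accepted = run_timegaps(accepted, reference_time=now)  # ... then run timegaps
        # Invariant: the number of accepted items must still be 50.
        assert len(accepted) == EXPECTED_COUNT, "invariant violated in run %d" % run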

Unexpected failure:

The check for the invariant failed for run 288 where only 49 items remained.


jgehrcke commented Mar 6, 2019

How come the total item count invariant is violated in run 288? What's special about run 288? Which dynamics led up to this loss of an item? For starters, that's absolutely non-obvious.

So, back then, in December 2017, I did some rudimentary visualization to help understand these dynamics. Sharing the plots now, with some brief discussion.

Let's look at this plot:

(figure: item_lost_in_run288)

This shows the distribution of items for the last few individual timegaps runs. The y axis shows an arbitrary unit.

The distribution of items in time directly after run 288 is visualized by the topmost row. We kind of intuitively see that the second item from the left is missing. There seems to be a "hole" in there.

But we also see that another item went missing before, in run 276 (the 4th row from the bottom). It went missing without violating the total item count invariant.

Which events led the item density distribution to change unexpectedly during run 276? Once we can answer this question, we understand the problem much better. One obvious goal from here is to find another, stricter invariant that makes the experiment fail already on run 276 (or earlier).


jgehrcke commented Mar 6, 2019

(This is a lovely example of how a really simple set of rules can result in totally unintuitive dynamics -- the physics of complex systems is full of stuff like that.)


jgehrcke commented Mar 6, 2019

What happened around run 275/276?

Right after run 275 the time differences between three selected items are:

    hours/24 (2017-12-23 02:38:20)
                                    23 hours
      days/1 (2017-12-22 03:38:20)
                                    23 hours
      days/2 (2017-12-21 04:38:20)

It's fair to say that we have at least one item per 24 hours among the three items shown (how come the exact difference is 23 hours? We don't need to answer this here, but I want to note that this again is far from obvious and an interesting result of the dynamics).

After run 276, the time differences are significantly shifted:

    hours/24 (2017-12-23 03:38:20)
                                     1 hour
      days/1 (2017-12-23 02:38:20)
                                    46 hours
      days/2 (2017-12-21 04:38:20)

All individual item categories are still populated. But the time distance between particular neighboring items shifted so that it became larger than "what we want". Now we don't have one item per 24 hours anymore.

From here, it seems, we can formulate one more helpful invariant, for testing or even for the filtering algorithm itself.


jgehrcke commented Mar 6, 2019

One invariant that should be maintained across timegaps runs in this type of experiment: the filtered items, sorted by time from newest to oldest, must have a time difference between adjacent items that only grows towards the past, and never shrinks (is monotonic).

In code:

import numpy as np
from itertools import pairwise  # Python 3.10+; older versions can use the itertools recipe


def np_is_ascending(array):
    # True if the values never decrease from one element to the next.
    return np.all(array[:-1] <= array[1:])


def sort_items(items):
    # Sort items from newest (first) to oldest (last).
    return sorted(items, key=lambda i: i.modtime, reverse=True)


def inspect_item_deltas(items):
    # Sort items from newest (first) to oldest (last).
    # (that is not an API guarantee so far, but maybe it should be one)
    # Calculate the (positive) time difference between adjacent items.
    sorted_items = sort_items(items)
    deltas = [a.modtime - b.modtime for a, b in pairwise(sorted_items)]

    # Test for invariant: the filtered items, sorted by time from newest to
    # oldest, must have a time difference between adjacent items that only
    # grows towards the past, and never shrinks (is monotonic).
    assert np_is_ascending(np.array(deltas))
    return deltas

I tried this with the example scenario described above; it indeed failed for run 276. That's a better / stricter invariant than the total item count invariant.
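
Hooked into the simulation loop sketched in the issue description (with the same hypothetical run_timegaps placeholder), the per-run check would look roughly like this:

    accepted = run_timegaps(accepted, reference_time=now)
    # Stricter invariant: adjacent time deltas must grow monotonically
    # towards the past. For the scenario above this already fails in run 276.
    inspect_item_deltas(accepted)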

@jgehrcke

New visualization technique showing the evolution of individual items across filter runs (by keeping color constant for the same item across filter runs):
(figure: item_lost_new_visualization2019)

Every red cross shows a rejected item. All of the rejections are expected, except for one of them.


psy0rz commented Oct 21, 2019

For my project (zfs_autobackup) I started experimenting with a progressive thinning algorithm (https://github.com/psy0rz/zfs_autobackup/blob/bf985998b38a23d3dd6263f89d177b2310493c54/zfs_autobackup#L832).

Instead of unit tests, I opted to test it using "fuzzy" timestamps at random intervals, to prevent exactly the problems you had. (This was before I found your project.)

One thing I realized: DON'T use relative time to define the buckets! Because then all the buckets "shift" all the time, and you will get all kinds of weird and wonderful behaviour.

Instead use absolute epoch time to identify the buckets.

e.g.: bucket_nr=int(snapshot_time_in_seconds_since_epoch/period)

This way a snapshot always stays in the same bucket, nothing is shifting. This gave very good results.
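
A minimal sketch of this bucket assignment (my illustration of the idea, not the zfs_autobackup code; names and the example timestamps are made up): the bucket id depends only on the snapshot's epoch timestamp and the bucket period, so an item never changes buckets as "now" advances.

    from collections import defaultdict

    DAY = 24 * 3600  # bucket period: one day

    def bucket_nr(timestamp, period=DAY):
        # Absolute bucket id: computed from the epoch timestamp alone,
        # independent of "now", so an item never moves to another bucket.
        return int(timestamp / period)

    # Group some snapshot creation times (epoch seconds) by bucket.
    snapshot_times = [1571600000, 1571610000, 1571700000, 1571790000]
    buckets = defaultdict(list)
    for t in snapshot_times:
        buckets[bucket_nr(t)].append(t)

    # Thinning then means keeping one item per populated bucket
    # (which item to keep per bucket is a separate decision).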

The reason I went looking at other solutions is that I have this problem:

Example: the user chooses to keep only one yearly snapshot. That snapshot will be removed as soon as it is 365 days old. At that point the "next oldest" will become the yearly snapshot. This one, however, can be quite young (if there are other rules in use; otherwise it will be the snapshot of right now).

So I have some ideas for how to solve it, but I was looking for a clever way, or a library that has already solved these issues. I think looking at your approach helped me get a better idea for a solution.


psy0rz commented Oct 22, 2019

OK, I completed the thinning classes along with some test code. Maybe someone can use it: https://github.com/psy0rz/zfs_autobackup/blob/34d0c5d67bd483239ad625265226327b1d7a417c/zfs_autobackup#L82

(It's a permalink; also check the latest version in the future. :)


dvhirst commented Jan 25, 2020

Is there a rule-set for the existing implementation that will safely thin out a set of backups over a multi-year period, resulting in stable use of storage, e.g., in the cloud? The rule-set should allow for keeping 1 backup that is a year old, 12 backups that are less than 1 year old (one per month, up to 1 year), and the 30 most recent days. I had tried to set up a sequence like this in 2018 and ran into a failure, hence my prior communications. I see no new release of the utility, so I presume that the flaw mentioned in this post still exists. I am still interested in making use of this otherwise highly useful utility. Thanks. Don Hirst


dvhirst commented Jul 2, 2020

Hello, I'm still following this thread. Would be extremely gratified to see an update that fixes the problem. For now, I am using this to manage a simple 30 day thinning, but would really prefer to use my initial pattern, which doesn't work with the current implementation. Thanks. Don Hirst


dvhirst commented Jan 5, 2021

2021 01 04; 21:04 PDT: Just checking in to see if there is any chance the error described above will be fixed with a new release. Would greatly appreciate such action!

@jgehrcke

@dvhirst thank you for your endless interest and patience! :) I really don't know if I will ever get back to this in this current life, haha. I'd better not make any promises. The passion is there -- I would love to work on this again! Let's see.


dvhirst commented Sep 13, 2021 via email

@FlashSheridan

I’m trying to find a workaround for this in what I believe should be your base case: A set of daily backups which need thinning, but where deleting the most recent is unacceptable. In normal testing, since the daily backup is at 3 AM and I don’t get up that early, there doesn’t seem to be a way to do this. But if I schedule the cron job within an hour of the backup completing, a test with a bogus file suggests that the most recent might be preserved with a pattern containing recent1,hours1,. Does anyone have a better suggestion, or am I missing something?

I’ve noticed that there are a few forks of this project; do any of them implement protection of the last file? Should I install a version of timegaps from before the change? Could I just alter the source on our disk so that it never deletes the most recent item? (Which, in my opinion, should be the default.)

@jgehrcke

DON'T use relative time to define the buckets! Because then all the buckets "shift" all the time, and you will get all kinds of weird and wonderful behaviour.

I love the provocative thought here, btw, @psy0rz. But I have to argue that it makes sense to thin out items based on age, and age is by definition a duration relative to now.


psy0rz commented May 18, 2022

Yeah, but you look at the age of a bucket. And a bucket should always be at the same point in time, using the method I described.

I wrote quite a few regression tests for this, and many people have been using it successfully in zfs-autobackup for years now. So it works, without quirks :)


jgehrcke commented May 18, 2022

a bucket should always be at the same point in time

Thanks for the discussion! Hm. So. These buckets would not have a predictable relationship to the concept of 'age'. However, the user only thinks in terms of the age of items. With buckets defined absolutely in time, the user would need to provide rules based on absolute time, too (they would say "keep one item per year for the years 2000-2022" as opposed to "keep one item per year for the last 22 years", and then their rules would become out-of-date as time progresses).

Looking at the code in the Thinner.py you linked, what I see is item age being calculated relative to now:

            age = int(now) - timestamp

And then this age is later compared against the per-rule "ttl", which seems to be the maximum age in seconds (that would be the per-rule bucket boundary at the 'past' end of the bucket):

            for rule in self.rules:
                if age <= rule.ttl:
                    block_nr = int(timestamp / rule.period)
...

I only had a brief look, but it seems like the per-rule buckets have moving boundaries in time, relative to now (which is how it makes sense to me :-)).
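
To make my reading concrete, here is a small sketch (my own illustration, not the Thinner.py code; names and values are made up): the bucket id is computed from the absolute timestamp, but whether a rule still covers an item depends on the item's age relative to now, so each rule's window of buckets moves as time progresses.

    PERIOD = 24 * 3600    # bucket width of a hypothetical rule: one day
    TTL = 30 * 24 * 3600  # the rule covers items up to 30 days old

    def block_nr(timestamp):
        # Absolute: a given item maps to the same bucket id forever.
        return int(timestamp / PERIOD)

    def rule_covers(timestamp, now):
        # Relative: whether this rule (still) applies to the item
        # depends on the item's age, i.e. on "now".
        return (now - timestamp) <= TTL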

I wrote quite a few regression tests for this

I don't mean this in a negative way, but do you have more tests than https://github.com/psy0rz/zfs_autobackup/blob/4873913fa8ff6464373e947a333087498d8b0d53/tests/test_thinner.py?


psy0rz commented May 18, 2022

I forgot to mention this: https://github.com/psy0rz/zfs_autobackup/wiki/Thinner#technical-details-about-the-thinner

Maybe that clears it up?

Those are indeed the tests, but I hand-verified those results.


psy0rz commented May 18, 2022

The age is of course relative, but the buckets have an absolute timeframe.

@FlashSheridan

My proposed workaround for the base case seemed to work last night, but it’s uncomfortably dependent on hard-coded assumptions about the timing. Coverity backs up its database locally at 3 AM; I rescheduled the rsync backup of the backups (to a larger remote disk) to run at 3:30 AM, and then run timegaps at 3:50 AM. Fortunately Coverity seems to take only ten minutes for its backup, and rsync about a minute, so this strategy should be fine for us for quite a while. I suspect our long-term solution is a timegaps fork synced to a change before the user interface mistake fixing the design problem.

Here are the relevant crontab pieces:

30 3 * * * /usr/bin/rsync -rz --times --recursive  …
50 3 * * * /usr/local/bin/timegaps --move  … recent3,hours3,days5,weeks8,months12,years42 …/CIM.*.backup

Until I’m confident that we’re fortunate in the timing, I will be manually deleting the destination of the --move, but of course I hope eventually to switch to --delete.


jgehrcke commented May 18, 2022

the buckets have an absolute timeframe

@psy0rz this is ambiguous to me. Do you mean that each bucket has a specific (time) width? (Of course it has :D, so I suppose you're trying to say something else that I don't understand yet.)

If the algorithm you have implemented is "keep the newest item per category" then maybe it also suffers from the conceptual difficulty described in #3.

I forgot to mention this: https://github.com/psy0rz/zfs_autobackup/wiki/Thinner#technical-details-about-the-thinner

Thanks for linking this. I love the use of visualization, as it always helps to clarify concepts. (As you can see above, I am also a great fan of visualizing things.)

@FlashSheridan thanks for the thoughts. If you find a workaround that works for you -- great! For me, however, what you write only confirms that there's work left to do. As long as the retention 'success' depends on the timegaps invocation frequency and/or invocation time, the approach is flawed (as in 'timegaps does not live up to its promise yet').

@FlashSheridan

@FlashSheridan thanks for the thoughts. If you find a workaround that works for you -- great!

But it’s incredibly ugly, and requires bright red warning text in the internal documentation against others using the tool -- even though we agreed in a recent meeting (before I found timegaps) that having others use it would be highly desirable.

For me what you write however only confirms that there's work left to do.

Not much work for a quick workaround for the base case — an option not to delete the most recent file. (Generalizing it to more complicated cases, #5, might be more difficult, but that’s irrelevant to the base case, which would be easy.)

@FlashSheridan

In the meantime, I’ve switched to python-rotate-backups.
