High Memory Usage/ LRU cache size is not being respected #12579

Open
zaidoon1 opened this issue Apr 24, 2024 · 40 comments

@zaidoon1 (Contributor)

options file:
OPTIONS.txt

I've set the LRU cache to 1.5 GB for the "url" cf. However, all of a sudden, the service that runs RocksDB hit the max memory limit I allocated for it, and I can see that the LRU cache for the "url" cf hit that limit:

Screenshot 2024-04-24 at 11 13 44 AM

This also caused the service to max out its CPU usage (likely because of backpressure).

flamegraph:

cachedb service-2024-04-23T23_16_26+00_00-127m51

@zaidoon1 (Contributor Author)

While in the above instance I at least had something "obvious" to blame, this happened again on a different machine. Except this time it doesn't look like we exceeded the configured LRU cache size by much, and yet RocksDB still used up all the memory:

Screenshot 2024-04-25 at 1 21 20 AM

@zaidoon1 zaidoon1 changed the title LRU cache size is not being respected High Memory Usage/ LRU cache size is not being respected Apr 25, 2024
@zaidoon1 (Contributor Author)

@ajkr any idea what could have happened here in both cases? I guess the easiest one to answer is how/why rocksdb went above the allocated LRU cache size? Unfortunately, I don't have any other LOGs to share because of the issues described here: #12584 (nothing showed up in the WARN level logs so I don't know what was happening at the time)

@zaidoon1 (Contributor Author)

zaidoon1 commented May 1, 2024

I was thinking of using strict LRU capacity, but it looks like reads (and writes?) will fail once the capacity is hit, which is not what I expected. Why don't we evict from the cache instead of failing new reads?

@zaidoon1 (Contributor Author)

zaidoon1 commented May 2, 2024

Here is more data:
Screenshot 2024-05-02 at 10 29 12 AM

It looks like it happens when we have lots of tombstones. This appears to match what was happening in #2952, although the issue there was due to some compaction bug. I'm wondering if there is another compaction bug at play here.

@ajkr (Contributor)

ajkr commented May 2, 2024

What allocator are you using? RocksDB tends to perform poorly with glibc malloc, and better with allocators like jemalloc, which is what we use internally. Reference: https://smalldatum.blogspot.com/2015/10/myrocks-versus-allocators-glibc.html

Why don't we evict from cache instead of failing new reads?

We evict from cache as long as we can find clean, unpinned entries to evict. Block cache only contains dirty entries when WriteBufferManager is configured to charge its memory to block cache, and only contains pinned entries according to the BlockBasedTableOptions.

That said, we try to evict from cache even if you don't set strict LRU capacity. That setting is to let you choose the behavior in cases where there is nothing evictable - fail the operation (strict), or allocate more memory (non-strict).
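For reference, a minimal C++ sketch of that choice, assuming a 1.5 GiB cache as in the options file above (the C/Rust bindings expose the same knobs):

    #include <memory>

    #include "rocksdb/cache.h"
    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    // A minimal sketch; the capacity is illustrative. The third argument is
    // strict_capacity_limit: false (the default) lets RocksDB allocate beyond
    // capacity when nothing is evictable; true fails the operation instead.
    rocksdb::Options MakeOptionsWithLruCache() {
      std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(
          static_cast<size_t>(1536) << 20 /* capacity: 1.5 GiB */,
          -1 /* num_shard_bits: use default */,
          false /* strict_capacity_limit */);

      rocksdb::BlockBasedTableOptions table_options;
      table_options.block_cache = cache;

      rocksdb::Options options;
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_options));
      return options;
    }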

@zaidoon1 (Contributor Author)

zaidoon1 commented May 2, 2024

What allocator are you using

I'm using jemalloc for the allocator (I've double-checked this).

In the last instance this happened (screenshot above), block cache was not maxing out beyond what is configured, so I don't think that's the issue. I started seeing this issue when I enabled the part of the system that does the "is there an index that matches prefix x" check, which is a prefix seek that only looks at the first kv returned. From the last graph I posted, it also appears to happen when there are a lot of tombstones, so the seek + tombstones combination is very odd/suspect to me (similar to the problem reported in the RocksDB ticket I linked to). Right now, I'm doing a load test, so I'm sending 5K requests with unique prefixes, and the prefixes are guaranteed not to find any matching kv.

@ajkr (Contributor)

ajkr commented May 3, 2024

Thanks for the details. Are the 5K requests in parallel? Does your memory budget allow indexes to be pinned in block cache (unpartitioned_pinning = PinningTier::kAll, or less preferably but still good, unpartitioned_pinning = PinningTier::kFlushedAndSimilar)? From the CPU profile it shows most of the work is decompressing index blocks, and that work might even be redundant in which case the memory consumed by index data would be amplified.
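For reference, a minimal C++ sketch of the pinning option mentioned above (the C/Rust bindings had not yet exposed these fields at the time, as discussed later in the thread):

    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    // Sketch only: keep index/filter blocks in block cache but pin them so
    // they are never evicted and never redundantly re-read/re-decompressed.
    rocksdb::Options MakePinnedMetadataOptions() {
      rocksdb::BlockBasedTableOptions table_options;
      table_options.cache_index_and_filter_blocks = true;

      // kAll pins metadata blocks of all files; kFlushedAndSimilar pins only
      // those of L0 and similarly small files (less memory, less protection).
      table_options.metadata_cache_options.unpartitioned_pinning =
          rocksdb::PinningTier::kAll;

      rocksdb::Options options;
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_options));
      return options;
    }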

@zaidoon1 (Contributor Author)

zaidoon1 commented May 3, 2024

  1. Are the 5K requests in parallel?
     Yes, they are.
  2. I've enabled cache_index_and_filter_blocks; everything else is whatever the default is (I'll need to check the default for unpartitioned_pinning).
  3. It just happened again, and here is what it looked like:
Screenshot 2024-05-02 at 9 07 51 PM

cachedb service-2024-05-03T01_11_22+00_00-378m6

and here is a pprof of what is happening to make it easier to see what rocksdb is doing

cachedb-2024-05-03T01_13_09+00_00-378m6 inuse_space

@zaidoon1 (Contributor Author)

zaidoon1 commented May 3, 2024

Also, here are the DB options I have configured:
db_options.txt

@ajkr (Contributor)

ajkr commented May 3, 2024

and that work might even be redundant in which case the memory consumed by index data would be amplified.

For more details on this problem, see the stats added in #6681. It looks like you have statistics enabled so you might be able to check those stats to confirm or rule out whether that is the problem.

If it is the problem, unfortunately I don't think we have a good solution yet. That is why I was wondering if you have enough memory to pin the indexes so they don't risk thrashing. Changing cache_index_and_filter_blocks to false could have a similar effect.
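For reference, a hedged sketch of reading those stats (assuming DBOptions::statistics is set, as it appears to be here; the tickers are the redundant-add counters from that PR):

    #include <iostream>
    #include <memory>

    #include "rocksdb/options.h"
    #include "rocksdb/statistics.h"

    // Sketch: redundant adds count blocks that were read and inserted even
    // though an identical copy was already in block cache, i.e. wasted work
    // and temporarily duplicated memory.
    void DumpRedundantBlockCacheAdds(const rocksdb::Options& options) {
      const std::shared_ptr<rocksdb::Statistics>& stats = options.statistics;
      if (!stats) {
        return;  // statistics not enabled
      }
      std::cout << "redundant adds (all blocks):    "
                << stats->getTickerCount(rocksdb::BLOCK_CACHE_ADD_REDUNDANT)
                << "\nredundant adds (index blocks):  "
                << stats->getTickerCount(rocksdb::BLOCK_CACHE_INDEX_ADD_REDUNDANT)
                << "\nredundant adds (filter blocks): "
                << stats->getTickerCount(rocksdb::BLOCK_CACHE_FILTER_ADD_REDUNDANT)
                << std::endl;
    }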

@zaidoon1 (Contributor Author)

zaidoon1 commented May 3, 2024

OK, this is good to know, I'll definitely investigate this part. I would like to confirm: if we assume that's the problem, then my options are:

  1. set cache_index_and_filter_blocks to false
  2. keep cache_index_and_filter_blocks set to true and also set unpartitioned_pinning = PinningTier::kAll? Or should I set cache_index_and_filter_blocks to false and set unpartitioned_pinning?

I think the main reason I set cache_index_and_filter_blocks to true is to cap/control memory usage (but that was also when I thought I had jemalloc enabled when it wasn't, so my issues at the time could have been different).

That is why I was wondering if you have enough memory to pin the indexes so they don't risk thrashing

Regarding this part, is there a way/formula to know how much memory it will cost to pin the indexes? Or is this a try-and-find-out kind of thing?

Is it any different/better to use WriteBufferManager to control memory usage vs. cache_index_and_filter_blocks?

@ajkr (Contributor)

ajkr commented May 4, 2024

regarding this part, is there a way/formula to know how much memory it will cost to pin the indexes? Or is this a try and find out kind of thing?

There is a property: TableProperties::index_size. It's available via DB APIs like GetPropertiesOfAllTables(), or externally to the DB via sst_dump on the DB dir. It isn't exactly the same as the memory cost of holding an index block in memory but I think it should give an upper bound.
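For reference, a minimal C++ sketch of that estimate (a hypothetical helper that sums TableProperties::index_size over all live SST files of a column family):

    #include <cstdint>
    #include <iostream>

    #include "rocksdb/db.h"
    #include "rocksdb/table_properties.h"

    // Sketch: a rough upper bound on the memory needed to pin every index block.
    uint64_t TotalIndexSize(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
      rocksdb::TablePropertiesCollection props;
      rocksdb::Status s = db->GetPropertiesOfAllTables(cf, &props);
      if (!s.ok()) {
        std::cerr << "GetPropertiesOfAllTables failed: " << s.ToString() << "\n";
        return 0;
      }
      uint64_t total = 0;
      for (const auto& file_and_props : props) {
        total += file_and_props.second->index_size;  // index bytes of one SST file
      }
      return total;
    }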

@zaidoon1 (Contributor Author)

zaidoon1 commented May 4, 2024

Cool, I'll check this out. Just to double-check, is unpartitioned_pinning = PinningTier::kAll preferred over setting cache_index_and_filter_blocks to false?

@ajkr (Contributor)

ajkr commented May 4, 2024

Cool, I'll check this out. Just to double-check, is unpartitioned_pinning = PinningTier::kAll preferred over setting cache_index_and_filter_blocks to false?

It is preferable if you want to use our block cache capacity setting for limiting RocksDB's total memory usage.

  • cache_index_and_filter_blocks=true with unpartitioned_pinning = PinningTier::kAll: index block memory usage counts towards block cache capacity. Pinning prevents potential thrashing.
  • cache_index_and_filter_blocks=false: Index block memory usage counts toward table reader memory, which is not counted towards block cache capacity by default. Potential thrashing is still prevented because they are preloaded and non-evictable in table reader memory.

@zaidoon1 (Contributor Author)

zaidoon1 commented May 4, 2024

Great! Thanks for confirming. Once the C API changes land, I'll experiment with this and report back.

@zaidoon1 (Contributor Author)

zaidoon1 commented May 5, 2024

A few other questions that just came to my mind:

  1. Right now, I'm using a prefix extractor with a prefix bloom (ribbon filter, 10.0 ratio). The 5K prefix lookups per second are for kvs that don't exist; even the prefix doesn't exist. I expect the ribbon filter to detect this so that RocksDB just skips doing any work. Given this, would we still have the thrashing issue?
  2. Would https://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html help in my case? (See the config sketch after this list.)
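For context on item 2, a hedged sketch of the partitioned index/filter setup from that blog post, combined with the ribbon filter already in use here (C++ API; the parameter values are illustrative, not the reporter's actual config):

    #include "rocksdb/filter_policy.h"
    #include "rocksdb/options.h"
    #include "rocksdb/table.h"

    // Sketch: partitioned index + partitioned (ribbon) filters, with the small
    // top-level index of each kept pinned in block cache.
    rocksdb::Options MakePartitionedMetadataOptions() {
      rocksdb::BlockBasedTableOptions table_options;
      table_options.filter_policy.reset(
          rocksdb::NewRibbonFilterPolicy(10.0 /* bloom_equivalent_bits_per_key */));
      table_options.index_type =
          rocksdb::BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
      table_options.partition_filters = true;
      table_options.cache_index_and_filter_blocks = true;
      table_options.pin_top_level_index_and_filter = true;  // default is true

      rocksdb::Options options;
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_options));
      return options;
    }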

@ajkr (Contributor)

ajkr commented May 6, 2024

Yes, prefix filter should prevent thrashing for index block lookups. I didn't notice earlier that it's already enabled. Then, it's surprising that BinarySearchIndexReader::NewIterator() is consuming most of the CPU. Do you set ReadOptions::total_order_seek for the iterator? That can defeat the prefix filter optimizations.

@zaidoon1 (Contributor Author)

zaidoon1 commented May 7, 2024

Do you set ReadOptions::total_order_seek for the iterator

I don't, unless that is set by default in rocksdb under the hood? In the rust library, I call prefix_iterator_cf which just sets prefix_same_as_start and then does the lookup: https://github.com/zaidoon1/rust-rocksdb/blob/567825463480b75f733b73f01c1bd05990aea5b9/src/db.rs#L1438-L1446

Maybe I should start by looking at the ribbon filter metrics? Is there a specific metric I should be looking at to see if things are working as they should?

@zaidoon1 (Contributor Author)

zaidoon1 commented May 7, 2024

I found the following:

rocksdb.bloom.filter.useful COUNT
rocksdb.bloom.filter.full.positive COUNT
rocksdb.bloom.filter.full.true.positive COUNT
rocksdb.bloom.filter.prefix.checked COUNT
rocksdb.bloom.filter.prefix.useful COUNT
rocksdb.bloom.filter.prefix.true.positive COUNT

I couldn't find anything specific to the ribbon filter, so my guess is that the "bloom" filter stats are also populated for the ribbon filter. If so, which would be the most useful one to track for this issue?

or maybe seek stats:

// Statistics on iterator Seek() (and variants) for each sorted run. I.e. a
// single user Seek() can result in many sorted run Seek()s.
// The stats are split between last level and non-last level.
// Filtered: a filter such as prefix Bloom filter indicate the Seek() would
// not find anything relevant, so avoided a likely access to data+index
// blocks.
LAST_LEVEL_SEEK_FILTERED,
// Filter match: a filter such as prefix Bloom filter was queried but did
// not filter out the seek.
LAST_LEVEL_SEEK_FILTER_MATCH,
// At least one data block was accessed for a Seek() (or variant) on a
// sorted run.
LAST_LEVEL_SEEK_DATA,
// At least one value() was accessed for the seek (suggesting it was useful),
// and no filter such as prefix Bloom was queried.
LAST_LEVEL_SEEK_DATA_USEFUL_NO_FILTER,
// At least one value() was accessed for the seek (suggesting it was useful),
// after querying a filter such as prefix Bloom.
LAST_LEVEL_SEEK_DATA_USEFUL_FILTER_MATCH,
// The same set of stats, but for non-last level seeks.
NON_LAST_LEVEL_SEEK_FILTERED,
NON_LAST_LEVEL_SEEK_FILTER_MATCH,
NON_LAST_LEVEL_SEEK_DATA,
NON_LAST_LEVEL_SEEK_DATA_USEFUL_NO_FILTER,
NON_LAST_LEVEL_SEEK_DATA_USEFUL_FILTER_MATCH,
Not sure which of these would help me figure out what I need.

@ajkr (Contributor)

ajkr commented May 8, 2024

Looks like the *LEVEL_SEEK* statistics are for iterators. The *FILTERED vs. *DATA can tell you the filtering rate. If you have a lot of other iterations happening, it could be hard to attribute the metric values to prefix checks vs. other things. Though if *LEVEL_SEEK*FILTERED stats are zero, that'd tell us a lot.

If you want to measure an operation's stats in isolation, we have PerfContext instead of Statistics for that: https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context
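For reference, a minimal sketch of measuring one prefix check in isolation with PerfContext/IOStatsContext (C++ API; the function and variable names here are illustrative):

    #include <iostream>
    #include <memory>

    #include "rocksdb/db.h"
    #include "rocksdb/iostats_context.h"
    #include "rocksdb/perf_context.h"
    #include "rocksdb/perf_level.h"

    // Sketch: reset the per-thread counters, run a single prefix seek, then
    // dump what that one operation did (block reads, filter hits, etc.).
    void ProfileOnePrefixCheck(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                               const rocksdb::Slice& prefix) {
      rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTimeExceptForMutex);
      rocksdb::get_perf_context()->Reset();
      rocksdb::get_iostats_context()->Reset();

      rocksdb::ReadOptions ro;
      ro.prefix_same_as_start = true;
      std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro, cf));
      it->Seek(prefix);  // only the first kv matters for the existence check

      std::cout << rocksdb::get_perf_context()->ToString() << "\n"
                << rocksdb::get_iostats_context()->ToString() << std::endl;
      rocksdb::SetPerfLevel(rocksdb::PerfLevel::kDisable);
    }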

@zaidoon1 (Contributor Author)

zaidoon1 commented May 8, 2024

Something I want to make sure of: are the *LEVEL_SEEK* statistics DB-level stats, or are they per CF? As far as iterators go, I only use them when doing the prefix check. Other operations use MultiGet.

@ajkr (Contributor)

ajkr commented May 8, 2024

It's as wide a scope as the Statistics object, which is at minimum one DB since it's configured in DBOptions::statistics. It could be multiple DBs if you configure multiple DBs to use the same object.
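In code terms, a minimal sketch of that scope (C++ API; the function name is illustrative):

    #include <memory>

    #include "rocksdb/options.h"
    #include "rocksdb/statistics.h"

    // Sketch: one Statistics object per DB covers all of that DB's column
    // families; sharing the same object between DBs merges their counters.
    void ConfigureStatistics(rocksdb::Options& db1_options,
                             rocksdb::Options& db2_options) {
      std::shared_ptr<rocksdb::Statistics> stats = rocksdb::CreateDBStatistics();
      db1_options.statistics = stats;  // counts DB1 activity, all CFs
      db2_options.statistics = stats;  // same object: DB2 activity is merged in
    }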

@zaidoon1 (Contributor Author)

zaidoon1 commented May 8, 2024

Great! Thanks for confirming. I'm going to track:

LAST_LEVEL_SEEK_FILTERED,
LAST_LEVEL_SEEK_DATA,
NON_LAST_LEVEL_SEEK_FILTERED,
NON_LAST_LEVEL_SEEK_DATA,

and will report back what I see
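For reference, a rough sketch of turning those four tickers into a filtering rate (C++ API; this is the FILTERED vs. DATA comparison described above, not an official formula):

    #include <cstdint>

    #include "rocksdb/statistics.h"

    // Sketch: fraction of sorted-run seeks that the prefix (ribbon) filter was
    // able to skip entirely. Close to 1.0 means the filter is doing its job.
    double SeekFilteringRate(const rocksdb::Statistics& stats) {
      const uint64_t filtered =
          stats.getTickerCount(rocksdb::LAST_LEVEL_SEEK_FILTERED) +
          stats.getTickerCount(rocksdb::NON_LAST_LEVEL_SEEK_FILTERED);
      const uint64_t went_to_data =
          stats.getTickerCount(rocksdb::LAST_LEVEL_SEEK_DATA) +
          stats.getTickerCount(rocksdb::NON_LAST_LEVEL_SEEK_DATA);
      const uint64_t total = filtered + went_to_data;
      return total == 0 ? 0.0 : static_cast<double>(filtered) / total;
    }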

@zaidoon1 (Contributor Author)

zaidoon1 commented May 13, 2024

So while the issue hasn't happened again (I'm still waiting for it), I think I've narrowed down the part of the system that causes this.

I initially told you that we do a prefix check to see whether there exists a key in RocksDB that starts with some prefix x provided by an external system. Most of the time there is none, so the bloom filter does the job. The part I forgot to mention, which is very relevant, is that when a key does exist, we store this prefix key in the "queue" cf. A background service then iterates over each key in the queue cf and fetches all matching kvs by prefix. After adding metrics, we see that one prefix can match 3M kvs. Once we find those keys, we delete the file referenced by the kvs and eventually delete the kvs themselves.

The call we make is basically:

    let kv_iterator = if starting_index_key.is_empty() {
        self.db.prefix_iterator_cf(&cf_handle, prefix)
    } else {
        self.db.prefix_iterator_cf(&cf_handle, starting_index_key)
    };

We get keys in batches of 1K to process them; that's why we either start from the first key that matches the prefix or continue from where we last left off.

Given this background, I'm thinking this is definitely causing the thrashing issue, as we are iterating over millions of keys. Since the intention is to then delete those keys, maybe I should disable caching while iterating over them so that RocksDB doesn't try to cache those lookups, as they are useless?

To be specific, I'm thinking of setting:

rocksdb_readoptions_set_fill_cache

to false before iterating. What do you think, @ajkr?
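For reference, a minimal sketch of that idea in C++ terms (rocksdb_readoptions_set_fill_cache in the C API, and the corresponding rust-rocksdb setter, toggle the same ReadOptions::fill_cache field; the function name is illustrative):

    #include <memory>

    #include "rocksdb/db.h"

    // Sketch: an iterator for the scan-and-delete pass that does not populate
    // block cache with blocks that will never be read again.
    std::unique_ptr<rocksdb::Iterator> NewScanAndDeleteIterator(
        rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
      rocksdb::ReadOptions ro;
      ro.prefix_same_as_start = true;  // same behavior as prefix_iterator_cf
      ro.fill_cache = false;           // don't cache blocks read by this scan
      return std::unique_ptr<rocksdb::Iterator>(db->NewIterator(ro, cf));
    }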

@dfa1 (Contributor)

dfa1 commented May 15, 2024

We have the same issue: recently we moved from rocksdb 7.x + CLOCK_CACHE to rocksdb 8.x + LRU_CACHE. The limit (3 GB) is not respected at all: the process keeps allocating memory until it gets OOM-killed.

Our setup is: rocksdbjni 8.x + range scan workload

@dfa1 (Contributor)

dfa1 commented May 15, 2024

@zaidoon1 FYI: switching to HYPER_CLOCK_CACHE fixed the memory issue in our case. Maybe it is a valid workaround for you too (but note we were using CLOCK_CACHE with rocksdb 7.x).

LRU_CACHE
image

HYPER_CLOCK_CACHE
image

@zaidoon1 (Contributor Author)

Thanks! That's interesting to know, maybe the clock cache is more resistant to thrashing? @ajkr any idea? Or it could be that your issue is different from mine. In my case, I'm pretty sure it's the iterators that are reading 1M+ kvs, and disabling caching for those should help with that.

In general, I plan on switching to hyper clock cache once the auto-tuning parameter of hyper clock cache

// * Requires an extra tuning parameter: see estimated_entry_charge below.

is marked production ready. I'd rather not have to play around with parameters too much and would prefer to let RocksDB figure it out.

@dfa1 (Contributor)

dfa1 commented May 16, 2024

@zaidoon1 you're welcome! :-) Anyway, we are dealing with smaller range scans (up to a few thousand keys) with caching enabled, and 0 as estimated_entry_charge is working fine (maybe it could be better in 9.x).
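For reference, a hedged sketch of that workaround (C++ API; the capacity parameter is illustrative, and estimated_entry_charge = 0 reportedly lets RocksDB choose the value itself, as described above):

    #include <memory>

    #include "rocksdb/cache.h"

    // Sketch: a HyperClockCache sized like the LRU cache it replaces.
    std::shared_ptr<rocksdb::Cache> MakeHyperClockCache(size_t capacity_bytes) {
      rocksdb::HyperClockCacheOptions opts(
          capacity_bytes, /* estimated_entry_charge= */ 0);
      return opts.MakeSharedCache();
    }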

@ajkr (Contributor)

ajkr commented May 16, 2024

to be specific, i'm thinking of setting:

rocksdb_readoptions_set_fill_cache

to false before iterating. What do you think @ajkr ?

Sorry I'm not sure. It sounds like a generally good idea to use fill_cache=false for a scan-and-delete workload because the scanned blocks will not be useful for future user reads.

But, I am not sure how much it will help with this specific problem. The CPU profile is mostly showing index blocks, which leads me to think there is something special about those. If you are pinning index blocks in cache, the index blocks will already be cached so the logic related to fill_cache is bypassed.

@zaidoon1 (Contributor Author)

zaidoon1 commented May 16, 2024

If you are pinning index blocks in cache, the index blocks will already be cached so the logic related to fill_cache is bypassed

I don't think I'm pinning them (I have not set unpartitioned_pinning = PinningTier::kAll since the C API change hasn't landed in a RocksDB release yet, and everything else is default, which I don't think enables pinning?). I'm setting cache_index_and_filter_blocks to true, which should just cache them, but as I scan and fill the cache with useless data, I expect to be kicking things out, including index and filter blocks, so this issue would still be seen? Or am I misunderstanding how this works?

@ajkr (Contributor)

ajkr commented May 16, 2024

Oh OK, if you're not pinning them, then fill_cache will make a difference on index blocks. Whether it's a positive or negative difference, I don't know. The reason it could be negative is that if 5k iterators simultaneously load the same index block without putting it in block cache, the memory usage could be a bit higher than if they had all accessed a shared copy of that block in block cache. But let's see.

@zaidoon1 (Contributor Author)

the memory usage could be a bit higher than if they had accessed a shared copy of that block in block cache

Would this increased memory exceed the allocated memory? I assume this is not something that is capped by block cache since the reads are not getting cached?

@ajkr (Contributor)

ajkr commented May 17, 2024

Right, it won't be capped by block cache.

@zaidoon1 (Contributor Author)

zaidoon1 commented May 21, 2024

So this happened again, which means the fill_cache idea I had didn't work. Here are the values of the metrics I added from #12579 (comment), as well as other metrics/state of things:

Screenshot 2024-05-21 at 7 35 56 PM

Notice that the filtering is doing a great job and pretty much filtering out all seeks.

Screenshot 2024-05-21 at 7 35 47 PM Screenshot 2024-05-21 at 7 35 35 PM Screenshot 2024-05-21 at 7 34 58 PM

At this point, I don't have any more ideas. Are there any other metrics/stats you suggest I add? The next step, I think, is waiting for the next RocksDB release that will let me set, via the C API:

cache_index_and_filter_blocks=true with unpartitioned_pinning = PinningTier::kAll

Or did this solution also become invalid given that we are filtering out most seeks? @ajkr what do you think?

Also, looking at the graphs above, is it weird that block cache memory usage didn't go up at all? Index blocks end up in block cache as they are read, so I would expect block cache usage to also spike. Or maybe I'm misunderstanding that? For example, in my first occurrence of this, block cache usage matched the total memory usage: #12579 (comment). But other times (most times), it doesn't: block cache is fine, but total memory usage spikes to the max.

@zaidoon1 (Contributor Author)

zaidoon1 commented May 22, 2024

Can point lookups also cause this issue, or is this unique to seeks/prefix lookups? During the entire time memory was maxed out, the endpoint that performs point lookups was seeing a spike in requests:
Screenshot 2024-05-21 at 9 02 35 PM

This endpoint does two things:

  1. Multiget a bunch of keys from the queue cf. However, the queue cf is empty 99% of the time since we process as fast as possible, so I don't think that's it (unless something heavy still happens even when the cf is empty?).
  2. Get the value of a key from the key cf using a RocksDB PinnableSlice. Can this be doing something bad?

@ajkr (Contributor)

ajkr commented May 22, 2024

What are the data sources for the "Number of SST Files" and "FDs" metrics? I previously assumed they would be similar, but in the most recent charts I noticed there are 15-20 SST files but up to 1K FDs.

@zaidoon1 (Contributor Author)

The FDs metric comes from: container_file_descriptors{app_name="app_running_rocksdb"}
Number of SST Files: count(rocksdb_livefiles)

@zaidoon1 (Contributor Author)

Is it possible that it's a prefetching issue? Over 90% of the prefix lookups are for keys that don't exist, but for the ones that do exist, when I do my prefix check using a prefix iterator, does the iterator try to fetch more data in anticipation that I will read it, so that I end up pulling more data than I need? If there is prefetching happening, is there a way to limit it, since I only care about the first kv? Would lowering the readahead size help here?

@zaidoon1 (Contributor Author)

zaidoon1 commented May 23, 2024

Is there another metric I should add that would at least show the problem? For example, without getting a flamegraph, and just looking at the existing metrics I'm tracking, we wouldn't know that we are doing any lookups at all; it looks like everything is being filtered out by the filter setup, but that is not true. It feels like there is an observability gap.

@zaidoon1 (Contributor Author)

zaidoon1 commented May 23, 2024

@ajkr So I added more metrics and here is what they look like when the issue happens:

Screenshot 2024-05-23 at 12 35 04 AM Screenshot 2024-05-23 at 12 35 18 AM Screenshot 2024-05-23 at 12 35 33 AM Screenshot 2024-05-23 at 12 35 45 AM Screenshot 2024-05-23 at 12 36 06 AM Screenshot 2024-05-23 at 12 36 29 AM Screenshot 2024-05-23 at 12 36 40 AM

a few questions:

  1. Does it make sense that block cache usage is low but memory usage is maxed out?
  2. I added more seek-related metrics; none look interesting to me, but maybe something sticks out to you?
  3. I added a metric for block cache index redundant adds per your suggestion. I can see that it's happening constantly but at a very low rate, so I'm not sure that rate is enough to push memory usage from the typical 2-3 GB to the max allocated memory of 10 GB, and even then, I would expect us to max out the block cache, right?
