High Memory Usage/ LRU cache size is not being respected #12579
@ajkr any idea what could have happened here in both cases? I guess the easiest one to answer is how/why rocksdb went above the allocated LRU cache size? Unfortunately, I don't have any other LOGs to share because of the issues described here: #12584 (nothing showed up in the WARN level logs so I don't know what was happening at the time)
I was thinking of using strict LRU capacity but it looks like reads (and writes?) will fail if the capacity is hit, which is not expected. Why don't we evict from cache instead of failing new reads?
Looks like it happens when we have lots of tombstones. This appears to match what was happening in #2952, although the issue there was due to some compaction bug. I'm wondering if there is another compaction bug at play here.
What allocator are you using? RocksDB tends to perform poorly with glibc malloc, and better with allocators like jemalloc, which is what we use internally. Reference: https://smalldatum.blogspot.com/2015/10/myrocks-versus-allocators-glibc.html
We evict from cache as long as we can find clean, unpinned entries to evict. Block cache only contains dirty entries when […]. That said, we try to evict from cache even if you don't set strict LRU capacity. That setting lets you choose the behavior in cases where there is nothing evictable: fail the operation (strict), or allocate more memory (non-strict).
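The strict vs. non-strict distinction described above can be sketched with a toy LRU cache. This is a conceptual Python sketch of the semantics, not RocksDB's actual implementation; the class and names are made up for illustration:

```python
from collections import OrderedDict

class ToyLRUCache:
    """Toy model of an LRU block cache's capacity semantics.

    strict=True  -> an insert fails when nothing is evictable
    strict=False -> the cache over-allocates past its capacity instead
    """
    def __init__(self, capacity, strict):
        self.capacity = capacity
        self.strict = strict
        self.entries = OrderedDict()  # key -> (charge, pinned)
        self.usage = 0

    def insert(self, key, charge, pinned=False):
        # Evict unpinned entries in LRU order until the new entry fits.
        while self.usage + charge > self.capacity:
            victim = next((k for k, (_, p) in self.entries.items() if not p), None)
            if victim is None:
                break  # nothing evictable left
            c, _ = self.entries.pop(victim)
            self.usage -= c
        if self.usage + charge > self.capacity:
            if self.strict:
                return False  # strict capacity limit: fail the operation
            # non-strict: allocate anyway; usage exceeds capacity
        self.entries[key] = (charge, pinned)
        self.usage += charge
        return True

strict = ToyLRUCache(capacity=100, strict=True)
loose = ToyLRUCache(capacity=100, strict=False)
for cache in (strict, loose):
    cache.insert("pinned-index", 80, pinned=True)  # pinned: never evictable
ok_strict = strict.insert("block", 40)
ok_loose = loose.insert("block", 40)
print(ok_strict, strict.usage)  # False 80  -> insert failed, capacity respected
print(ok_loose, loose.usage)    # True 120  -> insert succeeded, capacity exceeded
```

The point is that strictness only matters once nothing is evictable: in the normal case both variants evict the same way.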
I'm using jemalloc for the allocator (I've double checked this). In the last instance this happened (screenshot above), block cache was not maxing out beyond what is configured, so I don't think that's the issue. I started seeing this issue happen when I enabled the part of the system that does the "is there an index that matches prefix x" check, which is a prefix seek that only looks at the first kv returned. From the last graph I posted, it also appears to happen when there are a lot of tombstones, so the seek + tombstones combination is very odd/suspect to me (similar to the problem reported in the rocksdb ticket I linked to). Right now, I'm doing a load test, sending 5K requests with unique prefixes that are guaranteed to not find any matching kv.
Thanks for the details. Are the 5K requests in parallel? Does your memory budget allow indexes to be pinned in block cache ([…])?
Also here is the db options I have configured: |
For more details on this problem, see the stats added in #6681. It looks like you have statistics enabled, so you might be able to check those stats to confirm or rule out whether that is the problem. If it is the problem, unfortunately I don't think we have a good solution yet. That is why I was wondering if you have enough memory to pin the indexes so they don't risk thrashing. Changing […]
OK, this is good to know, I'll definitely investigate this part. I would like to confirm: if we assume that's the problem, then my options are:
I think the main reason I set cache_index_and_filter_blocks to true is to cap/control memory usage (but also that's when I thought I had jemalloc enabled when it wasn't, so my issues at the time could be different).
Regarding this part, is there a way/formula to know how much memory it will cost to pin the indexes? Or is this a try-and-find-out kind of thing? Is it any different/better to use WriteBufferManager to control memory usage vs cache_index_and_filter_blocks?
There is a property: […]
Cool, I'll check this out. Just to double check, is unpartitioned_pinning = PinningTier::kAll more preferred than setting cache_index_and_filter_blocks to false?
It is preferable if you want to use our block cache capacity setting for limiting RocksDB's total memory usage.
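The trade-off here can be made concrete with some rough accounting. The numbers below are assumptions for illustration only (the 1.5 GiB figure is the cache size from this issue; the rest are made up): with cache_index_and_filter_blocks = true, index/filter memory is charged to the block cache and therefore bounded by its capacity, while with false, table readers hold index/filter blocks on the heap outside any cache limit.

```python
# Conceptual accounting sketch (assumed numbers, not measurements).
block_cache_capacity = 1_536  # MiB, e.g. the 1.5 GiB cache from this issue
index_and_filters = 400       # MiB of index/filter blocks, assumed
data_blocks = 1_200           # MiB of hot data blocks, assumed

# cache_index_and_filter_blocks = true: everything competes inside the cache,
# so total usage is bounded by the configured capacity.
charged_to_cache = min(index_and_filters + data_blocks, block_cache_capacity)

# cache_index_and_filter_blocks = false: index/filter live on the heap in the
# table readers, on top of whatever the block cache uses (worst case shown).
outside_cache_total = block_cache_capacity + index_and_filters

print(charged_to_cache)     # 1536 -> bounded by the capacity setting
print(outside_cache_total)  # 1936 -> can exceed the configured budget
```

That bound is exactly why caching (or pinning) metadata blocks makes the block cache capacity a meaningful total-memory knob; the cost is that metadata then competes with data blocks for space.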
great! Thanks for confirming, once the c api changes land I'll experiment with this and report back |
A few other questions that just came to my mind:
|
Yes, prefix filter should prevent thrashing for index block lookups. I didn't notice earlier that it's already enabled. Then, it's surprising that […]
I don't, unless that is set by default in rocksdb under the hood? In the rust library, I call: […]
Maybe I should start by looking at the ribbon filter metrics? Is there a specific metric I should be looking at to see if things are working as they should?
I found the following:
I couldn't find anything specific to the ribbon filter, so my guess is the "bloom" filter stats would also be populated for the ribbon filter. If so, which would be the most useful for me to add a metric for to track this issue? Or maybe seek stats: rocksdb/include/rocksdb/statistics.h, lines 457 to 481 @ 36ab251
Looks like the […]. If you want to measure an operation's stats in isolation, we have […]
Something that I'm wanting to make sure of: is the […]
It's as wide a scope as the […]
great! thanks for confirming, I'm going to track:
and will report back what I see
So while the issue didn't happen again (I'm still waiting for it), I think I've narrowed down the part of the system that causes this. I initially told you that we do a prefix check to see whether there exists a key in rocksdb that starts with some prefix x provided by some external system. Most of the time there is none, so the bloom filter does the job. Now, the part I forgot to mention, which is very relevant, is that in the event there exists a key, we store this prefix key in the "queue" cf. A background service then iterates over each key in the queue cf and fetches all matching kvs by prefix. After adding metrics, we see that one prefix can match 3M kvs. Once we find those keys, we delete the file that is referenced by the kvs and eventually delete the kvs. The calls we do are basically:
We get keys in batches of 1K to process them, which is why we either start from the first key that matches the prefix or continue from where we last left off. Given this background, I'm thinking this is definitely causing the thrashing issue, as we are iterating over millions of keys. Given that the intention is to then delete those keys, maybe I should disable caching while iterating over them so that rocksdb doesn't try to cache those lookups, as they are useless? To be specific, I'm thinking of setting:
to false before iterating. What do you think @ajkr ?
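The batching scheme described above (process 1K keys at a time, resuming from the last key seen) can be sketched like this. A plain sorted Python list stands in for the ordered CF; a real implementation would use a RocksDB prefix iterator, optionally with fill_cache disabled in its ReadOptions. Names and key layout are hypothetical:

```python
import bisect

def scan_prefix_in_batches(sorted_keys, prefix, batch_size=1000, resume_after=None):
    """Yield batches of keys matching `prefix`, resuming after `resume_after`."""
    if resume_after is not None:
        i = bisect.bisect_right(sorted_keys, resume_after)  # continue past cursor
    else:
        i = bisect.bisect_left(sorted_keys, prefix)  # seek to first prefix match
    batch = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        batch.append(sorted_keys[i])
        if len(batch) == batch_size:
            yield batch
            batch = []
        i += 1
    if batch:
        yield batch

# 2,500 keys under the prefix, plus one unrelated key.
keys = sorted(f"job/{n:05d}" for n in range(2500)) + ["other/1"]
batches = list(scan_prefix_in_batches(keys, "job/"))
print([len(b) for b in batches])  # [1000, 1000, 500]

# Resuming from the last key of the first batch skips what was already done:
resumed = list(scan_prefix_in_batches(keys, "job/", resume_after=batches[0][-1]))
print([len(b) for b in resumed])  # [1000, 500]
```

The cursor-based resume is what makes each pass cheap to restart, but every pass still walks every remaining key under the prefix, which is where the millions of block reads come from.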
We have the same issue: recently we moved from rocksdb 7.x + CLOCK_CACHE to rocksdb 8.x + LRU_CACHE. The limit (3GB) is not respected at all: the process keeps allocating memory until it is OOM-killed. Our setup is: rocksdbjni 8.x + range scan workload
@zaidoon1 FYI: switching to HYPER_CLOCK_CACHE fixed the memory issue in our case. Maybe it is a valid workaround for you too (though we were using CLOCK_CACHE with rocksdb 7.x).
Thanks! That's interesting to know, maybe clock cache is more resistant to thrashing? @ajkr any idea? Or it could be that your issue is different than mine. In my case, I'm pretty sure it's the iterators that are reading 1M+ kvs, and disabling caching for those should help. In general, I plan on switching to hyper clock cache once the auto-tuning parameter of hyper clock cache (rocksdb/include/rocksdb/cache.h, line 380 @ 4eaf628) […]
@zaidoon1 you're welcome! :-) Anyway, we are dealing with smaller range scans (up to a few thousand entries) with caching enabled, and 0 as estimated_entry_charge is working fine (maybe it could be better in 9.x).
Sorry I'm not sure. It sounds like a generally good idea to use fill_cache=false for a scan-and-delete workload because the scanned blocks will not be useful for future user reads. But, I am not sure how much it will help with this specific problem. The CPU profile is mostly showing index blocks, which leads me to think there is something special about those. If you are pinning index blocks in cache, the index blocks will already be cached so the logic related to fill_cache is bypassed. |
I don't think I'm pinning them (I have not set unpartitioned_pinning = PinningTier::kAll since the c api change didn't land yet in a rocksdb release, and everything else is default, which I don't think enables pinning?). I'm setting cache_index_and_filter_blocks to true, which should just cache them, but as I scan and fill the cache with useless data, I expect to be kicking things out, including index and filter blocks, so this issue would be seen? Or am I misunderstanding how this works?
Oh ok if you're not pinning them then fill_cache will make a difference on index blocks. Whether it's a positive or negative difference, I don't know. The reason it could be negative is if 5k iterators simultaneously load the same index block without putting it in block cache, the memory usage could be a bit higher than if they had accessed a shared copy of that block in block cache. But let's see. |
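The potential downside described above is easy to put rough numbers on. The 256 KiB index block size below is an assumption for illustration, not a measured value from this issue:

```python
# Rough arithmetic: 5,000 concurrent iterators each loading a private copy of
# the same index block (fill_cache=false) versus sharing one cached copy.
iterators = 5_000
index_block_bytes = 256 * 1024  # assumed index block size

private_copies = iterators * index_block_bytes
shared_copy = index_block_bytes

print(f"private copies: {private_copies / 2**30:.2f} GiB")   # ~1.22 GiB
print(f"shared cached copy: {shared_copy // 2**10} KiB")     # 256 KiB
```

So with truly simultaneous access the uncached path can transiently cost orders of magnitude more memory than the shared cached copy, and none of it is counted against the block cache capacity.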
Would this increased memory exceed the allocated memory? I assume this is not something that is capped by block cache since the reads are not getting cached? |
Right, it won't be capped by block cache. |
So this happened again, which means the fill_cache idea I had didn't work. Here are the values of the metrics I added from #12579 (comment), as well as other metrics/state of things. Notice that the filtering is doing a great job and pretty much filtering out all seeks. At this point, I don't have any more ideas, are there any other metrics/stats you suggest I add? The next step, I think, is waiting for the next rocksdb release that will let me set via c api:
Or did this solution also become invalid given that we are filtering out most seeks? @ajkr what do you think? Also, looking at the graphs above, is it weird that block cache memory usage didn't go up at all even though index blocks end up in block cache as they are read? I would expect block cache to also spike, or maybe I'm misunderstanding that. For example, in my first occurrence of this, block cache matched the total memory usage: #12579 (comment). But other times/most times it doesn't, and block cache is fine while total memory usage spikes to the max.
What are the data sources for "Number of SST Files" and "FDs" metric? I previously assumed they would be similar but in the most recent charts I realized there's 15-20 SST files but up to 1K FDs. |
The fds metric comes from:
Is it possible that it's a prefetching issue? Over 90% of the prefix lookups are for keys that don't exist, but for the ones that do, when I do my prefix check using a prefix iterator, does the iterator try to fetch more data in anticipation that I would read it, so that I end up pulling more data than I need? If there is prefetching happening, is there a way to limit it, since I only care about the first kv? Would lowering the readahead size help here?
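For context on the readahead question: as I understand RocksDB's automatic iterator readahead, it kicks in after a few sequential block reads on a file, starting small and doubling per read up to a cap (BlockBasedTableOptions::max_auto_readahead_size, 256 KiB by default). The exact trigger and sizes are version-dependent; this sketch only illustrates the shape of that growth, not the real code path:

```python
# Sketch of doubling readahead growth up to a cap (assumed 8 KiB start,
# 256 KiB cap, mirroring RocksDB defaults as of recent versions).
def readahead_sizes(num_reads, start=8 * 1024, cap=256 * 1024):
    sizes, current = [], start
    for _ in range(num_reads):
        sizes.append(current)
        current = min(current * 2, cap)
    return sizes

# A seek that touches only the first kv never ramps up; a long sequential
# scan reaches the cap within a handful of reads:
print([s // 1024 for s in readahead_sizes(7)])  # [8, 16, 32, 64, 128, 256, 256]
```

If that heuristic applies here, a prefix check that reads one kv should trigger little readahead on its own; it is the long background scans that would ramp up to the cap.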
Is there another metric I should add that would at least show the problem? For example, without getting a flamegraph, and just looking at the existing metrics I'm tracking, we wouldn't know that we are doing any lookups at all, since it looks like everything is being filtered out by the filter setup, but this is not true. It feels like there is an observability gap.
@ajkr So I added more metrics, and here is what they look like when the issue happens. A few questions:
options file:
OPTIONS.txt
I've set the LRU cache to 1.5GB for the "url" cf. However, all of a sudden, the service that runs rocksdb hit the max memory limit I allocated for the service, and I can see that the LRU cache for the "url" cf hit that limit:
This also caused the service to max out its cpu usage (likely because of back pressure).
flamegraph: