CI Failure (size overshoot) in storage_e2e_single_thread_rpunit::test_offset_range_size_incremental #18396

Closed
WillemKauf opened this issue May 10, 2024 · 2 comments · Fixed by #18420
Assignees
Labels
area/storage ci-failure ci-rca/test CI Root Cause Analysis - Test Issue kind/bug Something isn't working rpunit unit test ci-failure (not ducktape)

Comments

@WillemKauf
Contributor

WillemKauf commented May 10, 2024

https://buildkite.com/redpanda/redpanda/builds/48913#018f6021-c8fb-4459-892f-6a23f87c0b59

/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0158c20326338f1d9-1/redpanda/redpanda/src/v/storage/tests/storage_e2e_test.cc(4669): fatal error: in "test_offset_range_size_incremental": critical check res->on_disk_size < max_size has failed

JIRA Link: CORE-2906

@WillemKauf WillemKauf added kind/bug Something isn't working rpunit unit test ci-failure (not ducktape) area/storage ci-failure labels May 10, 2024
@abhijat
Contributor

abhijat commented May 13, 2024

Looking for target size 10240

t=1715309066365DEBUG 2024-05-10 02:43:33,671 [shard 0:main] storage - disk_log_impl.cc:2063 - Offset range size, first: 174271, target size: 10240/1024, lstat: {start_offset:0, committed_offset:266661, committed_offset_term:0, dirty_offset:266661, dirty_offset_term:0}

The first segment contributes 10158 bytes, so we still need 82 more bytes (10240 - 10158) to reach the target:

t=1715309066365DEBUG 2024-05-10 02:43:33,671 [shard 0:main] storage - disk_log_impl.cc:2175 - First offset 174271 located at 216823, offset range size: 10158, Segment offsets: {term:0, base_offset:173474, committed_offset:174312, dirty_offset:174312}
t=1715309066365DEBUG 2024-05-10 02:43:33,671 [shard 0:main] storage - disk_log_impl.cc:2270 - Setting offset range to 174271 - 174312

The next segment is 197738 bytes. Added to the 10158 bytes taken from the previous segment, the running total is 207896 bytes, which overshoots the 10240-byte target by 207896 - 10240 = 197656 bytes:

t=1715309066365DEBUG 2024-05-10 02:43:33,671 [shard 0:main] storage - disk_log_impl.cc:2186 - Adding 197738 bytes to the offset range. Segment offsets: {term:0, base_offset:174313, committed_offset:175018, dirty_offset:175018}, current_size: 207896
t=1715309066365DEBUG 2024-05-10 02:43:33,671 [shard 0:main] storage - disk_log_impl.cc:2206 - Offset range size overshoot by 197656, current segment size 197738, current offset range size: 207896

After looking up find_above_size_bytes(82) in the segment index, we end up including 41195 bytes of this segment instead of the 82 we needed:

t=1715309066365DEBUG 2024-05-10 02:43:33,671 [shard 0:main] storage - disk_log_impl.cc:2241 - Setting offset range to 174271 - 174464, 41195 bytes of the last segment are included
t=1715309066365DEBUG 2024-05-10 02:43:33,671 [shard 0:main] storage - disk_log_impl.cc:2281 - Discovered offset size: 51353, last included offset: 174463
t=1715309066365INFO  2024-05-10 02:43:33,671 [shard 0:main] storage_e2e_test - storage_e2e_test.cc:4667 - Requested 10240(1024min, 51200max) bytes, got 51353 bytes for offset 174463

The test case uses a lenient max size to accommodate the step size in the index. But the delta here (41195 bytes) is larger than the index step size the test assumes, so we need to look at the index entries to see why this deviation occurs.
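The arithmetic in the log lines above can be re-checked with a short Python sketch. It is only a model of the numbers in this issue, not the code in disk_log_impl.cc; the constant and function names are made up for illustration, and every value is copied from the logs.

```python
# Values copied from the log lines in this issue.
TARGET_SIZE = 10240          # "target size: 10240/1024"
MAX_SIZE = 51200             # upper bound asserted by the test (51200max)
FIRST_SEGMENT_BYTES = 10158  # bytes from offset 174271 to the end of the first segment
NEXT_SEGMENT_BYTES = 197738  # full size of the next segment
INCLUDED_BYTES = 41195       # bytes of the next segment that were actually included


def walk_through():
    """Recompute the quantities reported by the debug logs (hypothetical helper)."""
    remaining = TARGET_SIZE - FIRST_SEGMENT_BYTES
    running_total = FIRST_SEGMENT_BYTES + NEXT_SEGMENT_BYTES
    overshoot = running_total - TARGET_SIZE
    discovered = FIRST_SEGMENT_BYTES + INCLUDED_BYTES
    return {
        "remaining": remaining,
        "running_total": running_total,
        "overshoot": overshoot,
        "discovered": discovered,
        # Mirrors the failing check: res->on_disk_size < max_size
        "check_passes": discovered < MAX_SIZE,
    }


if __name__ == "__main__":
    for name, value in walk_through().items():
        print(f"{name}: {value}")
```

The discovered size comes out 153 bytes above the 51200-byte maximum, which is why the critical check fails.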

@abhijat
Contributor

abhijat commented May 13, 2024

Examining the index sizes during a failing run of the test:

       ---------------          0       ---------------          32447
       ---------------          0       ---------------          32447       ---------------          68581
       ---------------          0       ---------------          32447       ---------------          68581       ---------------          104841
       ---------------          0       ---------------          32447       ---------------          68581       ---------------          104841       ---------------          140975
       ---------------          0       ---------------          32447       ---------------          68581       ---------------          104841       ---------------          140975       ---------------          174420
       ---------------          0       ---------------          32447       ---------------          68581       ---------------          104841       ---------------          140975       ---------------          174420       ---------------          215217
       ---------------          0       ---------------          41120

The last step causes the failure: the index entry is at 41120 bytes, much larger than the 32KiB step the test relies on. This, plus the content taken from the previous segment, overshoots the 50KiB limit the test sets for failure.
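To make the step sizes explicit, here is a rough Python sketch that computes the gaps between the index positions printed above and models the lookup as "first indexed position above N bytes". The `first_position_above` helper is a hypothetical stand-in; the real semantics of find_above_size_bytes may differ (the logs report 41195 included bytes, while the index entry sits at 41120, and this sketch does not account for that 75-byte difference).

```python
from bisect import bisect_right

KIB = 1024
ASSUMED_STEP = 32 * KIB   # step size the test relies on, per the comment above
MAX_SIZE = 51200          # 50KiB limit asserted by the test
FIRST_SEGMENT_BYTES = 10158

# Index positions copied from the printed output in this comment.
EARLIER_INDEX = [0, 32447, 68581, 104841, 140975, 174420, 215217]
FAILING_INDEX = [0, 41120]


def step_sizes(positions):
    """Gaps in bytes between consecutive index entries."""
    return [b - a for a, b in zip(positions, positions[1:])]


def first_position_above(positions, size_bytes):
    """Smallest indexed position strictly greater than size_bytes, or None.

    Hypothetical model of the index lookup, not the Redpanda implementation.
    """
    i = bisect_right(positions, size_bytes)
    return positions[i] if i < len(positions) else None


if __name__ == "__main__":
    print("earlier steps:", step_sizes(EARLIER_INDEX))
    hit = first_position_above(FAILING_INDEX, 82)
    print("lookup for 82 bytes lands at:", hit)
    print("lower bound on total:", FIRST_SEGMENT_BYTES + hit, "vs max", MAX_SIZE)
```

In this data, five of the six gaps in the longer index already exceed 32768 bytes, so the 32KiB assumption looks optimistic in general; the 41120-byte entry is simply the first one large enough to push the total past the limit.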

@abhijat abhijat changed the title CI Failure (key symptom) in storage_e2e_single_thread_rpunit CI Failure (size overshoot) in storage_e2e_single_thread_rpunit::test_offset_range_size_incremental May 13, 2024
@piyushredpanda piyushredpanda added the ci-rca/test CI Root Cause Analysis - Test Issue label May 16, 2024
3 participants