
Test flake: TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling #8070

Open
seizethedave opened this issue May 6, 2024 · 6 comments
seizethedave commented May 6, 2024

Describe the bug

This test failed for me in GitHub CI and succeeded on rerun. I was working on something unrelated in a branch whose parent commit on main was 5e93125.

19:19:22 Started distributor
=== RUN   TestDistributor/caching_unmarshal_data_disabled/wrong_labels_order
=== RUN   TestDistributor/caching_unmarshal_data_disabled/dropped_labels
19:19:23 querier: ts=2024-05-06T19:19:23.811704501Z caller=loader.go:120 level=warn msg="bucket index not found" user=e2e-user
=== RUN   TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling
    distributor_test.go:361: 
        	Error Trace:	/home/runner/work/mimir/mimir/integration/distributor_test.go:361
        	Error:      	Not equal: 
        	            	expected: "histogram_down_scaling_series =>\nCount: 12.000000, Sum: 18.400000, Buckets: [[-16,-4):2 [-4,-1):2 [-1,-0.25):1 [-0.001,0.001]:2 (0.25,1]:1 (1,4]:2 (4,16]:2] @[1715019559]"
        	            	actual  : "histogram_down_scaling_series =>\nCount: 12.000000, Sum: 18.400000, Buckets: [[-16,-8):1 [-8,-4):1 [-2,-1):2 [-1,-0.5):1 [-0.001,0.001]:2 (0.5,1]:1 (1,2]:2 (4,8]:1 (8,16]:1] @[1715019559]"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	 histogram_down_scaling_series =>
        	            	-Count: 12.000000, Sum: 18.400000, Buckets: [[-16,-4):2 [-4,-1):2 [-1,-0.25):1 [-0.001,0.001]:2 (0.25,1]:1 (1,4]:2 (4,16]:2] @[1715019559]
        	            	+Count: 12.000000, Sum: 18.400000, Buckets: [[-16,-8):1 [-8,-4):1 [-2,-1):2 [-1,-0.5):1 [-0.001,0.001]:2 (0.5,1]:1 (1,2]:2 (4,8]:1 (8,16]:1] @[1715019559]
        	Test:       	TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling
@krajorama krajorama self-assigned this May 9, 2024
krajorama commented

The code that does the down-scaling is deterministic, so the only thing I can think of is that loading the runtime config may go wrong. We store the config first (for serving it over the URL) and only then notify the modules that use it; I wonder if that means there's a short window where the updated config is visible on the HTTP interface but not yet active? https://github.com/grafana/dskit/blob/a1bba1277f06b41ff14cf54e5d4d31aebc92493b/runtimeconfig/manager.go#L209

krajorama commented

Reproduced with this diff:

diff --git a/vendor/github.com/grafana/dskit/runtimeconfig/manager.go b/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
index 84b69de76..e43c136a5 100644
--- a/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
+++ b/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
@@ -206,6 +206,7 @@ func (om *Manager) loadConfig() error {
        }
        om.configLoadSuccess.Set(1)
 
+       time.Sleep(1 * time.Second) // give listeners time to register
        om.setConfig(cfg)
        om.callListeners(cfg)

seizethedave commented May 13, 2024

We store the config first (for serving it over the URL) and only then notify the modules that use it; I wonder if that means there's a short window where the updated config is visible on the HTTP interface but not yet active?

I think you are onto it. I see two possibilities:

  1. The single goroutine per module isn't done processing the updated config.
  2. (a further symptom of number 1) If the reader side is behind on processing updates, the write side will discard an update.

These would cause inconsistent updates among listeners, but I don't think they would cause the tests to fail, because I don't think the listeners are in the "update path" of runtime config values like native_histograms_ingestion_enabled.

seizethedave commented

{waking up on this Monday morning 😸.}

But the race you highlighted with the one-second sleep would also cause it.

seizethedave commented

I guess your added second of sleep just takes longer than the tests are willing to wait for the new config to become available. 1 second, while an eternity, could also be a source of flakes. I'll open a PR to crank the wait time up to something like ten seconds.

seizethedave commented

Actually, I don't think that was the problem. Here's the run that failed: https://github.com/grafana/mimir/actions/runs/8974460472/job/24647018591

test.Poll will t.Fatal if the condition isn't met within the 1 second, and that didn't happen on that run.
