[Bug]: Operation based sampling strategies are not being returned for ratelimiting service strategies #5270

kuujis · 2024-03-12T11:08:00Z

introduce new flag with correct behavior [Bug]: Operation based sampling strategies are not being returned for ratelimiting service strategies #5270
change that flag to be on by default in the next release (1.57)
remove the flag and the old code in the following release (1.58)

(bug) When service has rate limiting strategy at the service level, and defines no per-operation strategies, the global per-operation strategies were also ignored
(change) When service has probabilistic service-level strategy, and defines no per-operation strategies, the global per-operation strategies were used, but the default sampling rate was taken from global settings, whereas arguably service-level setting should take priority.

Example: given this config file where services A and B do not define their own per-operation strategies:

jaeger/plugin/sampling/strategystore/static/fixtures/service_no_per_operation.json

Lines 1 to 25 in 34f6a25

    
           { 
        
                "service_strategies": [ 
        
                  { 
        
                    "service": "ServiceA", 
        
                    "type": "probabilistic", 
        
                    "param": 1.0 
        
                  }, 
        
                  { 
        
                   "service": "ServiceB", 
        
                   "type": "ratelimiting", 
        
                   "param": 3 
        
                 } 
        
                ], 
        
                "default_strategy": { 
        
                  "type": "probabilistic", 
        
                  "param": 0.2, 
        
                  "operation_strategies": [ 
        
                    { 
        
                      "operation": "/health", 
        
                      "type": "probabilistic", 
        
                      "param": 0.0 
        
                    } 
        
                  ] 
        
                } 
        
              }

We expect ServiceA to have the following strategy:

it inherits per-operation strategies from global config
except for defaultSamplingProbability which is kept the same 1.0 as at the service-level

{
  "probabilisticSampling": {
    "samplingRate": 1
  },
  "operationSampling": {
    "defaultSamplingProbability": 1,
    "perOperationStrategies": [
      {
        "operation": "/health",
        "probabilisticSampling": {
          "samplingRate": 0.1
        }
      }
    ]
  }
}

We expect ServiceB to have the following strategy:

it inherits per-operation strategies from global config as is

{
  "strategyType": 1,
  "rateLimitingSampling": {
    "maxTracesPerSecond": 3
  },
  "operationSampling": {
    "defaultSamplingProbability": 0.2,
    "perOperationStrategies": [
      {
        "operation": "/health",
        "probabilisticSampling": {
          "samplingRate": 0.1
        }
      }
    ]
  }
}

Original ticket below:

What happened?

While investigating open-telemetry/opentelemetry-java#5504 I was not able to identify any issues with handling of the responses received from jaeger for sampling strategies related to a given service name. However - I found that if service sampling strategy is "ratelimiting" then default operation sampling strategies were not being returned.

Steps to reproduce

Run plugin\sampling\strategystore\static\strategy_store_test.go\TestServiceNoPerOperationStrategies unit test after replacing its code with the snippet below.
The strategy store file used implies that ServiceB should have probablistic operation sampling of 0.0 for operation "/health", which is not the case, as there are no associated operation sampling strategies for ServiceB.

Expected behavior

As implied in https://www.jaegertracing.io/docs/1.28/sampling/#collector-sampling-configuration I would expect the following changes to plugin\sampling\strategystore\static\strategy_store_test.go\TestServiceNoPerOperationStrategies unit test case should result in a passing test:

func TestServiceNoPerOperationStrategies(t *testing.T) {
	store, err := NewStrategyStore(Options{StrategiesFile: "fixtures/service_no_per_operation.json"}, zap.NewNop())
	require.NoError(t, err)

	s, err := store.GetSamplingStrategy(context.Background(), "ServiceA")
	require.NoError(t, err)
	require.NotNil(t, s.OperationSampling)
	os := s.OperationSampling
	assert.EqualValues(t, 1, os.DefaultSamplingProbability)
	require.Len(t, os.PerOperationStrategies, 1)
	assert.Equal(t, "/health", os.PerOperationStrategies[0].Operation)
	assert.EqualValues(t, 0.0, os.PerOperationStrategies[0].ProbabilisticSampling.SamplingRate)
	expected := makeResponse(api_v2.SamplingStrategyType_PROBABILISTIC, 1.0)
	assert.Equal(t, *expected.ProbabilisticSampling, *s.ProbabilisticSampling)

	s, err = store.GetSamplingStrategy(context.Background(), "ServiceB")
	require.NoError(t, err)
	require.NotNil(t, s.OperationSampling)
	os = s.OperationSampling
	assert.EqualValues(t, 0.2, os.DefaultSamplingProbability)
	require.Len(t, os.PerOperationStrategies, 1)
	assert.Equal(t, "/health", os.PerOperationStrategies[0].Operation)
	assert.EqualValues(t, 0.0, os.PerOperationStrategies[0].ProbabilisticSampling.SamplingRate)
	expected = makeResponse(api_v2.SamplingStrategyType_RATE_LIMITING, 3)
	assert.Equal(t, *expected.RateLimitingSampling, *s.RateLimitingSampling)
}

Relevant log output

N/A

Screenshot

N/A

Additional context

I believe this behaviour was introduced as part of #2230 as an outcome of #2171 discussion.
this issue is created as per contribution guidelines.
I'm working on setting up a test to verify the fix I made (kuujis#1) with some basic service setup and work through other requirements (first time contributing).

I would also appreciate feedback on the solution itself, since depending on how one looks at it - it might be considered breaking contract, especially since the code was not changed for a while.
The alternative I was considering, but decided against was to call jaeger twice from OTEL library in order to get both "service" specific and the "default" configurations and then merge them, but it felt wrong, since I felt this is incorrect behavior on Jager side.

Jaeger backend version

v1.35.0

SDK

io.opentelemetry:opentelemetry-bom:1.35.0

Pipeline

OTEL SDK -> Jaeger.GetSamplingStrategy -> exporter to console

Stogage backend

No response

Operating system

No response

Deployment model

No response

Deployment configs

No response

The text was updated successfully, but these errors were encountered:

yurishkuro · 2024-03-12T14:24:28Z

Sorry, I don't understand the issue, as you keep talking about unit tests. Please describe the use case, show configuration used, and explain what behavior you are expecting and getting instead.

kuujis · 2024-03-12T16:05:08Z

Sorry, I'll try to elaborate.

I originally wanted to solve an issue in OpenTelemetry SDK - open-telemetry/opentelemetry-java#5504, which used the following strategies_sampling.json:

{
  "service_strategies": [
    {
      "service": "foobar",
      "type": "ratelimiting",
      "param": 2
    }
  ],
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.2,
    "operation_strategies": [
      {
        "operation": "/metrics",
        "type": "probabilistic",
        "param": 0.0
      }
    ]
  }
}

I made a small java service and added OTEL tooling with all-in-one Jaeger backend and local console output for simplicity sake - https://github.com/kuujis/opentelemetry-java-5504. There are commands to build and run images, which should make the problem clear(er?) if someone feels like checking it out.

What I observed after some debugging is that GetSamplingStrategy gRCP call from OTEL towards Jaeger all-in-one app does not return the default strategies, which should be applied to specific operation ("/metrics" in the example above) IF the service sampling strategy is "ratelimiting". The default operations level strategies are added as expected for "probablistic" service sampling strategies.

After investigating why that is so - I found the part which I believe is the cause of the problem, and since its much simpler to reproduce the issue via an updated unit test which is covering that specific situation - I went with that in the bug description, instead of the whole repo and the whole story of how the issue was reproduced.

Hope this clarifies things.

…evel strategies for ratelimiting service configurations Signed-off-by: Kazimieras Pociunas <kazimieras.pociunas@gmail.com>

yurishkuro · 2024-03-30T22:32:16Z

@kuujis I've been thinking how better to explain this issue. The way the Jaeger SDKs used to interpret a sampling configuration was like this:

if per-operation section is defined, it is used for everything. If the operation does not match any of the listed operations, then defaultSamplingRate is applied (along with lower/upper bounds).
otherwise, service-level strategy is applied to everything (also used in SDKs that do not support per-operation sampling)

So in your example when one of the services defines rate-limiting strategy, as soon as the per-operation sampling strategy is added the rate-limiting section becomes completely irrelevant, but only for the SDKs that actually support per-operation. This could be a big unexpected breaking change for some existing installations where if they are currently getting rate limiting on a service (but with SDK capable of per-operation) then after this change the whole service will switch to default per-operation strategy, which could mean much fewer or much more traces, no way to tell.

kuujis · 2024-03-30T23:07:18Z

I pretty much got similar impression, hence the comment in the original description on the possible "breaking of contract". The possibility of this having an unintended consequences is definitely worth considering.
Maybe it would be a good idea to provide a temporary switch via env variable, which would disable the effects of the updated functionality, so that if there are issues - there would also be a quick "emergency" solution.
The idea would be to make provide support for an upgrade flow which might look like this:

upgrade Jaeger to version n
discover the change and its impact
disable it via the flag
work on a fix in configuration files (might take some time to do right)
deploy fixed configs
remove the flag

And then at the "n+1" version - flag is removed and the fix, even if it is a breaking change, becomes the new updated behavior. Not sure how much such an approach is worth it or would be welcome though.

yurishkuro · 2024-03-31T22:28:00Z

@kuujis I think we should follow our CLI flags deprecation policy, which we can adapt as follows:

in this PR introduce a new CLI flag --sampling.strategies.bugfix-5270, which if set to true would enable the new behavior (this means the unit tests will need to test both old and new ones. I would probably just clone the original function with toBeRemoved prefix). By default this flag should be false, but when false it should log a warning that this behavior will be changed in the future (see policy link above for examples).
After 2 releases we can change the default of the flags to true, and change the warning that this flag will be deprecated altogether in another deprecation cycle
Finally remove the flag and the old code / tests.

kuujis · 2024-04-01T11:37:18Z

Looks like a good plan, I'll add these.

… enabling via sampling.strategies.bugfix-5270=true. Signed-off-by: Kazimieras Pociunas <kazimieras.pociunas@gmail.com>

…egies without them (#5277) ## Which problem is this PR solving? - Part 1 of [#5270] ## Description of the changes - See issue for description of the bug and of the two changed behaviors - Default `probabilistic` operation level strategies will now be returned if service strategy is of `ratelimiting` type - If the service itself has `probabilistic` strategy, its sampling rate takes priority for operation-level definition ## How was this change tested? - Via updated unit test - By building local all-in-one executable and using it as a otel instrumentation backend for simple java service https://github.com/kuujis/opentelemetry-java-5504 and using the below `sampling_strategies.json` from open-telemetry/opentelemetry-java#5504 ``` { "service_strategies": [ { "service": "foobar", "type": "ratelimiting", "param": 2 } ], "default_strategy": { "type": "probabilistic", "param": 0.2, "operation_strategies": [ { "operation": "metrics", "type": "probabilistic", "param": 0.0 } ] } } ``` ## Checklist - [x] I have read https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md - [x] I have signed all commits - [x] I have added unit tests for the new functionality - [x] I have run lint and test steps successfully - for `jaeger`: `make lint test` - for `jaeger-ui`: `yarn lint` and `yarn test` --------- Signed-off-by: Kazimieras Pociunas <kazimieras.pociunas@gmail.com> Signed-off-by: Yuri Shkuro <github@ysh.us> Co-authored-by: Yuri Shkuro <github@ysh.us> Co-authored-by: Yuri Shkuro <yurishkuro@users.noreply.github.com>

…egies without them (jaegertracing#5277) ## Which problem is this PR solving? - Part 1 of [jaegertracing#5270] ## Description of the changes - See issue for description of the bug and of the two changed behaviors - Default `probabilistic` operation level strategies will now be returned if service strategy is of `ratelimiting` type - If the service itself has `probabilistic` strategy, its sampling rate takes priority for operation-level definition ## How was this change tested? - Via updated unit test - By building local all-in-one executable and using it as a otel instrumentation backend for simple java service https://github.com/kuujis/opentelemetry-java-5504 and using the below `sampling_strategies.json` from open-telemetry/opentelemetry-java#5504 ``` { "service_strategies": [ { "service": "foobar", "type": "ratelimiting", "param": 2 } ], "default_strategy": { "type": "probabilistic", "param": 0.2, "operation_strategies": [ { "operation": "metrics", "type": "probabilistic", "param": 0.0 } ] } } ``` ## Checklist - [x] I have read https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md - [x] I have signed all commits - [x] I have added unit tests for the new functionality - [x] I have run lint and test steps successfully - for `jaeger`: `make lint test` - for `jaeger-ui`: `yarn lint` and `yarn test` --------- Signed-off-by: Kazimieras Pociunas <kazimieras.pociunas@gmail.com> Signed-off-by: Yuri Shkuro <github@ysh.us> Co-authored-by: Yuri Shkuro <github@ysh.us> Co-authored-by: Yuri Shkuro <yurishkuro@users.noreply.github.com> Signed-off-by: Vamshi Maskuri <gwcchintu@gmail.com>

kuujis added the bug label Mar 12, 2024

kuujis added a commit to kuujis/jaeger that referenced this issue Mar 15, 2024

Fix for jaegertracing#5270 - return default probabilistic operation l…

bce0053

…evel strategies for ratelimiting service configurations Signed-off-by: Kazimieras Pociunas <kazimieras.pociunas@gmail.com>

kuujis mentioned this issue Mar 15, 2024

[sampling] Fix merging of per-operation strategies into service strategies without them #5277

Merged

4 tasks

yurishkuro added the changelog:breaking-change Change that is breaking public APIs or established behavior label Mar 30, 2024

kuujis added a commit to kuujis/jaeger that referenced this issue Apr 6, 2024

Fix for jaegertracing#5270 is disabled by default. Added a switch for…

7b1dec3

… enabling via sampling.strategies.bugfix-5270=true. Signed-off-by: Kazimieras Pociunas <kazimieras.pociunas@gmail.com>

kuujis mentioned this issue Apr 20, 2024

JaegerRemoteSampler inconsistent Implementation open-telemetry/opentelemetry-java#5504

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: Operation based sampling strategies are not being returned for ratelimiting service strategies #5270

[Bug]: Operation based sampling strategies are not being returned for ratelimiting service strategies #5270

kuujis commented Mar 12, 2024 •

edited by yurishkuro

yurishkuro commented Mar 12, 2024

kuujis commented Mar 12, 2024 •

edited

yurishkuro commented Mar 30, 2024

kuujis commented Mar 30, 2024

yurishkuro commented Mar 31, 2024 •

edited

kuujis commented Apr 1, 2024

[Bug]: Operation based sampling strategies are not being returned for ratelimiting service strategies #5270

[Bug]: Operation based sampling strategies are not being returned for ratelimiting service strategies #5270

Comments

kuujis commented Mar 12, 2024 • edited by yurishkuro

Original ticket below:

What happened?

Steps to reproduce

Expected behavior

Relevant log output

Screenshot

Additional context

Jaeger backend version

SDK

Pipeline

Stogage backend

Operating system

Deployment model

Deployment configs

yurishkuro commented Mar 12, 2024

kuujis commented Mar 12, 2024 • edited

yurishkuro commented Mar 30, 2024

kuujis commented Mar 30, 2024

yurishkuro commented Mar 31, 2024 • edited

kuujis commented Apr 1, 2024

kuujis commented Mar 12, 2024 •

edited by yurishkuro

kuujis commented Mar 12, 2024 •

edited

yurishkuro commented Mar 31, 2024 •

edited