
Unusual spike in response with 499 status code #2066

Closed
kishan-vachhani opened this issue May 14, 2024 · 24 comments
Labels
Routing (Ocelot feature: Routing), waiting (Waiting for answer to question or feedback from issue raiser)

Comments

@kishan-vachhani

kishan-vachhani commented May 14, 2024

Expected Behavior

The response from the downstream service should be forwarded for the incoming request, and the gateway should not return a 499 status code.

Actual Behavior

Random requests to the downstream service are being canceled, resulting in the gateway returning a 499 status code.

Steps to Reproduce the Problem

I don't have the exact steps to reproduce this issue, but it seems to occur more frequently for routes with high incoming request rates, such as webhooks. These requests are primarily automated, reducing the likelihood of manual cancellation of the CancellationToken.

Upon reviewing the change log for the major release of version 23.0.0, I noticed updates to the downstream implementation for performance enhancement. This includes the introduction of HttpMessageInvoker and the addition of PooledConnectionIdleTimeout. Could these changes be contributing to the issue?
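For context, those two changes roughly correspond to the pattern below (a simplified sketch, not Ocelot's actual code; the timeout value is illustrative only):

using System;
using System.Net.Http;

// Downstream calls go through an HttpMessageInvoker over a SocketsHttpHandler;
// PooledConnectionIdleTimeout controls how long idle pooled connections are kept alive.
var handler = new SocketsHttpHandler
{
    PooledConnectionIdleTimeout = TimeSpan.FromSeconds(30), // illustrative value only
};
var invoker = new HttpMessageInvoker(handler, disposeHandler: true);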

I will continue investigating and update this issue with any additional findings or if I can identify the exact steps to reproduce the problem. Any assistance in identifying the cause would be appreciated.

Specifications

Anyone facing the same issue is welcome to add more details or findings here.

@raman-m
Member

raman-m commented May 15, 2024

Hi Kishan!
Welcome to Ocelot world! 🐅

I don't believe that Ocelot contains a major bug that would manifest as a spike in your logs. However, let's brainstorm possibilities...


Actual Behavior

Random requests to the downstream service are being canceled, resulting in the gateway returning a 499 status code.

Could you point me to any code snippet where Ocelot forcibly cancels upstream or downstream requests on its own initiative?
Yes, we utilize the CancellationToken from the HTTP request to propagate it from upstream to downstream. It is imperative that Ocelot forwards this token. Thus, if the upstream client cancels the request (as detected by the end of communication in HTTP 1.1+ protocol), then Ocelot is also required to cancel the downstream request(s). This may result in a surge of log entries. Beyond this, I am uncertain.
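In simplified form, the propagation looks roughly like this (a minimal sketch, not our actual middleware; the downstream URL is a placeholder):

using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public class ForwardingExample
{
    private static readonly HttpMessageInvoker Invoker = new(new SocketsHttpHandler());

    public async Task Invoke(HttpContext httpContext)
    {
        using var downstreamRequest = new HttpRequestMessage(HttpMethod.Get, "https://downstream-host/downstream/route");

        // RequestAborted is signalled when the upstream client disconnects or aborts the request,
        // so the downstream call is cancelled as well and the gateway reports 499.
        using var response = await Invoker.SendAsync(downstreamRequest, httpContext.RequestAborted);
        httpContext.Response.StatusCode = (int)response.StatusCode;
    }
}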


Expected Behavior

The response from the downstream service should be forwarded for the incoming request, and the gateway should not return a 499 status code.

Do you think that if the downstream request is cancelled, the service still returns a body that we need to relay back upstream? Interesting... Why do you need this response? What will you do with the technical data in the body? Isn't a 499 status code sufficient for the upstream client to make decisions?
@ggnaegi Gui, do we send back the response/body if the downstream request was cancelled by a CancellationToken?


Steps to Reproduce the Problem

I don't have the exact steps to reproduce this issue, but it seems to occur more frequently for routes with high incoming request rates, such as webhooks. These requests are primarily automated, reducing the likelihood of manual cancellation of the CancellationToken.

Cancelled requests can be replicated easily through page reloading from a browser. However, regarding webhooks, some systems may cancel an ongoing webhook request if there's a new state or it's re-triggered.
Which specific webhooks are in question? Do you know the cause behind the cancellation of an active webhook?


Upon reviewing the change log for the major release of version 23.0.0, I noticed updates to the downstream implementation for performance enhancement. This includes the introduction of HttpMessageInvoker and the addition of PooledConnectionIdleTimeout. Could these changes be contributing to the issue?

It's unclear. Have you attempted deploying Ocelot versions prior to 23.0.0? What were the outcomes? Did you observe similar spikes in the logs?

Theoretically, the new changes to the Ocelot kernel in v23.0.0 could affect webhook behavior, but further investigation is required.
@ggnaegi, what is your perspective?


I will continue investigating and update this issue with any additional findings or if I can identify the exact steps to reproduce the problem. Any assistance in identifying the cause would be appreciated.

Currently, we cannot determine the root cause as we do not oversee your environment. However, we can collaborate on brainstorming, and collectively, we can suggest the subsequent steps for identification.


Specifications

Understood!


If anyone are facing this the same issue are welcome to add more details or finding about this.

It's commonly believed that software built using SaaS or SOA architectures invariably encounters "spike" problems in its log graphs. 😄

@raman-m added the waiting and Routing labels on May 15, 2024
@ggnaegi
Member

ggnaegi commented May 15, 2024

@raman-m it's here:

catch (OperationCanceledException) when (httpContext.RequestAborted.IsCancellationRequested)
{
    Logger.LogDebug("operation canceled");
    if (!httpContext.Response.HasStarted)
    {
        httpContext.Response.StatusCode = 499;
    }
}

I haven't checked it yet, but what could cause this is the default request timeout, 90 seconds...

PooledConnectionIdleTimeout shouldn't be the cause imo, but we could investigate it further.

@raman-m
Member

raman-m commented May 15, 2024

@ggnaegi Thanks! I'm aware of all the references to the 499 status in our code. Indeed, timeout events cancel requests and can cause some "spikes." However, in this instance, I'm puzzled by the issue reporting.

@kishan-vachhani, could you please take a screenshot of the entire page showing the spike and share it with us? What type of spikes are you experiencing? Additionally, please provide more details from your logs or the graphs from your monitoring tool.

@kishan-vachhani
Author


Could you point me to any code snippet where Ocelot forcibly cancels upstream or downstream requests on its own initiative? Yes, we utilize the CancellationToken from the HTTP request to propagate it from upstream to downstream. It is imperative that Ocelot forwards this token. Thus, if the upstream client cancels the request (as detected by the end of communication in HTTP 1.1+ protocol), then Ocelot is also required to cancel the downstream request(s). This may result in a surge of log entries. Beyond this, I am uncertain.

@raman-m I'm currently conducting further investigation to determine the source of the request cancellation. I understand that if the upstream request is cancelled, the downstream request should also be cancelled. This could potentially result in a spike in log entries. What concerns me is that I've noticed a consistent spike in log entries following the deployment of the Ocelot version upgrade (on 05/08/2024). Please refer to the image below.

[screenshot: graph of 499 responses spiking after the 05/08/2024 deployment]


Do you think that if the downstream request is cancelled, the service still returns a body that we need to relay back upstream? Interesting... Why do you need this response? What will you do with the technical data in the body? Isn't a 499 status code sufficient for the upstream client to make decisions? @ggnaegi Gui, do we send back the response/body if the downstream request was cancelled by a CancellationToken?

I agree with you that if the downstream request is cancelled, its response shouldn't be relayed in the upstream response. Here, I was trying to convey that requests shouldn't be cancelled unless it's done manually or due to a timeout.


Cancelled requests can be replicated easily through page reloading from a browser. However, regarding webhooks, some systems may cancel an ongoing webhook request if there's a new state or it's re-triggered. Which specific webhooks are in question? Do you know the cause behind the cancellation of an active webhook?

Yes, refreshing the browser or closing the tab while a request is executing will cancel it. However, it's concerning that we're observing cancellations in production for routes (not only webhook ones) where such actions, like refreshing the browser or closing the tab, are very unlikely. This behavior seems unexpected and imo requires further investigation.


It's unclear. Have you attempted deploying Ocelot versions prior to 23.0.0? What were the outcomes? Did you observe similar spikes in the logs?

Theoretically, the new changes to the Ocelot kernel in v23.0.0 could affect webhook behavior, but further investigation is required. @ggnaegi, what is your perspective?

It appears that after upgrading to the latest version of Ocelot, we've observed a significant increase in the occurrences of the 499 response code, as shown in the first attached image. This notable change prompted me to delve deeper into understanding the root cause behind this surge, especially considering that I was using a lower version of Ocelot previously.


Currently, we cannot determine the root cause as we do not oversee your environment. However, we can collaborate on brainstorming, and collectively, we can suggest the subsequent steps for identification.

Certainly, I grasp your perspective. To aid in your comprehension, I've attached a screenshot containing all the logs pertaining to a single request that resulted in a 499 response. I'm seeking collaborative efforts to identify and rectify this issue (if it really is one). In the meantime, could you please provide guidance on potential methods to pinpoint the source of cancellation? One notable change I've observed is the shift from utilizing the HTTP Client's Timeout property to employing the TimeoutDelegatingHandler in combination with the CancellationToken.

[screenshot: logs for a single request that resulted in a 499 response]


It's commonly believed that software built using SaaS or SOA architectures invariably encounters "spikes" problems in graph logs. 😄

Yeah true 😄

@ggnaegi
Member

ggnaegi commented May 15, 2024

@kishan-vachhani @raman-m Ok I will compare the Timeout in HttpClient with our custom Timeout Delegating Handler. What would be great is to identify a scenario that we could reproduce.

@raman-m
Member

raman-m commented May 16, 2024

@kishan-vachhani Do you use QoS feature for the routes?

that if the upstream request is cancelled, the downstream request should also be cancelled. This could potentially result in a spike in log entries. What concerns me is that I've noticed a consistent spike in log entries following the deployment of the Ocelot version upgrade (on 05/08/2024). Please refer to the image below.

[screenshot: graph of 499 responses]

I'm confused by this graph. What does the Y-axis represent? Is it the number of 499 status codes, or is it the count of log entries?
How can we be sure this is a graph of the monitored Ocelot instance?

Could you attach (copy-paste) all content of your ocelot.json please?
We need to look at your configuration.
Do you have any custom setup for Ocelot: delegating handlers, middleware overrides, service replacements in DI?

@ggnaegi
Member

ggnaegi commented May 16, 2024

@kishan-vachhani @raman-m Ok I will compare the Timeout in HttpClient with our custom Timeout Delegating Handler. What would be great is to identify a scenario that we could reproduce.

I can't see major differences between the timeout logic in HttpClient and the delegating handler we have implemented.
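For comparison, plain HttpClient timeout behavior looks like this (a sketch independent of Ocelot; the URL is a placeholder, and the InnerException detail applies to .NET 5 and later):

using System;
using System.Net.Http;
using System.Threading.Tasks;

// When HttpClient.Timeout elapses, the request faults with a TaskCanceledException
// (wrapping a TimeoutException on .NET 5+), i.e. the same cancellation-based mechanics
// as the TimeoutDelegatingHandler.
using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(90) };
try
{
    using var response = await client.GetAsync("https://downstream-host/slow-endpoint");
}
catch (TaskCanceledException ex) when (ex.InnerException is TimeoutException)
{
    Console.WriteLine("Request timed out (not cancelled by the caller).");
}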

@raman-m
Member

raman-m commented May 16, 2024

@ggnaegi commented on May 15:

@kishan-vachhani @raman-m Ok I will compare the Timeout in HttpClient with our custom Timeout Delegating Handler.

Gui, is this the logic you're referring to? 👉

// Adding timeout handler to the top of the chain.
// It's standard behavior to throw TimeoutException after the defined timeout (90 seconds by default)
var timeoutHandler = new TimeoutDelegatingHandler(downstreamRoute.QosOptions.TimeoutValue == 0
        ? TimeSpan.FromSeconds(RequestTimeoutSeconds)
        : TimeSpan.FromMilliseconds(downstreamRoute.QosOptions.TimeoutValue))
{
    InnerHandler = baseHandler,
};

🆗... Here's my understanding of the reported "spikes" issue:

  • The TimeoutDelegatingHandler is responsible for cancelling requests after the default 90 seconds.
  • The developer did not specify any timeouts for routes, so the default value of 90 seconds is used.
  • Webhooks are received by Ocelot and forwarded to downstream services (webhook receivers).
  • Downstream services may be offline, or the downstream system may use load balancing with services going offline/online.
  • In case of absent response, after 90 seconds, the Ocelot downstream request is cancelled, and a record is written into the log with a 499 status, correct?

Considering it a problem may not be necessary; it's not an issue with Ocelot itself, but rather incidents of no response from the downstream system, leading Ocelot to naturally cancel the requests. The absence of spikes before the deployment of v23.0.0 is because Ocelot did not generate the 499 status prior to this version, correct? Since the introduction of v23.0, Ocelot has been producing the 499 status, which the monitoring tool logs, resulting in the observed spikes. Bingo! 💥

@ggnaegi Is this the same conclusion you've reached?
It seems we are handling a user scenario where downstream requests are being swallowed, which is why Ocelot is cancelling them with a 499 status code.
This issue stems from the problem of overloaded webhook receivers. In my opinion, we should inquire with the author about which webhook tools or products are currently in use as deployed downstream services.

@kishan-vachhani
Author

@ggnaegi commented on May 15:

@kishan-vachhani @raman-m Ok I will compare the Timeout in HttpClient with our custom Timeout Delegating Handler.

Gui, is this the logic you're referring to? 👉

// Adding timeout handler to the top of the chain.
// It's standard behavior to throw TimeoutException after the defined timeout (90 seconds by default)
var timeoutHandler = new TimeoutDelegatingHandler(downstreamRoute.QosOptions.TimeoutValue == 0
        ? TimeSpan.FromSeconds(RequestTimeoutSeconds)
        : TimeSpan.FromMilliseconds(downstreamRoute.QosOptions.TimeoutValue))
{
    InnerHandler = baseHandler,
};

🆗... Here's my understanding of the reported "spikes" issue:

  • The TimeoutDelegatingHandler is responsible for cancelling requests after the default 90 seconds.
  • The developer did not specify any timeouts for routes, so the default value of 90 seconds is used.
  • Webhooks are received by Ocelot and forwarded to downstream services (webhook receivers).
  • Downstream services may be offline, or the downstream system may use load balancing with services going offline/online.
  • In case of absent response, after 90 seconds, the Ocelot downstream request is cancelled, and a record is written into the log with a 499 status, correct?

Considering it a problem may not be necessary; it's not an issue with Ocelot itself, but rather incidents of no response from the downstream system, leading Ocelot to naturally cancel the requests. The absence of spikes before the deployment of v23.0.0 is because Ocelot did not generate the 499 status prior to this version, correct? Since the introduction of v23.0, Ocelot has been producing the 499 status, which the monitoring tool logs, resulting in the observed spikes. Bingo! 💥

@ggnaegi Is this the same conclusion you've reached? It seems we are handling a user scenario where downstream requests are being swallowed, which is why Ocelot is cancelling them with a 499 status code. This issue stems from the problem of overloaded webhook receivers. In my opinion, we should inquire with the author about which webhook tools or products are currently in use as deployed downstream services.

@raman-m Yes, from the code, it appears that if no timeout is specified, the gateway will use the default timeout of 90 seconds. If the downstream application does not respond within this timeframe, it should throw an exception.

Since I haven't configured any Quality of Service (QoS) settings or specified a timeout, it defaults to 90 seconds. Moreover, the downstream application is operational. As evident from the screenshot of the single request trace provided earlier, the gateway responded with a 499 status code within 148.4331 milliseconds. This indicates that the response time was well within the default timeout period of 90 seconds (the same is true for all such requests).

Furthermore, with the introduction of new logic in the 23.0.0 release, a Timeout Error is returned with status code 503. It's worth noting that Ocelot did generate the 499 status prior to this version as well.

// here are mapped the exceptions thrown from Ocelot core application
if (type == typeof(TimeoutException))
{
    return new RequestTimedOutError(exception);
}

Also, according to the code snippet below, if the downstream application is taking too long to respond or is unavailable, cancellationToken.IsCancellationRequested should be false. This condition triggers a TimeoutException, resulting in a 503 response status code.

protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request,
    CancellationToken cancellationToken)
{
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    cts.CancelAfter(_timeout);
    try
    {
        return await base.SendAsync(request, cts.Token);
    }
    catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
    {
        throw new TimeoutException();
    }
}

IMO, something is triggering the cancellation token prematurely. 🤔
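To help pinpoint the source of cancellation, one option could be a small diagnostic delegating handler (hypothetical, not part of Ocelot) registered per route via "DelegatingHandlers": it records whether the upstream token was already cancelled when a downstream send faulted, which separates client aborts (499) from timeouts (503).

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical diagnostic handler: logs which token triggered the cancellation.
public class CancellationDiagnosticsHandler : DelegatingHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        try
        {
            return await base.SendAsync(request, cancellationToken);
        }
        catch (OperationCanceledException)
        {
            // true  -> the upstream request itself was aborted (Ocelot answers 499)
            // false -> another linked source (e.g. the timeout handler) cancelled the call
            Console.WriteLine($"{request.RequestUri}: upstream token cancelled = {cancellationToken.IsCancellationRequested}");
            throw;
        }
    }
}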

@ggnaegi
Member

ggnaegi commented May 16, 2024

@kishan-vachhani Could you give us some metrics about your environment, such as requests per second, etc.? From our side, it's very difficult to draw conclusions without more detailed observations. Besides, the changes were tested and rolled out on production environments (under very heavy load).

@kishan-vachhani
Author

@kishan-vachhani Do you use QoS feature for the routes?

that if the upstream request is cancelled, the downstream request should also be cancelled. This could potentially result in a spike in log entries. What concerns me is that I've noticed a consistent spike in log entries following the deployment of the Ocelot version upgrade (on 05/08/2024). Please refer to the image below.
[screenshot: graph of 499 responses]

I'm confused by this graph. What does the Y-axis represent? Is it the number of 499 status codes, or is it the count of log entries? How can we ensure this is graph of monitored Ocelot instance?

Could you attach (copy-paste) all content of your ocelot.json please? We need to look at your configuration. Do you have some custom setup for Ocelot: delegating handlers, middleware overridings, service replacement in DI?

@raman-m I'm not utilizing the Quality of Service (QoS) feature for any of my routes. The Y-axis of the graph represents the number of responses with 499 status codes, while the X-axis represents the timeline.

Unfortunately, I cannot share my ocelot.json file because it is a production configuration. However, I can provide the schema of the properties in use. I haven't overridden any middleware, but for certain routes I am employing a custom delegating handler. It's worth noting that the issue we are discussing affects routes both with and without the custom delegating handler.

{
	"UpstreamPathTemplate": "/upstream/route",
	"UpstreamHttpMethod": [
		"Get",
		"Options"
	],
	"DownstreamPathTemplate": "/downstream/route",
	"DownstreamScheme": "https",
	"DownstreamHostAndPorts": [
		{
			"Host": "downstream-host",
			"Port": 443
		}
	],
	"AuthenticationOptions": {
		"AuthenticationProviderKey": "Bearer",
		"AllowedScopes": [
			"route:read"
		]
	},
	"UpstreamHeaderTransform": {
		"X-Forwarded-Host": "abc.com"
	},
	"DelegatingHandlers": [
		"CustomDelegatingHandler"
	]
}

@kishan-vachhani
Author

@kishan-vachhani Could you give us some metrics about your environment? Such as Requests per second etc... From our side, it's very difficult to draw some conclusions without more detailled observations. Besides the changes were tested and rolled out on production environments (very heavy load).

@ggnaegi The issue I'm currently encountering in the production environment involves managing throughput, averaging 2.37k requests per minute (rpm) over the past 24 hours. During peak hours, this figure rises to 8k rpm.

@ggnaegi
Member

ggnaegi commented May 16, 2024

@kishan-vachhani Ok, the latest version is running on a production environment showing the following metrics, on average (24h): 650 requests per second, 39k requests/minute.

I checked the request: why do you have a 102 status code with an "unknown" reason? Maybe this is the cause of the cancellation?
https://evertpot.com/http/102-processing

... wait a minute... Why did we do that dear @raman-m?
171e3a7
102 Processing is for old WebDAV stuff; why are we showing that misleading message here?

private void CreateLogBasedOnResponse(Response<HttpResponseMessage> response)
{
    var status = response.Data?.StatusCode ?? HttpStatusCode.Processing;
    var reason = response.Data?.ReasonPhrase ?? "unknown";

@raman-m
Member

raman-m commented May 16, 2024

@kishan-vachhani Ok, the latest version is running on a production environment showing the following metrics, on average (24h): 650 requests per second, 39k requests/minute.

I checked the request, why do you have 102 status code, and it's unknown. Maybe this is the cause of the cancellation?
https://evertpot.com/http/102-processing

... wait a minute... Why did we do that dear @raman-m?
171e3a7
102 Processing is for old webdav stuff, why are we showing that misleading message here?

private void CreateLogBasedOnResponse(Response<HttpResponseMessage> response)
{
    var status = response.Data?.StatusCode ?? HttpStatusCode.Processing;
    var reason = response.Data?.ReasonPhrase ?? "unknown";

@ggnaegi,
How is the code you researched related to the author's 499 spike problem?
I don't see any relationship!
Also, I don't see a problem with logging warnings when the status is >= 400 in #1953. This is not error reporting, it is warning reporting. The author must increase the logging level from Warning to Error and all spikes will disappear (see the sketch below).
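For example, with the standard Microsoft.Extensions.Logging pipeline a category filter would do it (a sketch, assuming the minimal hosting model):

using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);

// Raise the minimum level for all Ocelot.* log categories so 4xx warnings
// no longer show up as "spikes" in the monitoring tool.
builder.Logging.AddFilter("Ocelot", LogLevel.Error);

// ... rest of the gateway setup (AddOcelot, etc.) unchanged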

Do you want to discuss #1953 changes or do you want to find root cause of the reported issue?
I'm a bit tired today to discuss this issue...

@ggnaegi
Member

ggnaegi commented May 16, 2024

@raman-m I was looking for the error and then this 102 status code popped up. It is not the real status; why would we write a message with a status code that is not correct? It's only a symptom. We might indeed have a threading issue somewhere...

As a matter of fact, MessageInvoker.SendAsync is thread safe, but yes, we might have a problem with some delegating handlers, and @kishan-vachhani it could be your delegating handler too... I will check the timeout delegating handler again.

After a short review, the design of the Timeout Handler is imo thread-safe:

  • the timeout field is readonly, so immutable
  • the CancellationTokenSource object is only used within the SendAsync method and then disposed
  • and again, it would throw a TimeoutException

... Further investigations tomorrow...

@RaynaldM
Collaborator

@kishan-vachhani Do you use QoS feature for the routes?

that if the upstream request is cancelled, the downstream request should also be cancelled. This could potentially result in a spike in log entries. What concerns me is that I've noticed a consistent spike in log entries following the deployment of the Ocelot version upgrade (on 05/08/2024). Please refer to the image below.
[screenshot: graph of 499 responses]

I'm confused by this graph. What does the Y-axis represent? Is it the number of 499 status codes, or is it the count of log entries? How can we ensure this is graph of monitored Ocelot instance?
Could you attach (copy-paste) all content of your ocelot.json please? We need to look at your configuration. Do you have some custom setup for Ocelot: delegating handlers, middleware overridings, service replacement in DI?

@raman-m I'm not utilizing the Quality of Service (QoS) feature for any of my routes. The Y-axis of the graph represents the number of responses with 499 status codes, while the X-axis represents the timeline.

Unfortunately, I cannot share my ocelot.json file due to its production status. However, I can provide the schema of the properties in use. I haven't overridden any middleware, but for certain routes, I am employing a custom delegating handler. It's worth noting that the issue we are discussing is affecting routes both with and without the custom delegation handler.

{
	"UpstreamPathTemplate": "/upstream/route",
	"UpstreamHttpMethod": [
		"Get",
		"Options"
	],
	"DownstreamPathTemplate": "/downstream/route",
	"DownstreamScheme": "https",
	"DownstreamHostAndPorts": [
		{
			"Host": "downstream-host",
			"Port": 443
		}
	],
	"AuthenticationOptions": {
		"AuthenticationProviderKey": "Bearer",
		"AllowedScopes": [
			"route:read"
		]
	},
	"UpstreamHeaderTransform": {
		"X-Forwarded-Host": "abc.com"
	},
	"DelegatingHandlers": [
		"CustomDelegatingHandler"
	]
}

The "rustic" way of managing the timout without QoS is, I think, the source of your problems (we have several open issues on the subject, it should at least be configurable).
If you use QoS, you won't have those 499.
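For reference, enabling QoS is roughly this (a sketch, assuming the Ocelot.Provider.Polly package); a route then defines a "QoSOptions" section whose TimeoutValue (in milliseconds) replaces the fixed 90-second default used by the timeout handler quoted earlier:

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Ocelot.DependencyInjection;
using Ocelot.Provider.Polly;

public static class GatewaySetup
{
    // AddPolly() wires up the QoS provider (circuit breaker + timeout); routes without
    // QoSOptions keep falling back to the default timeout handler.
    public static void ConfigureGateway(IServiceCollection services, IConfiguration configuration)
    {
        services.AddOcelot(configuration)
                .AddPolly();
    }
}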

@RaynaldM
Collaborator

@kishan-vachhani Ok, the latest version is running on a production environment showing the following metrics, on average (24h): 650 requests per second, 39k requests/minute.

I checked the request, why do you have 102 status code, and it's unknown. Maybe this is the cause of the cancellation? https://evertpot.com/http/102-processing

... wait a minute... Why did we do that dear @raman-m? 171e3a7 102 Processing is for old webdav stuff, why are we showing that misleading message here?

private void CreateLogBasedOnResponse(Response<HttpResponseMessage> response)
{
    var status = response.Data?.StatusCode ?? HttpStatusCode.Processing;
    var reason = response.Data?.ReasonPhrase ?? "unknown";

@ggnaegi I can confirm what I told you yesterday: no 499 in the last 48 hours (I can't go back any further).

@ggnaegi
Member

ggnaegi commented May 17, 2024

@RaynaldM Thanks a lot!

@ggnaegi
Member

ggnaegi commented May 17, 2024

The "rustic" way of managing the timout without QoS is, I think, the source of your problems (we have several open issues on the subject, it should at least be configurable). If you use QoS, you won't have those 499.

@raman-m @RaynaldM Maybe we should move the default timeout to the QoS and provide some global parameters to it.

@RaynaldM
Collaborator

Maybe we should move the default timeout to the QoS and provide some global parameters to it.

I don't think so, they're 2 very different systems.
The standard timeout has very basic mechanics, but that may be enough for some. It just needs to be configurable globally, or even per endpoint.
But to work with a wide variety of APIs, it's better to use QoS, which is much more robust in production (the heterogeneity of response times for certain APIs is the problem).

@ggnaegi
Member

ggnaegi commented May 17, 2024

@RaynaldM Ok, but we could use a default Polly implementation and, as soon as QoS parameters are defined, use the QoS... We wouldn't have the timeout as a delegating handler, and we would avoid these discussions with colleagues using the solution.

@ggnaegi
Member

ggnaegi commented May 17, 2024

But @kishan-vachhani I'm quite sure the delegating handler is thread safe though...

@kishan-vachhani
Author

@raman-m I was looking for the error and then this 102 status code popped up. This is not the truth, why would you write a message with a status code that is not correct? It's only a symptom. We might have indeed a threading issue somewhere...

As a matter of fact, MessageInvoker.SendAsync is thread safe, but yes, we might have a problem with some delegating handlers, and @kishan-vachhani it could be your delegating handler too... I will check the timeout delegating handler again.

After a short review, the design of the Timeout Handler is imo thread-safe: the timeout field is readonly, so immutable; the CancellationTokenSource object is only used within the SendAsync method and then disposed; and again, it would throw a TimeoutException.

... Further investigations tomorrow...

I also think there could be a threading issue causing request cancellation due to a race condition. I've reviewed my custom delegating handler, but had no luck. However, this pattern of 499 status codes persists for routes without a custom delegating handler as well.

@RaynaldM @ggnaegi The issue is not caused by individual requests hitting the 90-second timeout threshold, so setting up QoS may not be helpful. If I am mistaken, please let me know.

@raman-m
Member

raman-m commented May 17, 2024

I've had enough of this debate! Currently, I perceive no problems with Ocelot.
Consider shifting this to a discussion format where it might be more engaging.

@kishan-vachhani, I encourage you to partake in the discussion more light-heartedly.
Your theoretical conclusions and speculations do not captivate me.
Cease causing distress to my team!
You are obliged to demonstrate that there is indeed a bug in Ocelot.

@ThreeMammals ThreeMammals locked and limited conversation to collaborators May 17, 2024
@raman-m raman-m converted this issue into discussion #2072 May 17, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
