
Introduce HTTP2 health check transport options #9022

Merged: 1 commit merged into open-telemetry:main on Dec 8, 2023

Conversation

@dloucasfx (Contributor) commented Nov 30, 2023

Description:
This PR introduces options to enable the HTTP/2 health check by exposing `HTTP2ReadIdleTimeout` and `HTTP2PingTimeout`.

The golang issues are:
golang/go#59690
golang/go#36026

In summary: if, due to an environmental issue, the underlying TCP connection used by the exporter's HTTP/2 client becomes unstable/unusable/unreachable, then unlike HTTP/1, the HTTP/2 client does not forcibly close the connection and redial a new one. Instead it keeps using the broken connection for ~15 minutes (the default value of the OS `tcp_retries2` setting) until the OS cleans it up and a new TCP connection gets established.
From an OTEL user's perspective, this appears as a spike in export failures/timeouts for ~15 minutes. This happens for every connection that gets into a bad state; after 15 minutes things recover, until the next time a TCP connection goes bad.
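
For context, the new options correspond to the `ReadIdleTimeout` and `PingTimeout` fields of `golang.org/x/net/http2.Transport`. Below is a minimal sketch (illustrative only, not code from this PR; the 10-second values are examples) of wiring those knobs up on a plain Go client:

```go
package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newHTTP2Client returns an *http.Client whose HTTP/2 transport pings the
// server after 10s without receiving any frames, and drops the connection
// (forcing a redial) if the ping is not acknowledged within another 10s.
func newHTTP2Client() (*http.Client, error) {
	t1 := &http.Transport{ForceAttemptHTTP2: true}
	t2, err := http2.ConfigureTransports(t1) // returns the http2.Transport wrapping t1
	if err != nil {
		return nil, err
	}
	t2.ReadIdleTimeout = 10 * time.Second // health check kicks in after 10s of read inactivity
	t2.PingTimeout = 10 * time.Second     // close the connection if the ping ack doesn't arrive in 10s
	return &http.Client{Transport: t1}, nil
}
```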

Testing:

  • Run OTEL with one of the exporters that uses an HTTP/2 client, for example the signalfx exporter
  • For simplicity, use a single pipeline/exporter
  • In a different shell, run this to watch the TCP state of the established connection:
```
while true; do date; sudo netstat -anp | grep -E '<endpoint_ip_address(es)>' | sort -k 5; sleep 2; done
```
  • From the netstat output, take note of the source port and the source IP address
  • Replace the `<>` placeholders with the values from the previous step and drop the outgoing packets:
```
sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP
```
  • Note how the OTEL exporter's exports start timing out

Expected Result:

  • A new connection should be established, similarly to HTTP/1, and exports should succeed

Actual Result:

  • The exports keep failing for ~15 minutes, or for whatever the OS tcp_retries2 is configured to
  • After ~15 minutes, a new TCP connection is created and exports start working

Documentation:
The README is updated with the new settings.
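
With this change, exporters built on `confighttp` can set the two new fields on their HTTP client settings. A rough sketch of how an exporter might opt in (illustrative only: the endpoint is a placeholder, the struct name assumes the `confighttp.HTTPClientSettings` type in use at the time of this PR, and only the two `HTTP2*` field names come from this PR):

```go
package main

import (
	"time"

	"go.opentelemetry.io/collector/config/confighttp"
)

// exampleClientSettings enables the HTTP/2 health check on an exporter's
// HTTP client. Leaving the two fields at their zero values keeps the
// health check disabled.
func exampleClientSettings() confighttp.HTTPClientSettings {
	return confighttp.HTTPClientSettings{
		Endpoint:             "https://ingest.example.com", // placeholder endpoint
		HTTP2ReadIdleTimeout: 10 * time.Second,             // ping after 10s with no frames received
		HTTP2PingTimeout:     10 * time.Second,             // redial if the ping ack takes longer than 10s
	}
}
```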

codecov bot commented Nov 30, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (111037d) 91.57% compared to head (4718e11) 91.54%.
Report is 2 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| config/confighttp/confighttp.go | 57.14% | 2 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #9022      +/-   ##
==========================================
- Coverage   91.57%   91.54%   -0.04%
==========================================
  Files         316      316
  Lines       17147    17154       +7
==========================================
+ Hits        15702    15703       +1
- Misses       1150     1154       +4
- Partials      295      297       +2
```


Review thread on .chloggen/http2ping.yaml (outdated, resolved)
Signed-off-by: Dani Louca <dlouca@splunk.com>
@dmitryax (Member) left a comment:

LGTM

@dmitryax dmitryax merged commit f9a38b2 into open-telemetry:main Dec 8, 2023
31 of 32 checks passed
@github-actions github-actions bot added this to the next release milestone Dec 8, 2023
pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this pull request Dec 8, 2023
codeboten pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Dec 11, 2023
dmitryax pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Dec 11, 2023
**Description:**
This PR enables the HTTP/2 health check to work around the issue described in
open-telemetry/opentelemetry-collector#9022.

As to why I chose 10 seconds for `HTTP2ReadIdleTimeout` and ~~5
seconds~~ 10 seconds (see review comment) for `HTTP2PingTimeout`: those
values have been tested in production, and in an active environment (with
the default HTTP timeout of 10 seconds and default retry settings) they
result in at most a single export failure before the health check detects
the corrupted TCP connection and closes it. With these settings, a
connection is pinged after 10 seconds of read inactivity and closed another
10 seconds later if the ping goes unacknowledged, so a dead connection is
replaced within roughly 20 seconds instead of ~15 minutes.
The only drawback is that if the connection was not used for over 10
seconds, we might end up sending unnecessary ping frames. That should not
be an issue, and if it becomes one, we can tune these settings.

The SFX exporter has multiple HTTP clients:
- Metric client, Trace client, and Event client. These clients will have
the HTTP/2 health check enabled by default, as they share the same default
config.
- The Correlation client and Dimension client will NOT have the HTTP/2
health check enabled. We can revisit this if needed.

**Testing:** 
- Run OTEL with one of the exporters that uses an HTTP/2 client, for
example the `signalfx` exporter
- For simplicity, use a single pipeline/exporter
- In a different shell, run this to watch the TCP state of the
established connection:
```
while true; do date; sudo netstat -anp | grep -E '<endpoint_ip_address(es)>' | sort -k 5; sleep 2; done
```
- From the netstat output, take note of the source port and the source IP address
- Replace the `<>` placeholders with the values from the previous step and drop the outgoing packets:
`sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP`
- Note how the OTEL exporter's exports start timing out

Expected Result:
- A new connection should be established, similarly to HTTP/1, and exports should succeed

Actual Result:
- The exports keep failing for ~15 minutes, or for whatever the OS `tcp_retries2` is configured to
- After ~15 minutes, a new TCP connection is created and exports start working

**Documentation:**
Readme is updated.

**Disclaimer:**
Not all HTTP/2 servers support H2 ping; however, this should not be a concern, as our ingest servers do support H2 ping.
If you are routing to a different backend, you can check whether H2 ping is supported using the script in golang/go#60818 (comment).
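
For convenience, here is a rough standalone sketch in the same spirit as that script (the endpoint is a placeholder; `ClientConn.Ping` sends an HTTP/2 PING frame and waits for the ack):

```go
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	// Dial with ALPN "h2" so the server negotiates HTTP/2.
	conn, err := tls.Dial("tcp", "ingest.example.com:443", &tls.Config{
		NextProtos: []string{"h2"},
	})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Wrap the TLS connection in an HTTP/2 client connection.
	cc, err := (&http2.Transport{}).NewClientConn(conn)
	if err != nil {
		panic(err)
	}

	// Send a PING frame and wait up to 5s for the ack.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := cc.Ping(ctx); err != nil {
		fmt.Println("H2 ping failed (server may not support it):", err)
		return
	}
	fmt.Println("H2 ping acknowledged")
}
```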

Signed-off-by: Dani Louca <dlouca@splunk.com>