Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A80: gRPC Metrics for TCP connection #428

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

nanahpang
Copy link

@nanahpang nanahpang commented Apr 22, 2024

Add the proposal of gRPC TCP per-connection metrics.

Copy link
Contributor

@yousukseung yousukseung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to me as discussed offline, I'll leave others to approve.


| Name | Disposition | Description |
| ----------- | ----------- | ----------- |
| grpc.tcp.remote_peer_address | optional | Store the peer address info in the format as `ip:port`. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markdroth @yashykt I feel like there's an equivalent label already in use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a relatively expensive fan-out for a busy server

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, we discussed about this offline. Is there a way to support an opaque user-defined peer info string? In prod, by default we only record the cell information due to the fan-out concern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to figure out the right format here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose URI form, such as ipv4:1.2.3.4:567. @ejona86 @dfawley, would that be a problem for Java and Go if we ever added these metrics in your languages?


| Name | Type | Unit | Labels | Description |
| ------------- | ----- | ----- | ------- | ----------- |
| grpc.tcp.min_rtt | Distribution | s | grpc.tcp.remote_peer_string | Reports TCP's current estimate of minimum round trip time (RTT), typically used as an indication of the network health between two endpoints. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yashykt how well does seconds here align with other metrics we have?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are following OpenTelemetry's conventions mostly which advocates for seconds 's' as the unit for duration

| grpc.tcp.delivery_rate | Distribution | bit/s | grpc.tcp.remote_peer_string | Records the most recent non-app-limited throughput at the time that Fathom samples the connection statistics. |
| grpc.tcp.packets_sent | Counter | {packet} | grpc.tcp.remote_peer_string | Records total packets TCP sends in the calculation period. |
| grpc.tcp.packets_retransmitted | Counter | {packet} | grpc.tcp.remote_peer_string | Records total packets lost in the calculation period, including lost or spuriously retransmitted packets. |
| grpc.tcp.packets_spurious_retransmitted | Counter | {packet} | grpc.tcp.remote_peer_string | Records total packets spuriously retransmitted packets in the calculation period. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what spurious means in this context... maybe we could link to some write-up, or include some more text defining these things?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not find an authoritative source we can link, but I think we can include what's on go/fathom-metrics i.e. "These are retransmissions that TCP later discovered unnecessary.". The definition is public.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I will add that in the description.


| Name | Type | Unit | Labels | Description |
| ------------- | ----- | ----- | ------- | ----------- |
| grpc.tcp.min_rtt | Distribution | s | grpc.tcp.remote_peer_string | Reports TCP's current estimate of minimum round trip time (RTT), typically used as an indication of the network health between two endpoints. |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of remarks on this:

  1. Do you know how portable this is going to be? It would help to explain how you plan to collect that data. Do you plan to use getsockopt + TCP_INFO that yields a min_rtt in some cases? Even on linux, I'm not sure all TCP algorithms measure the minimum RTT.
  2. What is a Distribution type? Do you mean a histogram?
  3. Since this is a property of the transport, this probably belongs under grpc.transport.tcp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think min_rtt is general enough that we can include, I presume most congestion control tracks RTT or equivalent to keep track of the estimated BDP.

And I think this gRFC is supposed to cover the definition of the metric and not about the collection method. @yashykt Yes, in Linux we can get it via either getsockopt(TCP_INFO) or TCP_NLA_MIN_RTT from cmsg when timestamping is enabled.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think min_rtt is general enough that we can include, I presume most congestion control tracks RTT or equivalent to keep track of the estimated BDP.

OK, and even if it weren't the case, there is probably no problem with omitting it.

And I think this gRFC is supposed to cover the definition of the metric and not about the collection method.

OK, yes. My point is more that the definition needs to be precise enough so that different implementations measure the same thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if it's important for the correctness of the metrics, it makes sense to give more information on this metric. Maybe you could say that a valid implementation of this is through getsockopt?

Copy link
Author

@nanahpang nanahpang May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a summary of how the metrics are collected and linked some references. Let me know if anything is unclear. Thanks.

| Name | Type | Unit | Labels | Description |
| ------------- | ----- | ----- | ------- | ----------- |
| grpc.tcp.min_rtt | Distribution | s | grpc.tcp.remote_peer_string | Reports TCP's current estimate of minimum round trip time (RTT), typically used as an indication of the network health between two endpoints. |
| grpc.tcp.delivery_rate | Distribution | bit/s | grpc.tcp.remote_peer_string | Records the most recent non-app-limited throughput at the time that Fathom samples the connection statistics. |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is Fathom? A quick internet search yields https://dl.acm.org/doi/pdf/10.1145/3603269.3604815 which sounds related (but I have not read it)... Do you think anything that uses it is going to useful to anyone outside Google?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, and app-limitedness is also not a general property either. I think we should update to something like "Latest throughput measured of the TCP connection."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, updated.

A80-grpc-metrics-for-tcp-connection Outdated Show resolved Hide resolved

| Name | Disposition | Description |
| ----------- | ----------- | ----------- |
| grpc.tcp.remote_peer_address | optional | Store the peer address info in the format as `ip:port`. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet


| Name | Type | Unit | Labels | Description |
| ------------- | ----- | ----- | ------- | ----------- |
| grpc.tcp.min_rtt | Distribution | s | grpc.tcp.remote_peer_string | Records TCP's current estimate of minimum round trip time (RTT), typically used as an indication of the network health between two endpoints. |
Copy link
Member

@yashykt yashykt Apr 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe we should add the remote_string and maybe the local_string as optional labels on these metrics

Copy link
Author

@nanahpang nanahpang Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, updated.


| Name | Type | Unit | Labels | Description |
| ------------- | ----- | ----- | ------- | ----------- |
| grpc.tcp.min_rtt | Distribution | s | grpc.tcp.remote_peer_string | Reports TCP's current estimate of minimum round trip time (RTT), typically used as an indication of the network health between two endpoints. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are following OpenTelemetry's conventions mostly which advocates for seconds 's' as the unit for duration


| Name | Type | Unit | Labels | Description |
| ------------- | ----- | ----- | ------- | ----------- |
| grpc.tcp.min_rtt | Distribution | s | grpc.tcp.remote_peer_string | Reports TCP's current estimate of minimum round trip time (RTT), typically used as an indication of the network health between two endpoints. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if it's important for the correctness of the metrics, it makes sense to give more information on this metric. Maybe you could say that a valid implementation of this is through getsockopt?

Copy link
Member

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(approved by mistake)


| Name | Disposition | Description |
| ----------- | ----------- | ----------- |
| grpc.tcp.remote_peer_address | optional | Store the peer address info in the format as `ip:port`. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose URI form, such as ipv4:1.2.3.4:567. @ejona86 @dfawley, would that be a problem for Java and Go if we ever added these metrics in your languages?

A80-grpc-metrics-for-tcp-connection Outdated Show resolved Hide resolved
A80-grpc-metrics-for-tcp-connection Outdated Show resolved Hide resolved
A80-grpc-metrics-for-tcp-connection Outdated Show resolved Hide resolved
@markdroth markdroth changed the title Create A80-grpc-metrics-for-tcp-connection A80: gRPC Metrics for TCP connection Apr 29, 2024
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need review from @ejona86 and @dfawley to make sure these definitions are still okay in terms of potential future implementation in other languages.

To improve the network debugging capabilities for gRPC users, we propose adding per-connection TCP metrics in gRPC. The metrics will utilize the metrics framework outlined in [A79].

### Related Proposals:
* [A79]: gRPC Non-Per-Call Metrics Framework (pending)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gRFC has been merged, so please remove the "(pending)" and change the lin to point to A79-non-per-call-metrics-architecture.md.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, updated.

| ------------- | ----- | ----- | ------- | ----------- |
| grpc.tcp.min_rtt | Histogram (double) | s | grpc.tcp.peer_address, grpc.tcp.local_address | Records TCP's current estimate of minimum round trip time (RTT), typically used as an indication of the network health between two endpoints. |
| grpc.tcp.delivery_rate | Histogram (double) | bit/s | grpc.tcp.peer_address, grpc.tcp.local_address | Records latest throughput measured of the TCP connection. |
| grpc.tcp.packets_sent | Counter (int64) | {packet} | grpc.tcp.peer_address, grpc.tcp.local_address | Records total packets TCP sends in the calculation period. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yashykt What types should we be defining here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the Counter type to uint64 based on the register methods in metrics.h.
@yashykt Feel free to modify if it will use other types.

@nanahpang nanahpang requested review from ejona86 and dfawley and removed request for atollena May 10, 2024 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants