Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the loadbalancers servers order random #9037

Merged
merged 8 commits into from Sep 14, 2022

Conversation

qmloong
Copy link
Contributor

@qmloong qmloong commented May 23, 2022

What does this PR do?

Fix the following issue:

TCP/UDP WRR will always pick the first server when setting up.
This may cause load unbalance when
we use traefik as forward tcp proxy and the upper application use tcp keep-alive connection.

This is a classic LB issue, the issue is that the first index servers will receive larger traffic during startup.

For example, I have, 2 clients (C), 3 traefik (T) and 3 upstream server (U).

Clients(1,2) → Traefik(1,2,3) → Upstream Severs)

1: C1 → T1 → U1

2: C2 → T2 → U1

3: C3 → T3 → U1

Motivation

More

  • Added/updated tests
  • Added/updated documentation

Additional Notes

This PR might mitigate #8275.

Co-authored-by: Mathieu Lonjaret mathieu.lonjaret@gmail.com
Co-authored-by: Romain rtribotte@users.noreply.github.com

@rtribotte
Copy link
Member

Hello @qmloong,

Thanks for your contribution!

At first glance, we think that the index shouldn't be leveraged to introduce entropy in server selection.
We would prefer to insert a server in the list at a random position when it is added, instead of appending it to the list.

Do you want to take care of it, or shall we do it?

@qmloong
Copy link
Contributor Author

qmloong commented May 23, 2022

We would prefer to insert a server in the list at a random position when it is added, instead of appending it to the list.

In this way, the user experience is not very good for a large number of traefik to do forward proxy clusters...Especially automated configuration or deployment for traefik.

we think that the index shouldn't be leveraged to introduce entropy in server selection.

BTW, what's wrong with doing this?

Refer to:
https://github.com/grpc/grpc-go/blob/30b9d59a766858a4b51148e47edb3af2766ab617/balancer/roundrobin/roundrobin.go#L63

@rtribotte

@rtribotte
Copy link
Member

rtribotte commented May 23, 2022

In this way, the user experience is not very good for a large number of traefik to do forward proxy clusters...Especially automated configuration or deployment for traefik.

The server list is the result of applying the dynamic configuration, so it doesn't affect the configuration itself. What do you mean by "not very good user experience" in that case?

BTW, what's wrong with doing this?

The problem lies in those lines:

b.index = (b.index + 1) % len(b.servers)
if b.index == 0 {
b.currentWeight -= gcd
if b.currentWeight <= 0 {
b.currentWeight = max
}
}

The index value -1 is "used" to set the value of the current weight on the first iteration. Changing the index value will break that condition on the first iteration when the random value is different from -1, and allow the selection of any server of any weight. Leveraging the index value to add entropy brings additional complexity compared to randomly changing the order of servers in the list.

@qmloong
Copy link
Contributor Author

qmloong commented May 23, 2022

What do you mean by "not very good user experience" in that case?

For instance, we use traefik as the forward proxy for kubelet (1 : 1) as local proxy to access to 3 kube-apiserver instance A, B, C
If my cluster has 5k nodes(in other words, we have 5k traefik), then I need make 5k dynamic configurations of traefik in different order.. It's not so good right? If traefik picks server in random, so we can deploy traefik in the same dynamic configuration in ansible playbook or any othor tools.

The index value -1 is "used" to set the value of the current weight on the first iteration. Changing the index value will break that condition on the first iteration when the random value is different from -1, and allow the selection of any server of any weight.

I had known this issure.. I just ignored the first iteration. But I can fix it if the above scenario is reasonable.

@rtribotte
Copy link
Member

For instance, we use traefik as the forward proxy for kubelet (1 : 1) as local proxy to access to 3 kube-apiserver instance A, B, C
If my cluster has 5k nodes(in other words, we have 5k traefik), then I need make 5k dynamic configurations of traefik in different order.. It's not so good right? If traefik picks server in random, so we can deploy traefik in the same dynamic configuration in ansible playbook or any othor tools.

Computing the random index will also happen on every Traefik instance independently, so the discrepancy is expected, as well as having the list of servers in a different order on each instance.
Thus, if you are raising concerns regarding performances, the impact will not be on the "critical path" (during the routing), and then, adding the server to the list at a random index shouldn't be a concern.

I had known this issure.. I just ignored the first iteration. But I can fix it if the above scenario is reasonable.

At this point, we're not sure we're making ourselves clear. So maybe we should add the commit ourselves to the PR so you can understand/see better? Is that ok for you?

Copy link
Member

@rtribotte rtribotte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 👍

@rtribotte rtribotte changed the title feat: use rand index for the first time picking in tcp/udp wrr Make the loadbalancers servers order random Sep 2, 2022
pkg/server/service/service.go Outdated Show resolved Hide resolved
integration/retry_test.go Outdated Show resolved Hide resolved
integration/retry_test.go Outdated Show resolved Hide resolved
pkg/server/service/service.go Outdated Show resolved Hide resolved
pkg/server/service/tcp/service.go Outdated Show resolved Hide resolved
integration/retry_test.go Outdated Show resolved Hide resolved
@ldez ldez removed this from the 2.9 milestone Sep 13, 2022
@ldez ldez force-pushed the bugfix/wrr-first-index branch 2 times, most recently from f29107f to f5548b5 Compare September 14, 2022 09:48
@ldez ldez added this to the 2.9 milestone Sep 14, 2022
Copy link
Member

@kevinpollet kevinpollet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@traefiker traefiker merged commit 788f8fa into traefik:master Sep 14, 2022
v2 automation moved this from To review to Done Sep 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
v2
Done
Development

Successfully merging this pull request may close these issues.

None yet

8 participants