
Allow to send multiple log messages in a single HTTP request #332

Open
bgoerzig opened this issue Jun 21, 2023 · 13 comments

Comments

@bgoerzig

bgoerzig commented Jun 21, 2023

Problem Statement

The maximum throughput for https-based syslog drains is limited by the message roundtrip time:

max_logs_per_second = 1 / message_roundtrip_time_in_seconds.

This is a consequence of the syslog agent using one connection per drain and sending HTTP requests sequentially.
It limits the throughput especially in high-latency scenarios. Messages are buffered if the message rate exceeds the throughput; once the syslog agent's buffer is full, messages are dropped.
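
For example, assuming a round-trip time of 5 ms between the agent and the drain endpoint, the ceiling is 1 / 0.005 s = 200 log messages per second per drain, regardless of how small the individual messages are.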

Proposal

To increase throughput and avoid message loss, we propose introducing an option that allows the agent to batch multiple log messages into a single HTTP request.

Pull Request by @nicklas-dohrn: #491

@bgoerzig @juergen-walter @nicklas-dohrn

@bgoerzig
Author

bgoerzig commented Jun 21, 2023

We are open to contributing this feature if our proposal is accepted. Since syslog RFC 5424 doesn't specify a batched format, we would have to define a structure that is accepted by the community. From our point of view, the easiest implementation would be to use multi-line batches.
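
For illustration, a multi-line batch could look like the following hypothetical HTTP request: two RFC 5424-formatted messages in one body, separated by newlines (host, app ID, and payloads are made up; the exact framing would be part of the proposal):

```
POST /drain HTTP/1.1
Host: logs.example.com
Content-Type: text/plain

<14>1 2023-06-21T12:00:00.000000+00:00 example-host app-id [APP/PROC/WEB/0] - - first log line
<14>1 2023-06-21T12:00:00.100000+00:00 example-host app-id [APP/PROC/WEB/0] - - second log line
```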

@ctlong ctlong moved this from Reviewer Assigned to Inbox in DEPRECATED App Platform - Logging and Metrics Jun 21, 2023
@ctlong
Member

ctlong commented Jun 21, 2023

Hi @bgoerzig, thanks for taking the time to write up this proposal. I think https is the least optimal of the schemes that Syslog Agent offers for streaming syslog messages. Because of that we typically encourage use of the other schemes. May I ask why you're interested in using this scheme in particular?

My first take is that we would be open to a PR to change the https scheme to batch messages, as long as the change is feature-flagged. AFAIK there's no RFC for syslog over HTTPS, so our implementation of it is already special and there's nothing stopping us from making it more special. Feature-flagging the change would be necessary because otherwise we're talking about a breaking change for pre-existing https drains. Multi-line seems like a reasonable implementation approach as well.

I need to re-read up on these schemes, I can no longer recall exactly why https is less optimal than the others 😅.

@ctlong ctlong moved this from Inbox to Review in Progress in DEPRECATED App Platform - Logging and Metrics Jun 21, 2023
@chombium
Contributor

chombium commented Jun 22, 2023

We've discussed the same issue in our team when @fhambrec and @dark5un were doing Syslog Agent performance tests. The difference between unencrypted and encrypted traffic and between the https and syslog-tls protocols is significant. That's why, in the end, the recommendation is to use the syslog-tls scheme for higher loads.
We've talked about possible improvements and thought about batching messages in single requests as well, but we think that's not the way to go: nothing like it is defined in the syslog RFC, and both the client (Syslog Agent in our case) and the server (generally speaking, an unknown third-party syslog server) would have to know how to pack and unpack the request body. As we only control the client in this case, we've concluded that this solution is not the right one.

In the end, our discussion went in the direction of refactoring Syslog Agent's egress so that we have, roughly speaking, a connection pool with load balancing: the Syslog Agent opens multiple connections and sends the messages in parallel. We can revive the topic in our team and see what we can do about it.

@ctlong ctlong moved this from Review in Progress to Issue - Triage complete. Needs fix. in DEPRECATED App Platform - Logging and Metrics Jun 22, 2023
@stefanlay
Member

stefanlay commented Jun 26, 2023

I need to re-read up on these schemes, I can no longer recall exactly why https is less optimal than the others

The reason that http drains perform badly is that the requests are sent one after the other over one TCP connection. The throughput is therefore 1/(round-trip time), which is less than 200/s inside an AWS datacenter. We reproduced that in performance tests. With syslog, more than 4000/s could be sent (independent of whether (m)TLS is used or not).

I see three options to overcome this; maybe there are more:

  1. Using more than one connection for an http syslog drain.
  • Advantage: Server behaviour unchanged; servers can keep receiving one message per request, as they do currently.
  • Disadvantage: Handling of a connection pool, and it is unclear how big it should be; the order of logs may get lost.
  2. Batching the messages, sending all available messages via one http request.
  • Advantage: (Relatively) simple to implement on the syslog agent side.
  • Disadvantage: Non-standard, therefore adaptation on the server side is needed.
  3. Using http2.
  • Advantage: Standard way to multiplex http requests over one connection; simple to implement on the syslog agent side (just using https://pkg.go.dev/golang.org/x/net/http2).
  • Disadvantage: Needs to be supported on the server side. I do not have experience there.

I would favor http2. We had already planned to look into it on the SAP side. This is, in my opinion, the most standard way. We could e.g. add the scheme "https2" to the list, in addition to https, syslog and syslog-tls.
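
For illustration only, a minimal Go sketch of option 3, using golang.org/x/net/http2 to enable HTTP/2 on a standard net/http client. The drain URL and message are made up, and this only demonstrates the library, not how Syslog Agent would adopt it:

```go
package main

import (
	"bytes"
	"crypto/tls"
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	// Plain TLS transport; ConfigureTransport adds HTTP/2 support so that
	// concurrent requests to the same drain can be multiplexed over a single
	// connection instead of being serialized one round trip at a time.
	tr := &http.Transport{TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12}}
	if err := http2.ConfigureTransport(tr); err != nil {
		log.Fatal(err)
	}
	client := &http.Client{Transport: tr}

	// Hypothetical drain endpoint; each in-flight request no longer has to
	// wait for the previous round trip to complete.
	msg := []byte("<14>1 2023-06-26T12:00:00Z example-host app-id [APP/PROC/WEB/0] - - hello")
	resp, err := client.Post("https://logs.example.com/drain", "text/plain", bytes.NewReader(msg))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.StatusCode)
}
```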

My first take is that we would be open to a PR to change the https scheme to batch messages, as long as the change is feature-flagged.

I would go one step further and request that the feature be configurable per drain. We are running multi-tenant foundations and cannot break existing customers using an http syslog drain. Maybe an option would be to configure it with a query parameter in the syslog drain URL (like the drain-type).
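
For illustration, a per-drain opt-in via the drain URL could look like this (the parameter name is hypothetical at this point):

```
https://logs.example.com/drain?drain-type=logs&batching=true
```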

@ctlong
Member

ctlong commented Jun 26, 2023

Thank you for the refresher @stefanlay!

I would favor http2.

I agree with this. Using http2 seems the more "standard" way to pump up the performance of this drain option.

I would go one step further and would request that the feature is per drain... Maybe an option would be to configure it with a query parameter in the syslog drain url (like the drain-type).

That's a very reasonable request. I'm in favor of this as well.

@Benjamintf1
Member

Howdy!

Just want to inject some context and thoughts.

Firstly, honestly speaking, I'm very much not opposed to adding the ability to batch into the syslog-agent code base. Not only do I suspect that batching will give us huge performance gains for https drains, but if and when we start to look into adding additional protocol support to the drain-agent (syslog-agent), we'd have a reusable, consistent tool which we could easily add to other interfaces that require or would benefit from batching (such as, for example, otlp).

Back in cf-for-k8s times, we spent a lot of time looking into our path to shared-nothing for log-cache. As we were working on deprecating the firehose and moving over to syslog, we saw that the performance of syslog ingress for log-cache was somewhat abysmal. The syslog server could accept traffic quite rapidly and efficiently, but the gRPC throughput from syslog-server to log-cache was horrendous. The way we fixed that performance bottleneck was to batch envelopes from the syslog-server to log-cache rather than sending them one by one. I think it's easy to look at this and apply the same lesson to http drains, and while performance testing seems reasonable, I highly suspect based on this that our performance will show a marked increase.

The main reason I feel we have not pursued this is just the general lack of priority and interest in https drains in general. Https drains are, I think, still largely something I see as a "proof of concept" that happens to be present in production code. I'm not aware of an RFC-supported method to send syslog over https in general. It seems like we "send syslog formatted messages" over https, and so does Heroku (!?!?!?), but you'd think anyone who uses https as their interface would prefer fully formatted JSON messages. As I say this, there's some point about the concern of adding performance cost to the drain agent from marshalling messages into JSON (something we've seen can be extremely costly), but I digress. As long as we're shoving syslog-formatted messages into a JSON message, https feels a bit awkward to me.

There's a second thing that makes it feel like a proof-of-concept interface. Internally, the model use case for http drains is, and somewhat has been, sending logs and metrics to an AWS Lambda service. Now I did a little research, and one thing I noticed is that Lambda doesn't support http2. This isn't necessarily a problem, as far as I understand no actual user has set up an AWS Lambda service to accept their logs and metrics. One major problem is that I'd suspect such a use case would be extremely, extremely expensive. But I'd be interested in some sort of mental model and use-case explanation to replace the existing one I have, and I wonder if in pursuing that mental model and use case we could come up with possibly more acceptable models and solutions. Right now we're talking about how to better support performance with http, and we might go into a solution workflow where we do continue to support https, or support http/2(s), but I could imagine us also talking about other protocols that still don't require TCP-level routing but would still be a good community standard, for example.

Now as for the configurability of batching, I really do think that leaving it unconfigurable could, and probably would, be fine. We've gotten away with a universal batch size for v2 loggregator for a long time now, and in fact, not configuring the size of the batches also means we don't have to configure the size of our gRPC buffers either (although we did have to change them to be accurate). I don't even think we have to specifically tune them, per se, to be the most performant; I think in many cases good enough will be good enough. The biggest risk in my mind would be users who expect 1 envelope and suddenly start receiving, say, 100, and stop processing lots of envelopes.

Let me know what you think though.

@nicklas-dohrn

Sorry for taking so long to move on this topic; we now have a draft proposal ready to talk about this improvement once more.

We did some testing on our side with a custom-built loggr-syslog-agent binary that contains the changes I provided in this draft PR:
#491

These changes implement only the syslog-batching approach we described above, with a pretty minimal and non-invasive code change that is enabled via a query parameter on a user-provided service.

Please let me know what you think of this approach, and how we could improve it to make it a contributable part of upcoming loggregator versions.

Greetings, Nicklas

@ctlong
Member

ctlong commented May 3, 2024

It's probably not going to have as big an impact as batching, but I found that we were not using fasthttp appropriately in our HTTPSWriter. That may have had a negative impact on the performance of Syslog Agent's HTTPS drains. Should be resolved in this PR: #573.
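
For context, a typical pattern for using fasthttp efficiently is to reuse a single long-lived client and the pooled request/response objects rather than allocating new ones per request. The sketch below shows that general pattern only and is not necessarily what #573 changes; the drain URL and message are made up:

```go
package main

import (
	"log"

	"github.com/valyala/fasthttp"
)

// One long-lived client so that connections to the drain are reused across
// requests instead of being re-established each time.
var client = &fasthttp.Client{}

func postLog(drainURL string, msg []byte) error {
	// Acquire pooled request/response objects to avoid per-request allocations.
	req := fasthttp.AcquireRequest()
	resp := fasthttp.AcquireResponse()
	defer fasthttp.ReleaseRequest(req)
	defer fasthttp.ReleaseResponse(resp)

	req.SetRequestURI(drainURL)
	req.Header.SetMethod("POST")
	req.Header.SetContentType("text/plain")
	req.SetBody(msg)

	return client.Do(req, resp)
}

func main() {
	msg := []byte("<14>1 2024-05-03T12:00:00Z example-host app-id [APP/PROC/WEB/0] - - hello")
	if err := postLog("https://logs.example.com/drain", msg); err != nil {
		log.Fatal(err)
	}
}
```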

@ctlong
Member

ctlong commented May 3, 2024

@nicklas-dohrn I've re-read through this issue and the current implementation of Syslog Agent, and I think #491 is slightly off track.

Instead of using a new batching=true query param in the syslog drain URL to indicate to the HTTPSWriter that batching is enabled, a new protocol should be defined. I like https-batch, but I'm open to other naming schemes. Then syslog drain URLs matching https-batch://..., for example, will be redirected to an HTTPSBatchWriter, or equivalent, that batches messages over HTTPS using \n as a separator.

The reason I think a new protocol would be better is twofold:

  • The current implementation doesn't have other examples of using query params to change the behaviour of the writers. That all happens at higher levels.
  • Conceptually, we are defining a new communication protocol here: multiple syslog messages separated by newlines delivered in the same HTTPS request.
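
A minimal sketch of what such an HTTPSBatchWriter could look like, assuming hypothetical type and method names; the actual writer proposed in #491 is wired into the agent differently, and this only illustrates the newline-separated batching described above:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// HTTPSBatchWriter collects syslog-formatted messages and sends them to the
// drain in one HTTPS request, separated by newlines.
type HTTPSBatchWriter struct {
	drainURL  string
	client    *http.Client
	buf       bytes.Buffer
	count     int
	batchSize int
}

func NewHTTPSBatchWriter(drainURL string, batchSize int) *HTTPSBatchWriter {
	return &HTTPSBatchWriter{
		drainURL:  drainURL,
		client:    &http.Client{Timeout: 10 * time.Second},
		batchSize: batchSize,
	}
}

// Write appends one message to the current batch and flushes once the batch is full.
func (w *HTTPSBatchWriter) Write(msg []byte) error {
	w.buf.Write(msg)
	w.buf.WriteByte('\n') // newline as the batch separator
	w.count++
	if w.count >= w.batchSize {
		return w.Flush()
	}
	return nil
}

// Flush sends all buffered messages in a single HTTPS request and resets the batch.
func (w *HTTPSBatchWriter) Flush() error {
	if w.count == 0 {
		return nil
	}
	resp, err := w.client.Post(w.drainURL, "text/plain", bytes.NewReader(w.buf.Bytes()))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("drain returned status %d", resp.StatusCode)
	}
	w.buf.Reset()
	w.count = 0
	return nil
}

func main() {
	// Hypothetical https-batch://... drain, resolved to its HTTPS endpoint.
	w := NewHTTPSBatchWriter("https://logs.example.com/drain", 100)
	_ = w.Write([]byte("<14>1 2024-05-03T12:00:00Z example-host app-id [APP/PROC/WEB/0] - - first message"))
	_ = w.Write([]byte("<14>1 2024-05-03T12:00:01Z example-host app-id [APP/PROC/WEB/0] - - second message"))
	_ = w.Flush() // a real writer would also flush on a timer so partial batches are not delayed indefinitely
}
```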

@nicklas-dohrn

nicklas-dohrn commented May 6, 2024

Yes, looking at it from a broader perspective, I see the appeal of having it split out at a higher point in the call chain.
I will look into how I can incorporate this way of switching Writers; off the top of my head, this should be rather straightforward to implement.

I have refactored all the respective code, and the approach looks rather appealing so far.
Let me know what you think.

@ctlong
Member

ctlong commented May 6, 2024

It's probably not going to have as big an impact as batching, but I found that we were not using fasthttp appropriately in our HTTPSWriter. That may have had a negative impact on the performance of Syslog Agent's HTTPS drains. Should be resolved in this PR: #573.

@nicklas-dohrn and other folks on this issue, what do you think of this change?

@nicklas-dohrn

I think that might prove helpful for the issue, but it will certainly not resolve our problems, as we currently allow cross-consumption between availability zones, e.g. from the US to the EU.
Sending single log entries in single https requests is, depending on the size of the logs, mostly inefficient, and might even be costly in datacenters with high I/O costs.
We have also seen a positive impact on our routing infrastructure, as sending bundled logs through nginx is much more efficient.

I will nonetheless try out this approach with my dev setup and see how it performs in our use case, as the only comparison I have found so far is:
https://github.com/nurmohammed840/web-benchmark?tab=readme-ov-file#go-v116-linux

My guess so far is that it would probably increase the throughput quite substantially, but I doubt it will improve by more than 200%, and we unfortunately likely need far more than that (at least 5x).

But I will come back to you with my results once I have some :)

@ctlong
Member

ctlong commented May 7, 2024

Thank you!

I understand; the batching PR is still desirable.
