
Aeron Reliability #890

Open
jgerman opened this issue Apr 26, 2019 · 4 comments

Comments

@jgerman

jgerman commented Apr 26, 2019

We're having trouble with Aeron exceptions in the Onyx client. They are most often client conductor timeouts, though occasionally we see other Aeron-related exceptions. These exceptions kill the job one to four times per day (it's a long-running job).

We can't seem to make these exceptions go away. GC does not appear to be an issue, nor do we see CPU usage spikes (our systems run in GKE). Increasing CPU limits doesn't appear to help. The threads just don't seem to be woken up in time to run their checks.

I'm pretty much stuck at this point, trying various fixes while drawing up backup plans that don't involve Onyx. Any help pointing me in the right direction would be appreciated.

@neuromantik33

We had a bunch of issues with Aeron (and we also run in GKE 😉), and for long-running jobs here are our de facto settings, to be taken with a grain of salt I might add...

aeron.properties

# Timeout for client liveness in nanoseconds.
aeron.client.liveness.timeout=20000000000

# Timeout for image liveness in nanoseconds.
aeron.image.liveness.timeout=20000000000

# Increase the size of the maximum transmission unit to reduce system calls in a throughput scenario.
aeron.mtu.length=16384

# Set the initial window for flow control to account for BDP.
#aeron.rcv.initial.window.length=2097152

# Increase the size of OS socket receive buffer (SO_RCVBUF) to account for Bandwidth Delay Product (BDP) on a high bandwidth network.
#aeron.socket.so_rcvbuf=2097152

# Increase the size of OS socket send buffer (SO_SNDBUF) to account for Bandwidth Delay Product (BDP) on a high bandwidth network.
#aeron.socket.so_sndbuf=2097152

# Length (in bytes) of the log buffers for publication terms.
aeron.term.buffer.length=65536

# Use sparse files for the term buffers so pages are only allocated when touched.
aeron.term.buffer.sparse.file=true

# Disable bound checking to reduce instruction path on private secure networks.
agrona.disable.bounds.checks=true
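
These are ordinary system properties that the media driver reads at startup, so the driver just needs to load the file before launching. This isn't our actual image entrypoint, just a minimal sketch of that kind of launcher (class name and path are illustrative):

import io.aeron.driver.MediaDriver;
import org.agrona.SystemUtil;
import org.agrona.concurrent.ShutdownSignalBarrier;

public class AeronDriverMain
{
    public static void main(final String[] args)
    {
        // Load aeron.properties into system properties before the driver
        // context is built, so the timeouts above take effect.
        SystemUtil.loadPropertiesFiles(new String[] { "/oscaro/etc/aeron.properties" });

        // Launch the media driver and block until shutdown is signalled,
        // so the container keeps running.
        try (MediaDriver driver = MediaDriver.launch())
        {
            new ShutdownSignalBarrier().await();
        }
    }
}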

and we run Aeron as a sidecar container with the /dev/shm directory mounted as an in-memory dir:

...
- args:
  - /oscaro/etc/aeron.properties
  env:
  - name: JAVA_OPTS
    value: -Xmx256m
  - name: PROMETHEUS_METRICS_PORT
    value: "8091"
  image: eu.gcr.io/oscaro-cloud/oscaro/aeron-driver:1.9.3-e678e95
  imagePullPolicy: IfNotPresent
  name: aeron
  ports:
  - containerPort: 40200
    protocol: TCP
  - containerPort: 40200
    protocol: UDP
  - containerPort: 8091
    name: aeron-metrics
    protocol: TCP
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 250m
      memory: 256Mi
  volumeMounts:
  - mountPath: /oscaro/etc
    name: config
  - mountPath: /dev/shm
    name: aeron
...
volumes:
- configMap:
    name: pipeline
  name: config
- emptyDir:
    medium: Memory
  name: aeron

Apparently, as was mentioned elsewhere, we should not be setting CPU limits on the container. We'll see what happens, but for now it seems relatively stable, even if it dampens latency a bit. We are unable to set the UDP socket buffers because we run our cluster on COS, which as of 1.10 just doesn't allow changing sysctl parameters within the pods.
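
On the client side, the peer container points at the same /dev/shm directory the sidecar exposes. Onyx manages its Aeron connections internally, so the following is only a rough illustration of the knobs on the raw Aeron Java client; the directory name and timeout value are assumptions, not our exact settings:

import io.aeron.Aeron;

public class ClientSketch
{
    public static void main(final String[] args)
    {
        final Aeron.Context ctx = new Aeron.Context();
        ctx.aeronDirectoryName("/dev/shm/aeron"); // assumed name of the shared in-memory dir
        ctx.driverTimeoutMs(20_000);              // how long to wait on the driver before giving up (assumed value)

        try (Aeron aeron = Aeron.connect(ctx))
        {
            // publications/subscriptions as usual
        }
    }
}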

@jgerman
Author

jgerman commented Apr 26, 2019

That's a ton of great information, thanks!

I was reluctant to increase settings like the liveness timeout (beyond our current 10 seconds) because I was afraid we were just masking the issue.

Did you confirm that CPU throttling is your issue, and that you're just trying to mitigate it at this point?

@thenonameguy
Contributor

thenonameguy commented Apr 26, 2019

Just for reference, here are our aeron.properties:

aeron.socket.so_sndbuf=2097152
aeron.socket.so_rcvbuf=2097152
aeron.term.buffer.length=65536
aeron.image.liveness.timeout=10000000000
aeron.conductor.idle.strategy=org.agrona.concurrent.BusySpinIdleStrategy
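
For anyone launching the media driver embedded rather than via a properties file, the same knobs exist programmatically on the driver context. A rough sketch, with values copied from the properties above rather than anything validated separately:

import io.aeron.driver.MediaDriver;
import org.agrona.concurrent.BusySpinIdleStrategy;
import org.agrona.concurrent.ShutdownSignalBarrier;

public class EmbeddedDriverSketch
{
    public static void main(final String[] args)
    {
        // A busy-spin conductor trades a core's worth of CPU for prompt
        // liveness checks; the timeout mirrors aeron.image.liveness.timeout.
        final MediaDriver.Context ctx = new MediaDriver.Context()
            .conductorIdleStrategy(new BusySpinIdleStrategy())
            .imageLivenessTimeoutNs(10_000_000_000L);

        try (MediaDriver driver = MediaDriver.launch(ctx))
        {
            new ShutdownSignalBarrier().await();
        }
    }
}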

We also spent countless hours trying to find a good configuration in archived Slack discussions, so this thread is much appreciated.

@jgerman
Author

jgerman commented Apr 29, 2019

We took our Onyx cluster (well, the 0.14 one), isolated it into its own node pool, and dropped the CPU limits. That seems to have done the trick. I didn't want to jinx it over the weekend, but we've been running since Friday afternoon with no Aeron exceptions. Previously we couldn't go 24 hours without an exception and a killed job.

No matter which way you slice it, even if we hit the exception today, this is a tremendous improvement.
