
CASSANDRA-19534 (5.0 patch): Unbounded queues in native transport requests lead to node instability #3274

Open
wants to merge 4 commits into base: cassandra-5.0
Conversation

ifesdjeen (Contributor)
No description provided.

@ifesdjeen ifesdjeen changed the title CASSANDRA-19534 CASSANDRA-19534 (5.0 patch): Unbounded queues in native transport requests lead to node instability Apr 29, 2024
@@ -143,7 +144,7 @@ public ColumnFilter columnFilter()
* @param state client state
* @return the result of the query.
*/
public PartitionIterator execute(ConsistencyLevel consistency, ClientState state, long queryStartNanoTime) throws RequestExecutionException;
public PartitionIterator execute(ConsistencyLevel consistency, ClientState state, Dispatcher.RequestTime requestTime) throws RequestExecutionException;
Contributor:

nit: JavaDoc update while we're here?

@@ -34,6 +34,7 @@
import org.apache.cassandra.locator.EndpointsForToken;
import org.apache.cassandra.locator.ReplicaPlan;
import org.apache.cassandra.locator.ReplicaPlan.ForWrite;
import org.apache.cassandra.transport.Dispatcher;
Contributor:

nit: A bunch of imports out of place?

public volatile CQLStartTime cql_start_time = CQLStartTime.REQUEST;

public boolean native_transport_throw_on_overload = false;
public double native_transport_queue_max_item_age_threshold = Double.MAX_VALUE;
Contributor:

nit: I want to try to work the token "ratio" into the name of this one, but haven't been able to come up w/ something concrete :D

@@ -19,15 +19,19 @@

import io.airlift.airline.Command;

import io.airlift.airline.Option;
Contributor:

nit: Move up to be next to Command import

@@ -178,7 +182,8 @@ protected boolean processOneContainedMessage(ShareableBytes bytes, Limit endpoin

// max CQL message size defaults to 256mb, so should be safe to downcast
int messageSize = Ints.checkedCast(header.bodySizeInBytes);
Contributor:

Not related to your patch, but I think I forgot to throw an @Override annotation on processOneContainedMessage()


if (delay > 0)
{
assert backpressure != Overload.NONE;
Contributor:

nit: Should this be just asserting that backpressure isn't NONE or should it be asserting that it's REQUESTS or QUEUE_TIME, which are the only things that would have an associated delay?

Contributor Author:

Right, so backpressure != Overload.NONE is tautological at that spot. Changed to:

assert backpressure == Overload.REQUESTS || backpressure == Overload.QUEUE_TIME : backpressure;

public String toString()
{
return "RequestProcessor{" +
"request=" + request +
Contributor:

nit: Message has a toString(), but Request does not, so we'll miss Request.createdAtNanos...if that matters.

// query that is stuck behind the EXECUTE query, we would rather time it out and catch up with a backlog, expecting
// that the bursts are going to be short-lived.
ClientMetrics.instance.queueTime(queueTime, TimeUnit.NANOSECONDS);
if (queueTime > DatabaseDescriptor.getNativeTransportTimeout(TimeUnit.NANOSECONDS))
Contributor:

nit: This is a hot-ish (?) path. Would it make sense to memoize the native transport timeout so we don't have to call TimeUnit#convert() so much?

Contributor Author:

Did this:

    private static long native_transport_timeout_nanos_cached = -1;

    public static long getNativeTransportTimeout(TimeUnit timeUnit)
    {
        if (timeUnit == TimeUnit.NANOSECONDS)
        {
            if (native_transport_timeout_nanos_cached == -1)
                native_transport_timeout_nanos_cached = conf.native_transport_timeout.to(TimeUnit.NANOSECONDS);

            return native_transport_timeout_nanos_cached;
        }
        return conf.native_transport_timeout.to(timeUnit);
    }

But arguably we should have a more generic pattern for these things. Maybe even always precompute millis, nanos, and micros. That should be extremely cheap and constant-time if we use a tiny array.
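The "tiny array" idea above could be sketched roughly as follows; the CachedDuration name and API are hypothetical, not part of the patch:

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch: precompute a duration in every TimeUnit once at
 * construction, so hot paths do a constant-time array lookup instead of
 * calling TimeUnit#convert() on every request.
 */
public final class CachedDuration
{
    // One slot per TimeUnit constant (NANOSECONDS .. DAYS)
    private final long[] cached = new long[TimeUnit.values().length];

    public CachedDuration(long value, TimeUnit unit)
    {
        for (TimeUnit target : TimeUnit.values())
            cached[target.ordinal()] = target.convert(value, unit);
    }

    /** Constant-time lookup; no conversion arithmetic on the hot path. */
    public long to(TimeUnit unit)
    {
        return cached[unit.ordinal()];
    }
}
```

This also avoids the `-1` sentinel and the nanos-only special case in the snippet above, at the cost of a few extra longs per configured duration.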

// Continuing incident: apply backpressure but do not bump severity level yet
else if (appliedTimes < 10)
{
return new Impl(minDelayNanos, maxDelayNanos, now, severityLevel == 0 ? 1 : severityLevel, appliedTimes + 1);
Contributor:

nit: Can we just start the severityLevel at 1?

Contributor:

nit: It seems like there's a lot of Impl creation going on here during a spike. Is there any way we could perhaps moderate that a tiny bit by perhaps making appliedTimes an AtomicInteger?

Contributor Author:

Unfortunately, because we have at least two variables that we need to update atomically (now and appliedTimes), we will either have to create some sort of object or do some binary math (but then we lose precision). I'm afraid I could not find a quick and easy way to make this more lightweight.

I would also like to highlight that as soon as we have applied the timeout, the client will back off for the given amount of time, so this might be less of a problem: we only do this when there is no capacity in the queue.

Also, the incident does start at 1; could you elaborate?
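The constraint being described (two fields that must change together, forcing a fresh immutable object per update) could be sketched like this; all names here are illustrative, not from the patch:

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical sketch: lastAppliedNanos and appliedTimes must be updated
 * atomically as a pair, so the state is an immutable snapshot swapped in
 * via compareAndSet. Each update necessarily allocates a new snapshot.
 */
public final class BackpressureState
{
    public static final class Snapshot
    {
        public final long lastAppliedNanos;
        public final int appliedTimes;

        Snapshot(long lastAppliedNanos, int appliedTimes)
        {
            this.lastAppliedNanos = lastAppliedNanos;
            this.appliedTimes = appliedTimes;
        }
    }

    private final AtomicReference<Snapshot> state =
        new AtomicReference<>(new Snapshot(0L, 0));

    /** Records one more application of backpressure; both fields move together. */
    public Snapshot apply(long nowNanos)
    {
        while (true)
        {
            Snapshot cur = state.get();
            Snapshot next = new Snapshot(nowNanos, cur.appliedTimes + 1);
            if (state.compareAndSet(cur, next))
                return next;
        }
    }
}
```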

Contributor:

I was just wondering why we had to do severityLevel == 0 ? 1 : severityLevel rather than just starting incidents at 1, but that means you have to check appliedAt or something in delay() to make sure you get zero before an incident has actually started.

…ated to a specific request

  * Add an ability to base _replica_ side queries on the queue time
  * Use queue time as a base for message timeouts
  * Use native transport deadline for internode messages
  * Make sure that local runnables respect transport timeouts and deadlines
  * Make sure that remote mutation handler respects message expiration times
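As a rough illustration of "use queue time as a base for message timeouts" (the helper name and signature here are hypothetical, not from the patch):

```java
/**
 * Hypothetical sketch: a request carries its enqueue timestamp; if it has
 * already waited in the queue longer than the transport timeout, drop it
 * and catch up with the backlog instead of executing stale work.
 */
public final class QueueTimeCheck
{
    public static boolean shouldDrop(long enqueuedAtNanos, long nowNanos, long timeoutNanos)
    {
        long queueTime = nowNanos - enqueuedAtNanos;
        return queueTime > timeoutNanos;
    }
}
```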
response.attach(request.connection);
FlushItem<?> toFlush = forFlusher.toFlushItem(channel, request, response);
flush(toFlush);
System.out.println(123123);
Contributor:

TODO: remove println

Contributor Author:

Whoops!


import com.google.common.base.Predicate;

import com.sun.jna.platform.win32.GL;
Contributor:

nit: unused import
