
Incremented connection delays are not of the stated duration #2519

Closed
joh4nd opened this issue Apr 3, 2024 · 3 comments

Comments


joh4nd commented Apr 3, 2024

Elasticsearch Version

8.11.0

Installed Plugins

No response

Java Version

$ docker exec -u 0 -it container-name-here bash
root@9a26ef109662:/usr/share/elasticsearch# jdk/bin/java --version
openjdk 21.0.1 2023-10-17
OpenJDK Runtime Environment (build 21.0.1+12-29)
OpenJDK 64-Bit Server VM (build 21.0.1+12-29, mixed mode, sharing)

OS Version

Linux 9a26ef109662 6.5.11-linuxkit #1 SMP PREEMPT_DYNAMIC Wed Dec  6 17:14:50 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

The incremented connection delays described in the logs do not last for the stated duration. The logs describe retry logic with exponential backoff, with delays of roughly 1, 2, 4, 8, 16, 30, 30, 30, 30 seconds. Instead, the connection aborts within a second or so. This happens inside freshly started Docker containers (where the Elasticsearch container is not yet fully up and running), based on the official Elasticsearch and Python images, using the elasticsearch-py client v8.12.1.

The behavior changed when I added an alternative delay in Python; see https://stackoverflow.com/questions/78261325/why-does-flask-elasticsearch-timeout-duration-differ-between-docker-pause-and-do

To be clear, as soon as the containerized Elasticsearch is fully up and running, the Flask container is able to connect very quickly.

Steps to Reproduce

I followed the steps described in these two tutorials, except that:

  • I did not create a Python venv, because everything was containerized.
  • I connected to host.docker.internal instead of localhost.

Logs (if relevant)

where delay duration < 0.01s:

INFO:elastic_transport.transport:GET http://host.docker.internal:9200/ [status:N/A duration:0.004s]
WARNING:elastic_transport.node_pool:Node <Urllib3HttpNode(http://host.docker.internal:9200)> has failed for 10 times in a row, **putting on 30 second timeout**
WARNING:elastic_transport.transport:Retrying request after failure (attempt 9 of 12)
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/elastic_transport/_transport.py", line 342, in perform_request
    resp = node.perform_request(
           ^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/elastic_transport/_node/_http_urllib3.py", line 202, in perform_request
    raise err from None
elastic_transport.ConnectionError: Connection error caused by: ProtocolError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')))

where delay duration ~ 10s:

INFO:elastic_transport.transport:GET http://host.docker.internal:9200/ [status:N/A duration:10.015s]
WARNING:elastic_transport.node_pool:Node <Urllib3HttpNode(http://host.docker.internal:9200)> has failed for 2 times in a row, **putting on 2 second timeout**
WARNING:elastic_transport.transport:Retrying request after failure (attempt 1 of 12)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/elastic_transport/_transport.py", line 342, in perform_request
    resp = node.perform_request(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/elastic_transport/_node/_http_urllib3.py", line 202, in perform_request
    raise err from None
elastic_transport.ConnectionTimeout: Connection timeout caused by: ReadTimeoutError(HTTPConnectionPool(host='host.docker.internal', port=9200): Read timed out. (**read timeout=10**))
@DaveCTurner DaveCTurner transferred this issue from elastic/elasticsearch Apr 3, 2024
@pquentin (Member)

Hello, and sorry for the delay. Despite the various details, I'm missing a minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example) to properly analyse your problem. However, I think the explanation below may be enough.

Timeouts are the maximum amount of time allowed for an operation, but if the connection fails outright, because Elasticsearch hasn't started listening on its port, then the connection fails instantly. And (crucially) we don't currently sleep between retries, so you can easily get 10 successive connection attempts that all fail, with the whole sequence finishing within 100 ms.
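This distinction is easy to see outside the client. A minimal sketch (the loopback address and port below are arbitrary placeholders, assumed to have no listener) showing that a refused connection fails almost immediately, no matter how large the configured timeout is:

```python
import socket
import time

# Attempt a TCP connection to a loopback port assumed to have no listener.
# A 10-second timeout is configured, but a refused connection does not wait
# for it: the OS rejects the attempt almost immediately.
start = time.monotonic()
try:
    sock = socket.create_connection(("127.0.0.1", 9), timeout=10)
    sock.close()
except OSError:
    pass
elapsed = time.monotonic() - start

# elapsed is typically a few milliseconds, far below the 10 s timeout
print(f"connection attempt finished after {elapsed:.3f}s")
```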

Now, in the second case, my understanding from your StackOverflow post is that you used docker pause. In that case the port is open and, technically, Elasticsearch is listening on it. However, since the Elasticsearch process is paused, it can't acknowledge the connection, and in that case each attempt will indeed take the full 10 seconds.

So my understanding is that if we added proper exponential back-off (with sleep) between connection retries, then your issue would be solved. In the meantime, depends_on is the way to go.
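The back-off schedule the log messages suggest could be sketched as follows. This is only an illustration of the idea; `backoff_delays` and `connect_with_backoff` are hypothetical names, not part of elastic-transport:

```python
import time


def backoff_delays(max_retries=10, cap=30):
    """Yield exponentially increasing delays (1, 2, 4, ..., capped at `cap`),
    i.e. the 1/2/4/8/16/30/30/... schedule described in the log messages."""
    for attempt in range(max_retries):
        yield min(2 ** attempt, cap)


def connect_with_backoff(connect, max_retries=10):
    """Call `connect()` until it succeeds, sleeping between failed attempts
    so that repeated refused connections take minutes, not milliseconds."""
    last_error = None
    for delay in backoff_delays(max_retries):
        try:
            return connect()
        except ConnectionError as e:
            last_error = e
            time.sleep(delay)
    raise last_error
```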

Does that sound right?

@pquentin (Member)

Closing as I've not heard back from you.


joh4nd commented Jun 5, 2024

Hello @pquentin, thanks for getting back to me, and apologies for not responding (I was AFK).

  1. You are probably right to distinguish the operation from the connection. It was probably a failure on my part to understand the logs, but they seemed to report errors that, if I understand you correctly, were not the actual error. That is, the failure was in the connection, not the operation, although the log contained errors about the operation (timeouts). In that case, maybe the error messages thrown by the Elasticsearch Python client could be made more informative.

  2. By the way, I thought the official Elasticsearch tutorials would do as minimal reproducible examples.

  3. Though it was some time ago now, I believe I fixed the issue with the code below.

# github.com/joh4nd/mes-search/blob/main/src/pipeline/es_client.py
import json
import logging
import time

from elasticsearch import Elasticsearch


class Search:

    def __init__(self):
        self.es = Elasticsearch('http://host.docker.internal:9200')

        retries = 0
        max_retries = 12
        delay = 10  # seconds to wait between attempts

        while retries < max_retries:
            try:
                client_info = self.es.info()
                logging.info("Connected to Elasticsearch!")
                logging.info(json.dumps(client_info.body, indent=4))
                break
            except Exception as e:
                retries += 1
                if retries == max_retries:
                    raise e
                logging.info(
                    f"Waiting {delay} more seconds for Elasticsearch "
                    f"(attempt {retries}/{max_retries})..."
                )
                time.sleep(delay)
