
Incremented connection delays are not of the stated duration #2519

Closed
joh4nd opened this issue Apr 3, 2024 · 3 comments

Comments


joh4nd commented Apr 3, 2024

Elasticsearch Version

8.11.0

Installed Plugins

No response

Java Version

$ docker exec -u 0 -it container-name-here bash
root@9a26ef109662:/usr/share/elasticsearch# jdk/bin/java --version
openjdk 21.0.1 2023-10-17
OpenJDK Runtime Environment (build 21.0.1+12-29)
OpenJDK 64-Bit Server VM (build 21.0.1+12-29, mixed mode, sharing)

OS Version

Linux 9a26ef109662 6.5.11-linuxkit #1 SMP PREEMPT_DYNAMIC Wed Dec  6 17:14:50 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

The incremented connection delays described in the logs do not last for the stated duration. The logs describe retry logic with exponential backoff, with delays of roughly 1, 2, 4, 8, 16, 30, 30, 30, 30 seconds. Instead, the connection aborts within a second or so. This happens inside freshly started Docker containers (where the Elasticsearch container is not yet fully up and running), based on the official Elasticsearch and Python images, using the elasticsearch-py client v8.12.1.

The behavior changed when I added an alternative delay in Python; see https://stackoverflow.com/questions/78261325/why-does-flask-elasticsearch-timeout-duration-differ-between-docker-pause-and-do

To be clear, as soon as the containerized Elasticsearch is fully up and running, the Flask container is able to connect very quickly.

Steps to Reproduce

I followed the steps described in these two tutorials, except that:

  • I did not create a Python venv, because everything was containerized.
  • I connected to host.docker.internal instead of localhost.

Logs (if relevant)

where delay duration < 0.01s:

INFO:elastic_transport.transport:GET http://host.docker.internal:9200/ [status:N/A duration:0.004s]
WARNING:elastic_transport.node_pool:Node <Urllib3HttpNode(http://host.docker.internal:9200)> has failed for 10 times in a row, **putting on 30 second timeout**
WARNING:elastic_transport.transport:Retrying request after failure (attempt 9 of 12)
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/elastic_transport/_transport.py", line 342, in perform_request
    resp = node.perform_request(
           ^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/elastic_transport/_node/_http_urllib3.py", line 202, in perform_request
    raise err from None
elastic_transport.ConnectionError: Connection error caused by: ProtocolError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')))

where delay duration ~ 10s:

INFO:elastic_transport.transport:GET http://host.docker.internal:9200/ [status:N/A duration:10.015s]
WARNING:elastic_transport.node_pool:Node <Urllib3HttpNode(http://host.docker.internal:9200)> has failed for 2 times in a row, **putting on 2 second timeout**
WARNING:elastic_transport.transport:Retrying request after failure (attempt 1 of 12)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/elastic_transport/_transport.py", line 342, in perform_request
    resp = node.perform_request(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/elastic_transport/_node/_http_urllib3.py", line 202, in perform_request
    raise err from None
elastic_transport.ConnectionTimeout: Connection timeout caused by: ReadTimeoutError(HTTPConnectionPool(host='host.docker.internal', port=9200): Read timed out. (**read timeout=10**))
@DaveCTurner DaveCTurner transferred this issue from elastic/elasticsearch Apr 3, 2024
@pquentin (Member)

Hello, and sorry for the delay. Despite the various details, I'm missing a minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example) to properly analyse your problem. However, I think the explanation below may be enough.

Timeouts are the maximum amount of time allowed for an operation, but if the connection fails outright, because Elasticsearch hasn't started listening on its port, then the connection fails instantly. And (crucially) we don't currently sleep between retries, so you can easily get 10 successive connection attempts that all fail, with the whole sequence finishing within 100 ms.
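This distinction is easy to see outside the client. A minimal sketch (the loopback address and port below are arbitrary placeholders, assumed to have no listener) showing that a refused connection fails almost immediately, no matter how large the configured timeout is:

```python
import socket
import time

# Attempt a TCP connection to a loopback port assumed to have no listener.
# A 10-second timeout is configured, but a refused connection does not wait
# for it: the OS rejects the attempt almost immediately.
start = time.monotonic()
try:
    sock = socket.create_connection(("127.0.0.1", 9), timeout=10)
    sock.close()
except OSError:
    pass
elapsed = time.monotonic() - start

# elapsed is typically a few milliseconds, far below the 10 s timeout
print(f"connection attempt finished after {elapsed:.3f}s")
```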

Now, in the second case, my understanding from your StackOverflow post is that you used docker pause. In that case the port is open and, technically, Elasticsearch is listening on it. However, since the Elasticsearch process is paused, it can't acknowledge the connection, and in that case each attempt will indeed take the full 10 seconds.

So my understanding is that if we added proper exponential back-off (with sleep) between connection retries, then your issue would be solved. In the meantime, depends_on is the way to go.
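The back-off schedule the log messages suggest could be sketched as follows. This is only an illustration of the idea; `backoff_delays` and `connect_with_backoff` are hypothetical names, not part of elastic-transport:

```python
import time


def backoff_delays(max_retries=10, cap=30):
    """Yield exponentially increasing delays (1, 2, 4, ..., capped at `cap`),
    i.e. the 1/2/4/8/16/30/30/... schedule described in the log messages."""
    for attempt in range(max_retries):
        yield min(2 ** attempt, cap)


def connect_with_backoff(connect, max_retries=10):
    """Call `connect()` until it succeeds, sleeping between failed attempts
    so that repeated refused connections take minutes, not milliseconds."""
    last_error = None
    for delay in backoff_delays(max_retries):
        try:
            return connect()
        except ConnectionError as e:
            last_error = e
            time.sleep(delay)
    raise last_error
```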

Does that sound right?

@pquentin (Member)

Closing as I've not heard back from you.


joh4nd commented Jun 5, 2024

Hello @pquentin, thanks for getting back to me, and apologies for not responding (I was AFK).

  1. You are probably right to distinguish the operation from the connection. It was probably a failure on my part to understand the logs, but they seemed to report errors that, if I understand you correctly, were not the actual error. That is, the failure was in the connection, not the operation, although the log contained errors about the operation (timeouts). In that case, maybe the error messages thrown by the Elasticsearch Python client could be made more informative.

  2. By the way, I thought the official Elasticsearch tutorials would do as minimal reproducible examples.

  3. Though it was some time ago now, I believe I fixed the issue with the code below.

# github.com/joh4nd/mes-search/blob/main/src/pipeline/es_client.py
import json
import logging
import time

from elasticsearch import Elasticsearch


class Search:

    def __init__(self):
        self.es = Elasticsearch('http://host.docker.internal:9200')

        retries = 0
        max_retries = 12
        delay = 10  # seconds to wait between attempts

        while retries < max_retries:
            try:
                client_info = self.es.info()
                logging.info("Connected to Elasticsearch!")
                logging.info(json.dumps(client_info.body, indent=4))
                break
            except Exception as e:
                retries += 1
                if retries == max_retries:
                    raise e
                logging.info(
                    f"Waiting {delay} more seconds for Elasticsearch "
                    f"(attempt {retries}/{max_retries})..."
                )
                time.sleep(delay)
