-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework relay #9968
Labels
Comments
Gerold103
added a commit
that referenced
this issue
Apr 19, 2024
When a replica subscribes, it might in the beginning try to position its reader cursor to the end of a large xlog file. Positioning inside of this file can take significant time during which the WAL reader yielded and tried to send heartbeats, but couldn't, because the relay thread wasn't communicating with the TX thread. When there are no messages from TX for too long time, the heartbeats to the replica are not being sent (commit 56571d8 ("raft: make followers notice leader hang")). The relay must communicate with the TX thread even when subscribe is just being started and opens a large xlog file. This isn't the first time when the missing heartbeats result into timeouts. See more here: - commit 30ad4a5 ("relay: yield explicitly every N sent rows"). - commit 1728944 ("recovery: make it yield when positioning in a WAL"). - commit ee6de02 ("relay: send heartbeats while reading a WAL"). Given that this is fixed fourth time, it might suggest that the relay has not the best architecture having some slight drawbacks. See more in #9968. Closes #9094 NO_DOC=bugfix
Gerold103
added a commit
that referenced
this issue
Apr 26, 2024
When a replica subscribes, it might in the beginning try to position its reader cursor to the end of a large xlog file. Positioning inside of this file can take significant time during which the WAL reader yielded and tried to send heartbeats, but couldn't, because the relay thread wasn't communicating with the TX thread. When there are no messages from TX for too long time, the heartbeats to the replica are not being sent (commit 56571d8 ("raft: make followers notice leader hang")). The relay must communicate with the TX thread even when subscribe is just being started and opens a large xlog file. This isn't the first time when the missing heartbeats result into timeouts. See more here: - commit 30ad4a5 ("relay: yield explicitly every N sent rows"). - commit 1728944 ("recovery: make it yield when positioning in a WAL"). - commit ee6de02 ("relay: send heartbeats while reading a WAL"). Given that this is fixed fourth time, it might suggest that the relay has not the best architecture having some slight drawbacks. See more in #9968. Closes #9094 NO_DOC=bugfix
sergepetrenko
pushed a commit
that referenced
this issue
Apr 27, 2024
When a replica subscribes, it might in the beginning try to position its reader cursor to the end of a large xlog file. Positioning inside of this file can take significant time during which the WAL reader yielded and tried to send heartbeats, but couldn't, because the relay thread wasn't communicating with the TX thread. When there are no messages from TX for too long time, the heartbeats to the replica are not being sent (commit 56571d8 ("raft: make followers notice leader hang")). The relay must communicate with the TX thread even when subscribe is just being started and opens a large xlog file. This isn't the first time when the missing heartbeats result into timeouts. See more here: - commit 30ad4a5 ("relay: yield explicitly every N sent rows"). - commit 1728944 ("recovery: make it yield when positioning in a WAL"). - commit ee6de02 ("relay: send heartbeats while reading a WAL"). Given that this is fixed fourth time, it might suggest that the relay has not the best architecture having some slight drawbacks. See more in #9968. Closes #9094 NO_DOC=bugfix
sergepetrenko
pushed a commit
to sergepetrenko/tarantool
that referenced
this issue
Apr 27, 2024
When a replica subscribes, it might in the beginning try to position its reader cursor to the end of a large xlog file. Positioning inside of this file can take significant time during which the WAL reader yielded and tried to send heartbeats, but couldn't, because the relay thread wasn't communicating with the TX thread. When there are no messages from TX for too long time, the heartbeats to the replica are not being sent (commit 56571d8 ("raft: make followers notice leader hang")). The relay must communicate with the TX thread even when subscribe is just being started and opens a large xlog file. This isn't the first time when the missing heartbeats result into timeouts. See more here: - commit 30ad4a5 ("relay: yield explicitly every N sent rows"). - commit 1728944 ("recovery: make it yield when positioning in a WAL"). - commit ee6de02 ("relay: send heartbeats while reading a WAL"). Given that this is fixed fourth time, it might suggest that the relay has not the best architecture having some slight drawbacks. See more in tarantool#9968. Closes tarantool#9094 NO_DOC=bugfix (cherry picked from commit f7e6686)
sergepetrenko
pushed a commit
to sergepetrenko/tarantool
that referenced
this issue
Apr 27, 2024
When a replica subscribes, it might in the beginning try to position its reader cursor to the end of a large xlog file. Positioning inside of this file can take significant time during which the WAL reader yielded and tried to send heartbeats, but couldn't, because the relay thread wasn't communicating with the TX thread. When there are no messages from TX for too long time, the heartbeats to the replica are not being sent (commit 56571d8 ("raft: make followers notice leader hang")). The relay must communicate with the TX thread even when subscribe is just being started and opens a large xlog file. This isn't the first time when the missing heartbeats result into timeouts. See more here: - commit 30ad4a5 ("relay: yield explicitly every N sent rows"). - commit 1728944 ("recovery: make it yield when positioning in a WAL"). - commit ee6de02 ("relay: send heartbeats while reading a WAL"). Given that this is fixed fourth time, it might suggest that the relay has not the best architecture having some slight drawbacks. See more in tarantool#9968. Closes tarantool#9094 NO_DOC=bugfix (cherry picked from commit f7e6686)
This was referenced Apr 27, 2024
sergepetrenko
pushed a commit
to sergepetrenko/tarantool
that referenced
this issue
May 8, 2024
When a replica subscribes, it might in the beginning try to position its reader cursor to the end of a large xlog file. Positioning inside of this file can take significant time during which the WAL reader yielded and tried to send heartbeats, but couldn't, because the relay thread wasn't communicating with the TX thread. When there are no messages from TX for too long time, the heartbeats to the replica are not being sent (commit 56571d8 ("raft: make followers notice leader hang")). The relay must communicate with the TX thread even when subscribe is just being started and opens a large xlog file. This isn't the first time when the missing heartbeats result into timeouts. See more here: - commit 30ad4a5 ("relay: yield explicitly every N sent rows"). - commit 1728944 ("recovery: make it yield when positioning in a WAL"). - commit ee6de02 ("relay: send heartbeats while reading a WAL"). Given that this is fixed fourth time, it might suggest that the relay has not the best architecture having some slight drawbacks. See more in tarantool#9968. Closes tarantool#9094 NO_DOC=bugfix (cherry picked from commit f7e6686)
sergepetrenko
pushed a commit
that referenced
this issue
May 8, 2024
When a replica subscribes, it might in the beginning try to position its reader cursor to the end of a large xlog file. Positioning inside of this file can take significant time during which the WAL reader yielded and tried to send heartbeats, but couldn't, because the relay thread wasn't communicating with the TX thread. When there are no messages from TX for too long time, the heartbeats to the replica are not being sent (commit 56571d8 ("raft: make followers notice leader hang")). The relay must communicate with the TX thread even when subscribe is just being started and opens a large xlog file. This isn't the first time when the missing heartbeats result into timeouts. See more here: - commit 30ad4a5 ("relay: yield explicitly every N sent rows"). - commit 1728944 ("recovery: make it yield when positioning in a WAL"). - commit ee6de02 ("relay: send heartbeats while reading a WAL"). Given that this is fixed fourth time, it might suggest that the relay has not the best architecture having some slight drawbacks. See more in #9968. Closes #9094 NO_DOC=bugfix (cherry picked from commit f7e6686)
sergepetrenko
pushed a commit
that referenced
this issue
May 8, 2024
When a replica subscribes, it might in the beginning try to position its reader cursor to the end of a large xlog file. Positioning inside of this file can take significant time during which the WAL reader yielded and tried to send heartbeats, but couldn't, because the relay thread wasn't communicating with the TX thread. When there are no messages from TX for too long time, the heartbeats to the replica are not being sent (commit 56571d8 ("raft: make followers notice leader hang")). The relay must communicate with the TX thread even when subscribe is just being started and opens a large xlog file. This isn't the first time when the missing heartbeats result into timeouts. See more here: - commit 30ad4a5 ("relay: yield explicitly every N sent rows"). - commit 1728944 ("recovery: make it yield when positioning in a WAL"). - commit ee6de02 ("relay: send heartbeats while reading a WAL"). Given that this is fixed fourth time, it might suggest that the relay has not the best architecture having some slight drawbacks. See more in #9968. Closes #9094 NO_DOC=bugfix (cherry picked from commit f7e6686)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
The relay right now is a quite sad state.
struct relay
is hidden in .cc file and even the simplest access to it has to go via a function).struct relay
members are not needed. For example,wal_endpoint
,tx_endpoint
, andwal_watcher
could easily exist just insiderelay_subscribe_f
- they are used only by it, could be there on the stack.relay_subscribe_f
callsstruct fiber *reader = fiber_new_xc(name, relay_reader_f);
, but if it throws an exception, then many things will not be cleaned up from above this line and will most likely just crash.struct relay
TX members must be moved intostruct replica
. Relay can't access them anyway from its threads. There must be clear separation of objects. One struct living in 2 threads at once makes it hard to protect from races.Relay at this point seems beyond microfix-saving and has to be reworked quite significantly. A good start would be to consider the architecture of one fiber per one job, so one fiber for TX messages processing, one fiber for socket reading, one fiber for socket writing, one fiber for WAL event processing. No crutches with WAL yield callback doing heartbeats and TX processing at once, and hacks to avoid accidental sending of a heartbeat during sending a TX. This won't be needed if the writer-fiber would take whole packs of data to send, and send them linearly.
As a buffer the senders (heartbeats and Raft messages from TX-processor-fiber, and txns from WAL-processor-fiber) could have their own lsregions and would be notified when their next data pack was sent, to propagate their lsregions. Callback-wise it would be similar to how
boost::asio::async_send
works.The text was updated successfully, but these errors were encountered: