-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Very long query times for get_auth_chain_difference_chains #17129
Comments
I believe we have the same problem. We noticed increased CPU usage after the upgrade from 1.104.0 to 1.105.1. This occurs occasionally for a few minutes. Homeserver: phys.ethz.ch
|
Same here on a semi-large homeserver deployment. Issues started with 1.105.0, get_auth_chain_difference_chains running for minutes instead of seconds. It persistet through 1.105.1: This evening I took a chance and reverted #17044 locally to see if it would help, but then federation breaks, so dont do it:
|
One more update: By reverting #17044, but keeping the class method in place for the security fix 55b0aa8, I now have a synapse that is back to normal for From what I have spotted in #17044 so far: The first usage of the query in the previous code:
In the new code, as the yield is at the end, I think step 2 and step 3 are switched (so first building links, then running difference_update, then step 2 from above). On the second usage, the I am not sure which of the two usages is causing the long running queries, but reverting both to the earlier version helped to get back to normal. From my limited understanding at least the |
It looks like the refactoring missed that the chains added by |
I've also partially reverted #17044 with apparent success. This is the diff: Diffdiff --git c/synapse/storage/databases/main/event_federation.py w/synapse/storage/databases/main/event_federation.py
index fb132ef09..97ac07c3a 100644
--- c/synapse/storage/databases/main/event_federation.py
+++ w/synapse/storage/databases/main/event_federation.py
@@ -280,16 +280,64 @@ class EventFederationWorkerStore(SignatureWorkerStore, EventsWorkerStore, SQLBas
# Now we look up all links for the chains we have, adding chains that
# are reachable from any event.
+ #
+ # This query is structured to first get all chain IDs reachable, and
+ # then pull out all links from those chains. This does pull out more
+ # rows than is strictly necessary, however there isn't a way of
+ # structuring the recursive part of query to pull out the links without
+ # also returning large quantities of redundant data (which can make it a
+ # lot slower).
+ sql = """
+ WITH RECURSIVE links(chain_id) AS (
+ SELECT
+ DISTINCT origin_chain_id
+ FROM event_auth_chain_links WHERE %s
+ UNION
+ SELECT
+ target_chain_id
+ FROM event_auth_chain_links
+ INNER JOIN links ON (chain_id = origin_chain_id)
+ )
+ SELECT
+ origin_chain_id, origin_sequence_number,
+ target_chain_id, target_sequence_number
+ FROM links
+ INNER JOIN event_auth_chain_links ON (chain_id = origin_chain_id)
+ """
# A map from chain ID to max sequence number *reachable* from any event ID.
chains: Dict[int, int] = {}
- for links in self._get_chain_links(txn, set(event_chains.keys())):
+
+ # Add all linked chains reachable from initial set of chains.
+ chains_to_fetch = set(event_chains.keys())
+ while chains_to_fetch:
+ batch2 = tuple(itertools.islice(chains_to_fetch, 1000))
+ chains_to_fetch.difference_update(batch2)
+ clause, args = make_in_list_sql_clause(
+ txn.database_engine, "origin_chain_id", batch2
+ )
+ txn.execute(sql % (clause,), args)
+
+ links: Dict[int, List[Tuple[int, int, int]]] = {}
+
+ for (
+ origin_chain_id,
+ origin_sequence_number,
+ target_chain_id,
+ target_sequence_number,
+ ) in txn:
+ links.setdefault(origin_chain_id, []).append(
+ (origin_sequence_number, target_chain_id, target_sequence_number)
+ )
+
for chain_id in links:
if chain_id not in event_chains:
continue
_materialize(chain_id, event_chains[chain_id], links, chains)
+ chains_to_fetch.difference_update(chains)
+
# Add the initial set of chains, excluding the sequence corresponding to
# initial event.
for chain_id, seq_no in event_chains.items():
@@ -579,16 +627,61 @@ class EventFederationWorkerStore(SignatureWorkerStore, EventsWorkerStore, SQLBas
# Now we look up all links for the chains we have, adding chains that
# are reachable from any event.
+ #
+ # This query is structured to first get all chain IDs reachable, and
+ # then pull out all links from those chains. This does pull out more
+ # rows than is strictly necessary, however there isn't a way of
+ # structuring the recursive part of query to pull out the links without
+ # also returning large quantities of redundant data (which can make it a
+ # lot slower).
+ sql = """
+ WITH RECURSIVE links(chain_id) AS (
+ SELECT
+ DISTINCT origin_chain_id
+ FROM event_auth_chain_links WHERE %s
+ UNION
+ SELECT
+ target_chain_id
+ FROM event_auth_chain_links
+ INNER JOIN links ON (chain_id = origin_chain_id)
+ )
+ SELECT
+ origin_chain_id, origin_sequence_number,
+ target_chain_id, target_sequence_number
+ FROM links
+ INNER JOIN event_auth_chain_links ON (chain_id = origin_chain_id)
+ """
+
+ # (We need to take a copy of `seen_chains` as we want to mutate it in
+ # the loop)
+ chains_to_fetch = set(seen_chains)
+ while chains_to_fetch:
+ batch2 = tuple(itertools.islice(chains_to_fetch, 1000))
+ clause, args = make_in_list_sql_clause(
+ txn.database_engine, "origin_chain_id", batch2
+ )
+ txn.execute(sql % (clause,), args)
+
+ links: Dict[int, List[Tuple[int, int, int]]] = {}
+
+ for (
+ origin_chain_id,
+ origin_sequence_number,
+ target_chain_id,
+ target_sequence_number,
+ ) in txn:
+ links.setdefault(origin_chain_id, []).append(
+ (origin_sequence_number, target_chain_id, target_sequence_number)
+ )
- # (We need to take a copy of `seen_chains` as the function mutates it)
- for links in self._get_chain_links(txn, set(seen_chains)):
for chains in set_to_chain:
for chain_id in links:
if chain_id not in chains:
continue
_materialize(chain_id, chains[chain_id], links, chains)
+ chains_to_fetch.difference_update(chains)
seen_chains.update(chains)
# Now for each chain we figure out the maximum sequence number reachable |
For details see element-hq#17129. Compared to a full revert, we keep the class method instroduced in element-hq#17129, as its used elsewhere by now Fixes: element-hq#17129 Signed-off-by: Christoph Settgast <csett86_git@quicksands.de>
For details see element-hq#17129. Compared to a full revert, we keep the class method instroduced in element-hq#17129, as its used elsewhere by now Fixes: element-hq#17129 Signed-off-by: Christoph Settgast <csett86_git@quicksands.de>
For details see element-hq#17129. Compared to a full revert, we keep the class method instroduced in element-hq#17129, as its used elsewhere by now Fixes: element-hq#17129 Signed-off-by: Christoph Settgast <csett86_git@quicksands.de>
Description
Since upgrading to v1.105.1 because of the security patch, we are seeing very high transaction times specifically for the transaction "get_auth_chain_difference_chains". In our case, we directly upgraded from v1.102.0 to v1.105.1, but heard from others with the same problem after upgrading from v1.104.0.
This could be a regression caused by #17044 which modified code called through several indirections by "get_auth_chain_difference_chains" or a regression caused by the security patches.
Steps to reproduce
Homeserver
kit.edu
Synapse Version
1.105.1
Installation Method
Debian packages from packages.matrix.org
Database
postgresql 13.14-0+deb11u1
Workers
Multiple workers
Platform
Synapse Main, Workers and the database all run in virtual machines.
Configuration
Presence is enabled.
Relevant log output
Anything else that would be useful to know?
No response
The text was updated successfully, but these errors were encountered: