Removed use of subqueries in email analytics queries #19917

Merged

kevinansfield
Contributor

@kevinansfield kevinansfield commented Mar 25, 2024

closes https://linear.app/tryghost/issue/ENG-790/remove-use-of-sub-queries-in-email-analytics

Avoiding subqueries means we don't have a process tied up for longer than necessary, and we can more easily see when an individual query performs poorly.

  • extracted the count queries into separate queries and used the retrieved values in the final update query
  • removed a query by moving the email open rate calculation into JS, as we've already fetched the necessary data before that point
  • optimized the calculation of delivered_count by switching from IS NOT NULL to IS NULL; the delivered_at column is usually populated, so the IS NULL query reads far fewer rows from the index when counting (see below for before/after)

delivered_at IS NOT NULL vs delivered_at IS NULL

mysql> explain analyze SELECT COUNT(id) FROM email_recipients WHERE email_id = '00000000846f9cc01e7bd3cc' AND delivered_at IS NOT NULL \G;
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(email_recipients.id)  (cost=47067 rows=1) (actual time=71.9..71.9 rows=1 loops=1)
    -> Filter: ((email_recipients.email_id = '00000000846f9cc01e7bd3cc') and (email_recipients.delivered_at is not null))  (cost=32464 rows=146030) (actual time=0.0552..69 rows=75682 loops=1)
        -> Covering index range scan on email_recipients using email_recipients_email_id_delivered_at_index over (email_id = '00000000846f9cc01e7bd3cc' AND NULL < delivered_at)  (cost=32464 rows=146030) (actual time=0.0511..43.2 rows=75682 loops=1)

mysql> explain analyze SELECT COUNT(id) FROM email_recipients WHERE email_id = '00000000846f9cc01e7bd3cc' AND delivered_at IS NULL \G;
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(email_recipients.id)  (cost=1477 rows=1) (actual time=4.44..4.44 rows=1 loops=1)
    -> Filter: (email_recipients.delivered_at is null)  (cost=813 rows=6638) (actual time=0.23..4.23 rows=3593 loops=1)
        -> Covering index lookup on email_recipients using email_recipients_email_id_delivered_at_index (email_id='00000000846f9cc01e7bd3cc', delivered_at=NULL)  (cost=813 rows=6638) (actual time=0.229..3.9 rows=3593 loops=1)
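
The first two changes in the list above (separate count queries feeding the final update, and the open rate computed in JS) could look roughly like the sketch below. It is an illustration only: the function names, the parameter names, and the column names in the returned object are assumptions, not taken from this PR's diff.

```javascript
// Illustrative sketch only: function, parameter, and column names are
// hypothetical and not taken from the Ghost codebase.

// Compute an open rate as a whole-number percentage in JS, using counts
// that earlier, separate COUNT queries have already returned.
function calculateOpenRate(openedCount, trackedCount) {
    if (!trackedCount) {
        return 0; // avoid dividing by zero when nothing was tracked
    }
    return Math.round((openedCount / trackedCount) * 100);
}

// Build the values for the final UPDATE from the previously fetched counts,
// so the UPDATE statement itself needs no subqueries.
function buildStatsUpdate({emailCount, openedCount, trackedCount}) {
    return {
        email_count: emailCount,
        email_opened_count: openedCount,
        email_open_rate: calculateOpenRate(openedCount, trackedCount)
    };
}
```

Because each count comes from its own query, a slow count shows up as a slow statement in its own right rather than being hidden inside one large update.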

@kevinansfield
Contributor Author

There is some potential improvement here by optimising our use of IS NULL vs IS NOT NULL based on typical usage patterns. See #19918 (comment)

@kevinansfield kevinansfield force-pushed the remove-analytics-sub-queries branch 2 times, most recently from 95b4c32 to 5bd9254 Compare April 2, 2024 13:06
ref https://linear.app/tryghost/issue/ENG-790/remove-use-of-sub-queries-in-email-analytics

- the `delivered_at` column is typically filled, or very nearly filled, with values, so the `IS NOT NULL` query matches a huge number of rows that MySQL has to read from the index in order to count them
- using `IS NULL` reverses that: the query now matches very few rows, which testing has shown to be considerably quicker
- after switching to `IS NULL` the query returns an "undelivered" count rather than a "delivered" count; to keep the rest of the system's behaviour the same, we calculate the delivered count by subtracting the query result from the total number of emails sent, which we can fetch with a very fast primary key lookup on the `emails` table
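
A minimal sketch of the subtraction described above, assuming the two values come from two separate queries. The constant and function names are hypothetical, and the `email_count` column in the second query is an assumption about the `emails` table rather than something stated in this PR.

```javascript
// Illustrative sketch only: names are hypothetical.

// Count of recipients with no delivery timestamp (the cheap IS NULL query).
const UNDELIVERED_COUNT_SQL =
    'SELECT COUNT(id) FROM email_recipients WHERE email_id = ? AND delivered_at IS NULL';

// Primary key lookup for the total sent; the `email_count` column name is assumed.
const TOTAL_SENT_SQL = 'SELECT email_count FROM emails WHERE id = ?';

// Derive the delivered count from the two query results.
// Clamped at zero in case the two numbers are briefly out of sync.
function deriveDeliveredCount(totalSent, undeliveredCount) {
    return Math.max(0, totalSent - undeliveredCount);
}
```

With the figures from the EXPLAIN ANALYZE output earlier in this PR (3593 rows with `delivered_at IS NULL`, 75682 with `IS NOT NULL`), and assuming the stored total equals their sum of 79275, `deriveDeliveredCount(79275, 3593)` gives 75682.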
@daniellockyer daniellockyer self-requested a review April 3, 2024 14:27
@kevinansfield kevinansfield merged commit bd93bf0 into TryGhost:main Apr 3, 2024
20 checks passed
@kevinansfield kevinansfield deleted the remove-analytics-sub-queries branch April 3, 2024 15:27