This repository has been archived by the owner on Jan 21, 2022. It is now read-only.

Statistics are abysmally slow with large databases #251

Open
kescherCode opened this issue Jul 9, 2020 · 3 comments

Comments

@kescherCode

kescherCode commented Jul 9, 2020

At least when using an SQLite database, generating statistics tends to be really slow once the database reaches a large size.

This is for 12 hours:
[screenshot of the statistics view for a 12-hour window]

Generating it for 12 months is probably going to take longer than I'd like to keep my computer on.

Seven minutes for 12 hours of statistics, along with the high I/O usage this causes, seems awfully long. Maybe we should periodically (say, every hour) aggregate statistics into another table or even a separate database, and also find a way to generate those hourly statistics from all the old data.
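
Something like this is what I have in mind; the table and column names below are just made up for illustration and are not part of magnetico's schema:

-- Hypothetical hourly rollup table; names are illustrative only.
CREATE TABLE IF NOT EXISTS stats_hourly (
    hour          INTEGER PRIMARY KEY,  -- unix epoch, truncated to the hour
    torrent_count INTEGER NOT NULL,
    file_count    INTEGER NOT NULL,
    total_size    INTEGER NOT NULL
);

-- Run once per hour for the hour that just ended.
INSERT OR REPLACE INTO stats_hourly (hour, torrent_count, file_count, total_size)
SELECT (discovered_on / 3600) * 3600 AS hour
     , count(DISTINCT torrents.id)   AS torrent_count
     , count(files.id)               AS file_count
     , sum(files.size)               AS total_size
FROM torrents
JOIN files ON files.torrent_id = torrents.id
WHERE discovered_on >= ?  -- start of the previous hour
  AND discovered_on <  ?  -- start of the current hour
GROUP BY hour;

The statistics page could then aggregate stats_hourly into daily or monthly buckets, which stays cheap no matter how large torrents and files grow.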

I'm aware that users usually don't have a database as large as mine (currently 41 GiB), and that magnetico is explicitly not designed to scale, but I would still like to leave this up for consideration.

@kescherCode

Alternatively, I might look into building a separate statistics builder application.

@z3r0byt3s

[RELEASES: 31,721,867]-[FILES: 522,209,821]

Re-generated every hour on an NVMe disk; it took only 10-15 minutes.

@black-puppydog

black-puppydog commented Jul 20, 2021

So from what I can see, the query for the statistics (for sqlite3) is built in sqlite3.go:

// TODO: make it faster!
rows, err := db.conn.Query(fmt.Sprintf(`
	SELECT strftime('%s', discovered_on, 'unixepoch') AS dT
	     , sum(files.size)             AS tS
	     , count(DISTINCT torrents.id) AS nD
	     , count(DISTINCT files.id)    AS nF
	FROM torrents, files
	WHERE torrents.id = files.torrent_id
	  AND discovered_on >= ?
	  AND discovered_on <= ?
	GROUP BY dt;`,
	timef),
	fromTime.Unix(), toTime.Unix())

Note the TODO item. 😆

I built myself a query as it would be run for today, using one of the arbitrary date formats:

SELECT strftime('%Y-%m-%d', discovered_on, 'unixepoch') AS dT
, sum(files.size) AS tS
, count(DISTINCT torrents.id) AS nD              
, count(DISTINCT files.id) AS nF
FROM torrents, files
WHERE     torrents.id = files.torrent_id
AND discovered_on >= 1626180197 -- this is one week ago
AND discovered_on <= 1626784997 -- this is today
GROUP BY dt;

And indeed it is quite slow on a 34GB database.
So I prefixed the whole thing with EXPLAIN QUERY PLAN and voilà:

QUERY PLAN
|--SCAN TABLE files
|--SEARCH TABLE torrents USING INTEGER PRIMARY KEY (rowid=?)
`--USE TEMP B-TREE FOR GROUP BY

It seems to read the entire files table?! That would certainly explain the slowness.
I tried adding an index to speed up the matching of torrents.id against files.torrent_id in the WHERE clause:

CREATE INDEX torrent_file_index ON files (torrent_id);

But that didn't change the query plan at all. I then also replaced the WHERE formulation with an INNER JOIN, but again got no change. I'm not sure why, though; I don't typically do this kind of DB work.
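
One more thing that might be worth trying (I haven't tested this): since the only filter is on torrents.discovered_on, an index on that column might let SQLite drive the join from torrents instead of scanning all of files:

-- Untested guess: index the column the query actually filters on, so the
-- planner can restrict torrents by discovered_on first and then look up
-- the matching files rows through torrent_id.
CREATE INDEX IF NOT EXISTS idx_torrents_discovered_on ON torrents (discovered_on);

Whether the planner actually picks it up would need another EXPLAIN QUERY PLAN; even then the query still has to touch every matching files row, so it would mostly help for small time windows.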
I'd guess the "best" would be to add a file_count field into the torrents table, here:

res, err := tx.Exec(`
	INSERT INTO torrents (
		info_hash,
		name,
		total_size,
		discovered_on
	) VALUES (?, ?, ?, ?);
`, infoHash, name, totalSize, time.Now().Unix())

We already have total_size (not sure why the query above uses sum(files.size) instead), and with file_count we would be able to completely forgo the files table for this query. Of course, that would mean a schema migration... 😐
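
Roughly, the migration and the resulting statistics query could look like this (the file_count column is the one proposed above; nothing like it exists in the current schema):

-- Hypothetical migration: add the proposed column and backfill it once.
ALTER TABLE torrents ADD COLUMN file_count INTEGER NOT NULL DEFAULT 0;

UPDATE torrents
SET file_count = (SELECT count(*) FROM files WHERE files.torrent_id = torrents.id);

-- The statistics query would then not need the files table at all:
SELECT strftime('%Y-%m-%d', discovered_on, 'unixepoch') AS dT
     , sum(total_size) AS tS
     , count(id)       AS nD
     , sum(file_count) AS nF
FROM torrents
WHERE discovered_on >= ?
  AND discovered_on <= ?
GROUP BY dT;

New rows would set file_count at insert time, the same way total_size is set today.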
