This repository has been archived by the owner on Jan 21, 2022. It is now read-only.

Statistics are abysmally slow with large databases #251

Open
kescherCode opened this issue Jul 9, 2020 · 3 comments

Comments

@kescherCode

kescherCode commented Jul 9, 2020

At least when using an SQLite database, generating statistics tends to be really slow once the database reaches a large size.

This is for 12 hours:
[screenshot of the statistics view for a 12-hour window]

Generating it for 12 months is probably going to take longer than I'd like to keep my computer on.

Seven minutes for 12 hours of statistics, along with the high I/O usage this causes, seems awfully long. Maybe we should periodically (say, every hour) aggregate statistics into another table or even a separate database, and also find a way to generate those hourly statistics from all the old data.
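
Something like this is what I have in mind; the table and column names below are just made up for illustration and are not part of magnetico's schema:

-- Hypothetical hourly rollup table; names are illustrative only.
CREATE TABLE IF NOT EXISTS stats_hourly (
    hour          INTEGER PRIMARY KEY,  -- unix epoch, truncated to the hour
    torrent_count INTEGER NOT NULL,
    file_count    INTEGER NOT NULL,
    total_size    INTEGER NOT NULL
);

-- Run once per hour for the hour that just ended.
INSERT OR REPLACE INTO stats_hourly (hour, torrent_count, file_count, total_size)
SELECT (discovered_on / 3600) * 3600 AS hour
     , count(DISTINCT torrents.id)   AS torrent_count
     , count(files.id)               AS file_count
     , sum(files.size)               AS total_size
FROM torrents
JOIN files ON files.torrent_id = torrents.id
WHERE discovered_on >= ?  -- start of the previous hour
  AND discovered_on <  ?  -- start of the current hour
GROUP BY hour;

The statistics page could then aggregate stats_hourly into daily or monthly buckets, which stays cheap no matter how large torrents and files grow.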

I'm aware that users usually don't have a database as large as mine (currently 41 GiB), and that magnetico is explicitly not designed to scale, but I would still like to leave this up for consideration.

@kescherCode

Alternatively, I might look into building a separate statistics builder application.

@z3r0byt3s

[RELEASES: 31,721,867]-[FILES: 522,209,821]

Re-generated every hour on an NVMe disk; it took only 10-15 minutes.

@black-puppydog

black-puppydog commented Jul 20, 2021

So from what I can see, the query for the statistics (for sqlite3) is built in sqlite3.go:

// TODO: make it faster!
rows, err := db.conn.Query(fmt.Sprintf(`
	SELECT strftime('%s', discovered_on, 'unixepoch') AS dT
	     , sum(files.size)             AS tS
	     , count(DISTINCT torrents.id) AS nD
	     , count(DISTINCT files.id)    AS nF
	FROM torrents, files
	WHERE torrents.id = files.torrent_id
	  AND discovered_on >= ?
	  AND discovered_on <= ?
	GROUP BY dt;`,
	timef),
	fromTime.Unix(), toTime.Unix())

Note the TODO item. 😆

I built myself a query as it would be run for today, using one of the arbitrary date formats:

SELECT strftime('%Y-%m-%d', discovered_on, 'unixepoch') AS dT
, sum(files.size) AS tS
, count(DISTINCT torrents.id) AS nD              
, count(DISTINCT files.id) AS nF
FROM torrents, files
WHERE     torrents.id = files.torrent_id
AND discovered_on >= 1626180197 -- this is one week ago
AND discovered_on <= 1626784997 -- this is today
GROUP BY dt;

And indeed it is quite slow on a 34GB database.
So I prefixed the whole thing with EXPLAIN QUERY PLAN and voilà:

QUERY PLAN
|--SCAN TABLE files
|--SEARCH TABLE torrents USING INTEGER PRIMARY KEY (rowid=?)
`--USE TEMP B-TREE FOR GROUP BY

It seems to read the entire files table?! That would certainly explain the slowness.
I tried adding an index to speed up the matching of torrents.id against files.torrent_id in the WHERE clause:

CREATE INDEX torrent_file_index ON files (torrent_id);

But that didn't change the query plan at all. I then also replaced the WHERE formulation with an INNER JOIN, but again got no change. I'm not sure why, though; I don't typically do this kind of DB work.
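
One more thing that might be worth trying (I haven't tested this): since the only filter is on torrents.discovered_on, an index on that column might let SQLite drive the join from torrents instead of scanning all of files:

-- Untested guess: index the column the query actually filters on, so the
-- planner can restrict torrents by discovered_on first and then look up
-- the matching files rows through torrent_id.
CREATE INDEX IF NOT EXISTS idx_torrents_discovered_on ON torrents (discovered_on);

Whether the planner actually picks it up would need another EXPLAIN QUERY PLAN; even then the query still has to touch every matching files row, so it would mostly help for small time windows.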
I'd guess the "best" would be to add a file_count field into the torrents table, here:

res, err := tx.Exec(`
	INSERT INTO torrents (
		info_hash,
		name,
		total_size,
		discovered_on
	) VALUES (?, ?, ?, ?);
`, infoHash, name, totalSize, time.Now().Unix())

We already have total_size (not sure why the query above uses sum(files.size) instead), and with file_count we would be able to completely forgo the files table for this query. Of course, that would mean a schema migration... 😐
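
Roughly, the migration and the resulting statistics query could look like this (the file_count column is the one proposed above; nothing like it exists in the current schema):

-- Hypothetical migration: add the proposed column and backfill it once.
ALTER TABLE torrents ADD COLUMN file_count INTEGER NOT NULL DEFAULT 0;

UPDATE torrents
SET file_count = (SELECT count(*) FROM files WHERE files.torrent_id = torrents.id);

-- The statistics query would then not need the files table at all:
SELECT strftime('%Y-%m-%d', discovered_on, 'unixepoch') AS dT
     , sum(total_size) AS tS
     , count(id)       AS nD
     , sum(file_count) AS nF
FROM torrents
WHERE discovered_on >= ?
  AND discovered_on <= ?
GROUP BY dT;

New rows would set file_count at insert time, the same way total_size is set today.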
