
Cache table and schema indexes on schema address #7859

Merged
merged 21 commits into main from max/schema-based-caching on May 21, 2024

Conversation

@max-hoffman (Contributor) commented May 14, 2024

The bulk of ~1ms TPC-C read and write queries benefit from caching table and index schemas, whose lifecycle spans the window between schema migrations, ALTER statements, and new table additions. This is in contrast to how we've typically cached objects keyed on the root value hash, which works well for read-only workflows but has a much shorter half-life.
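
A minimal sketch of the idea, with hypothetical names rather than the PR's actual types: derived index metadata is cached under the schema's content hash, so entries survive ordinary writes (which change the root value) and are only discarded when the schema itself changes.

```go
package schemacache

import "sync"

// schemaHash stands in for the content hash of a table schema; entries keyed
// by it stay valid until the schema changes, not merely until the next write.
type schemaHash [20]byte

type indexMeta struct{ Name string }

type cache struct {
	mu      sync.RWMutex
	indexes map[schemaHash][]indexMeta
}

// getIndexes returns the cached index metadata for a schema hash, if present.
func (c *cache) getIndexes(h schemaHash) ([]indexMeta, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	idx, ok := c.indexes[h]
	return idx, ok
}

// putIndexes stores index metadata under the schema hash, lazily allocating the map.
func (c *cache) putIndexes(h schemaHash, idx []indexMeta) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.indexes == nil {
		c.indexes = make(map[schemaHash][]indexMeta)
	}
	c.indexes[h] = idx
}
```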

@max-hoffman (Contributor, Author):

#benchmark


@coffeegoddd (Contributor):

@max-hoffman DOLT

comparing_percentages: 100.000000 to 100.000000

| version | result | total   |
|---------|--------|---------|
| 51a2cae | ok     | 5937457 |

| version | total_tests |
|---------|-------------|
| 51a2cae | 5937457     |

correctness_percentage: 100.0

@coffeegoddd (Contributor):

@max-hoffman DOLT

| test_name           | from_latency_median | to_latency_median | is_faster |
|---------------------|---------------------|-------------------|-----------|
| tpcc-scale-factor-1 | 97.55               | 90.78             | 0         |

| test_name           | server_name | server_version | tps   | test_name           | server_name | server_version | tps   | is_faster |
|---------------------|-------------|----------------|-------|---------------------|-------------|----------------|-------|-----------|
| tpcc-scale-factor-1 | dolt        | eb86db5        | 22.82 | tpcc-scale-factor-1 | dolt        | 51a2cae        | 23.82 | 0         |

@coffeegoddd (Contributor):

@max-hoffman DOLT

| read_tests           | from_latency_median | to_latency_median | is_faster |
|----------------------|---------------------|-------------------|-----------|
| covering_index_scan  | 3.02                | 3.02              | 0         |
| groupby_scan         | 17.63               | 17.63             | 0         |
| index_join           | 5.18                | 5.28              | 0         |
| index_join_scan      | 2.22                | 2.22              | 0         |
| index_scan           | 53.85               | 53.85             | 0         |
| oltp_point_select    | 0.52                | 0.51              | 0         |
| oltp_read_only       | 8.43                | 8.43              | 0         |
| select_random_points | 0.81                | 0.81              | 0         |
| select_random_ranges | 0.97                | 0.97              | 0         |
| table_scan           | 54.83               | 55.82             | 0         |
| types_table_scan     | 134.9               | 134.9             | 0         |

| write_tests           | from_latency_median | to_latency_median | is_faster |
|-----------------------|---------------------|-------------------|-----------|
| oltp_delete_insert    | 6.67                | 6.67              | 0         |
| oltp_insert           | 3.25                | 3.19              | 0         |
| oltp_read_write       | 16.12               | 16.12             | 0         |
| oltp_update_index     | 3.49                | 3.49              | 0         |
| oltp_update_non_index | 3.43                | 3.36              | 0         |
| oltp_write_only       | 7.56                | 7.56              | 0         |
| types_delete_insert   | 7.56                | 7.43              | 0         |

@coffeegoddd (Contributor):

@max-hoffman DOLT

comparing_percentages: 100.000000 to 100.000000

| version | result | total   |
|---------|--------|---------|
| 4f42142 | ok     | 5937457 |

| version | total_tests |
|---------|-------------|
| 4f42142 | 5937457     |

correctness_percentage: 100.0

@coffeegoddd (Contributor):

@max-hoffman DOLT

comparing_percentages: 100.000000 to 100.000000

| version | result | total   |
|---------|--------|---------|
| f9bfd9e | ok     | 5937457 |

| version | total_tests |
|---------|-------------|
| f9bfd9e | 5937457     |

correctness_percentage: 100.0

@coffeegoddd (Contributor):

@max-hoffman DOLT

comparing_percentages: 100.000000 to 100.000000

| version | result | total   |
|---------|--------|---------|
| ffe7e0a | ok     | 5937457 |

| version | total_tests |
|---------|-------------|
| ffe7e0a | 5937457     |

correctness_percentage: 100.0

@coffeegoddd (Contributor):

@max-hoffman DOLT

comparing_percentages: 100.000000 to 100.000000

| version | result | total   |
|---------|--------|---------|
| 51caf2a | ok     | 5937457 |

| version | total_tests |
|---------|-------------|
| 51caf2a | 5937457     |

correctness_percentage: 100.0

@zachmu (Member) left a comment


General idea is sound, but I have some strong objections about coupling construction of a writer object to session cache management in this way. See comments.

Also it looks like benchmark results are kind of mixed? Do they need to be re-run?

go/libraries/doltcore/sqle/database.go (resolved)
}
}

type writeSessFunc func(nbf *types.NomsBinFormat, ws *doltdb.WorkingSet, aiTracker globalstate.AutoIncrementTracker, opts editor.Options) WriteSession
@zachmu (Member):


Should be an exported type since it's used in constructors.

Also needs a comment.

)

// SessionCache caches various pieces of expensive to compute information to speed up future lookups in the session.
type SessionCache struct {
	indexes map[doltdb.DataCacheKey]map[string][]sql.Index
	tables  map[doltdb.DataCacheKey]map[TableCacheKey]sql.Table
	// indexes is keyed by table schema
@zachmu (Member):


Table schema's hash?

func (c *SessionCache) GetCachedWriterState(key doltdb.DataCacheKey) (*WriterState, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.writers == nil {
@zachmu (Member):


Think you can safely omit this; reads from a nil map work fine.
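
For reference, a quick illustration of this Go behavior: reads from a nil map return the zero value without panicking; only writes panic.

```go
package main

import "fmt"

func main() {
	var m map[string]int // nil map, never initialized
	v, ok := m["missing"]
	fmt.Println(v, ok) // prints "0 false"; no panic on read
	// m["x"] = 1      // writing to a nil map would panic
}
```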

@@ -33,6 +33,7 @@ import (
"github.com/dolthub/dolt/go/libraries/doltcore/schema"
"github.com/dolthub/dolt/go/libraries/doltcore/sqle/dsess"
"github.com/dolthub/dolt/go/libraries/doltcore/sqle/sqlutil"
"github.com/dolthub/dolt/go/libraries/doltcore/sqle/writer"
"github.com/dolthub/dolt/go/libraries/doltcore/table/editor"
config2 "github.com/dolthub/dolt/go/libraries/utils/config"
@zachmu (Member):


Change this while you're at it

sqlSch, err := sqlutil.FromDoltSchema("", w.tableName.Name, sch)
if err != nil {
return err
func (w *prollyTableWriter) Reset(ctx *sql.Context, sess *prollyWriteSession, tbl *doltdb.Table, sch schema.Schema) error {
@zachmu (Member):


This is really bad, we really don't want to couple session state management to correct writer behavior like this. If this needs to happen for correctness, then the control of information flow needs to be one way only. Either the session owns this information and the caches of it and is responsible for clearing them, or this writer only manages its own internal state on reset.

@max-hoffman (Contributor, Author):


I'm not sure I follow this. The writer schema state needs to live in the session to be cached between transactions. It's only used for writing, so it's going to be coupled to writers. It's exclusively a property of the table, so anywhere we try to write the table seems like a valid place to check for the cached write state, as long as the table is well-formed (its schema is consistent with its data). If we passed the latest table to GetTableWriter, would that alleviate your concerns? Then there's maybe no ambiguity: we disregard whatever is in the existing write session and replace it with the schema values for the given table.
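
Roughly the shape being suggested, as a hedged sketch with illustrative names rather than the real dsess/writer signatures: the caller hands the current table in, and cached writer state is rebuilt whenever that table's schema hash doesn't match the cache, so the cache can never disagree with the table being written.

```go
package writersketch

// table is a stand-in for the real table type; only its schema hash matters here.
type table interface {
	SchemaHash() [20]byte
}

// writerState holds precomputed, schema-derived state used by table writers.
type writerState struct{ /* schema-derived fields */ }

type writerCache struct {
	states map[[20]byte]*writerState
}

// stateFor returns cached writer state for the given table's current schema,
// rebuilding it only when the schema hash has changed.
func (c *writerCache) stateFor(t table, build func() *writerState) *writerState {
	key := t.SchemaHash()
	if ws, ok := c.states[key]; ok {
		return ws
	}
	ws := build()
	if c.states == nil {
		c.states = make(map[[20]byte]*writerState)
	}
	c.states[key] = ws
	return ws
}
```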

@max-hoffman (Contributor, Author):


So, following up on this: the preexisting coupling is probably worse than we originally thought. I can't add a table parameter to these methods without breaking correctness. Certain writers rely on using the tables in prollyTableWriter.workingSet rather than whatever is available from the calling scope. I don't really like the indirection, but I would at least need to rewrite all of the fulltext logic to make this not the case, and it possibly extends to other areas. All of the action happens in GetTableWriter/Reset, and package cycles prevent me from moving the logic out of dsess. The places where we have the context to create writers are the only places to manage the caching lifecycle.

@zachmu (Member):


Punt I guess. This is at least a bit more self-contained now.


// GetTableWriter implemented WriteSession.
func (s *prollyWriteSession) GetTableWriter(ctx *sql.Context, table doltdb.TableName, db string, setter SessionRootSetter) (TableWriter, error) {
func (s *prollyWriteSession) GetTableWriter(ctx *sql.Context, table doltdb.TableName, db string, setter dsess.SessionRootSetter) (dsess.TableWriter, error) {
@zachmu (Member):


Same comment as above. This is really bad separation of concerns.

Looking around, it appears that this method is called in exactly one place where caching is critical: (t *WritableDoltTable) getTableEditor. That's where this caching logic should be applied.

@@ -106,6 +123,44 @@ func (s *prollyWriteSession) GetTableWriter(ctx *sql.Context, table doltdb.Table
return twr, nil
}

func getWriterSchemas(ctx *sql.Context, table *doltdb.Table, tableName string) (*dsess.WriterState, error) {
@zachmu (Member):


This logic belongs in a layer above as well (either tables.go, or in the SessionCache, either would be fine). Then pass this object into GetTableWriter
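
A rough sketch of that layering, using hypothetical names rather than the actual dolt types: the caller (e.g. getTableEditor) resolves or builds the writer state from the session cache, and the write session just receives it, so information flows one way.

```go
package layersketch

// writerState is the schema-derived state a table writer needs; illustrative only.
type writerState struct{ /* schema-derived fields */ }

type tableWriter interface{ Close() error }

// sessionCache is the layer that owns cached writer state.
type sessionCache interface {
	CachedWriterState(schemaHash [20]byte) (*writerState, bool)
	CacheWriterState(schemaHash [20]byte, ws *writerState)
}

// writeSession constructs writers from state handed to it; it never reaches
// back into the session cache itself.
type writeSession interface {
	GetTableWriter(tableName string, state *writerState) (tableWriter, error)
}

// getTableEditor is the one call site where caching matters in this sketch:
// look up or build the writer state here, then pass it down.
func getTableEditor(cache sessionCache, sess writeSession, tableName string, schemaHash [20]byte, build func() *writerState) (tableWriter, error) {
	state, ok := cache.CachedWriterState(schemaHash)
	if !ok {
		state = build()
		cache.CacheWriterState(schemaHash, state)
	}
	return sess.GetTableWriter(tableName, state)
}
```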

@max-hoffman (Contributor, Author):

#benchmark


@coffeegoddd (Contributor):

@max-hoffman DOLT

| test_name           | from_latency_median | to_latency_median | is_faster |
|---------------------|---------------------|-------------------|-----------|
| tpcc-scale-factor-1 | 97.55               | 90.78             | 0         |

| test_name           | server_name | server_version | tps   | test_name           | server_name | server_version | tps   | is_faster |
|---------------------|-------------|----------------|-------|---------------------|-------------|----------------|-------|-----------|
| tpcc-scale-factor-1 | dolt        | 68db779        | 22.59 | tpcc-scale-factor-1 | dolt        | 1bde799        | 23.74 | 0         |

@coffeegoddd (Contributor):

@max-hoffman DOLT

comparing_percentages: 100.000000 to 100.000000

| version | result | total   |
|---------|--------|---------|
| c1f36dd | ok     | 5937457 |

| version | total_tests |
|---------|-------------|
| c1f36dd | 5937457     |

correctness_percentage: 100.0

@coffeegoddd (Contributor):

@coffeegoddd DOLT

comparing_percentages: 100.000000 to 100.000000

| version | result | total   |
|---------|--------|---------|
| bd68cfd | ok     | 5937457 |

| version | total_tests |
|---------|-------------|
| bd68cfd | 5937457     |

correctness_percentage: 100.0

@coffeegoddd (Contributor):

@max-hoffman DOLT

| read_tests           | from_latency_median | to_latency_median | is_faster |
|----------------------|---------------------|-------------------|-----------|
| covering_index_scan  | 3.13                | 3.02              | 0         |
| groupby_scan         | 17.63               | 17.32             | 0         |
| index_join           | 5.28                | 5.18              | 0         |
| index_join_scan      | 2.22                | 2.22              | 0         |
| index_scan           | 53.85               | 52.89             | 0         |
| oltp_point_select    | 0.51                | 0.52              | 0         |
| oltp_read_only       | 8.43                | 8.43              | 0         |
| select_random_points | 0.8                 | 0.81              | 0         |
| select_random_ranges | 0.97                | 0.99              | 0         |
| table_scan           | 54.83               | 54.83             | 0         |
| types_table_scan     | 137.35              | 134.9             | 0         |

| write_tests           | from_latency_median | to_latency_median | is_faster |
|-----------------------|---------------------|-------------------|-----------|
| oltp_delete_insert    | 6.67                | 6.55              | 0         |
| oltp_insert           | 3.25                | 3.19              | 0         |
| oltp_read_write       | 16.12               | 15.83             | 0         |
| oltp_update_index     | 3.49                | 3.36              | 0         |
| oltp_update_non_index | 3.43                | 3.3               | 0         |
| oltp_write_only       | 7.56                | 7.43              | 0         |
| types_delete_insert   | 7.43                | 7.3               | 0         |

@coffeegoddd (Contributor):

@max-hoffman DOLT

comparing_percentages: 100.000000 to 100.000000

| version | result | total   |
|---------|--------|---------|
| 738257a | ok     | 5937457 |

| version | total_tests |
|---------|-------------|
| 738257a | 5937457     |

correctness_percentage: 100.0

@coffeegoddd (Contributor):

@max-hoffman DOLT

comparing_percentages: 100.000000 to 100.000000

| version | result | total   |
|---------|--------|---------|
| 5bf8b94 | ok     | 5937457 |

| version | total_tests |
|---------|-------------|
| 5bf8b94 | 5937457     |

correctness_percentage: 100.0

@coffeegoddd (Contributor):

@max-hoffman DOLT

comparing_percentages: 100.000000 to 100.000000

| version | result | total   |
|---------|--------|---------|
| 6edea18 | ok     | 5937457 |

| version | total_tests |
|---------|-------------|
| 6edea18 | 5937457     |

correctness_percentage: 100.0

@coffeegoddd (Contributor):

@coffeegoddd DOLT

comparing_percentages: 100.000000 to 100.000000

| version | result | total   |
|---------|--------|---------|
| 9bf3ce2 | ok     | 5937457 |

| version | total_tests |
|---------|-------------|
| 9bf3ce2 | 5937457     |

correctness_percentage: 100.0


@max-hoffman merged commit e958f36 into main on May 21, 2024
18 of 20 checks passed
@max-hoffman deleted the max/schema-based-caching branch on May 21, 2024 at 21:25

@coffeegoddd DOLT

| test_name       | detail       | row_cnt | sorted | mysql_time | sql_mult | cli_mult |
|-----------------|--------------|---------|--------|------------|----------|----------|
| batching        | LOAD DATA    | 10000   | 1      | 0.05       | 1.4      |          |
| batching        | batch sql    | 10000   | 1      | 0.07       | 1.71     |          |
| batching        | by line sql  | 10000   | 1      | 0.07       | 1.86     |          |
| blob            | 1 blob       | 200000  | 1      | 0.91       | 3.84     | 4.38     |
| blob            | 2 blobs      | 200000  | 1      | 0.87       | 4.9      | 5.23     |
| blob            | no blob      | 200000  | 1      | 0.89       | 2.07     | 2.13     |
| col type        | datetime     | 200000  | 1      | 0.82       | 2.6      | 2.79     |
| col type        | varchar      | 200000  | 1      | 0.68       | 3.01     | 2.87     |
| config width    | 2 cols       | 200000  | 1      | 0.79       | 2.13     | 2.13     |
| config width    | 32 cols      | 200000  | 1      | 1.89       | 1.74     | 2.44     |
| config width    | 8 cols       | 200000  | 1      | 1.01       | 1.98     | 2.14     |
| pk type         | float        | 200000  | 1      | 0.86       | 1.94     | 2        |
| pk type         | int          | 200000  | 1      | 0.91       | 2.16     | 1.86     |
| pk type         | varchar      | 200000  | 1      | 1.54       | 1.46     | 1.56     |
| row count       | 1.6mm        | 1600000 | 1      | 5.54       | 2.49     | 2.55     |
| row count       | 400k         | 400000  | 1      | 1.41       | 2.42     | 2.44     |
| row count       | 800k         | 800000  | 1      | 2.8        | 2.46     | 2.5      |
| secondary index | four index   | 200000  | 1      | 3.57       | 1.28     | 1.07     |
| secondary index | no secondary | 200000  | 1      | 0.88       | 2.11     | 2.1      |
| secondary index | one index    | 200000  | 1      | 1.11       | 2.16     | 2.13     |
| secondary index | two index    | 200000  | 1      | 1.98       | 1.59     | 1.44     |
| sorting         | shuffled 1mm | 1000000 | 0      | 5.08       | 2.46     | 2.49     |
| sorting         | sorted 1mm   | 1000000 | 1      | 5.05       | 2.47     | 2.49     |


@coffeegoddd DOLT

| name                                | detail       | mean_mult |
|-------------------------------------|--------------|-----------|
| dolt_blame_basic                    | system table | 1.29      |
| dolt_blame_commit_filter            | system table | 3.4       |
| dolt_commit_ancestors_commit_filter | system table | 0.87      |
| dolt_commits_commit_filter          | system table | 0.94      |
| dolt_diff_log_join_from_commit      | system table | 2.15      |
| dolt_diff_log_join_to_commit        | system table | 2.17      |
| dolt_diff_table_from_commit_filter  | system table | 1.07      |
| dolt_diff_table_to_commit_filter    | system table | 1.12      |
| dolt_diffs_commit_filter            | system table | 0.97      |
| dolt_history_commit_filter          | system table | 1.21      |
| dolt_log_commit_filter              | system table | 0.94      |


@coffeegoddd DOLT

| name                 | add_cnt | delete_cnt | update_cnt | latency |
|----------------------|---------|------------|------------|---------|
| adds_only            | 60000   | 0          | 0          | 0.73    |
| adds_updates_deletes | 60000   | 60000      | 60000      | 3.87    |
| deletes_only         | 0       | 60000      | 0          | 1.89    |
| updates_only         | 0       | 0          | 60000      | 2.45    |
