Update the build cache after building every module #3996

Open
wants to merge 1 commit into master
Conversation

@gorbak25

gorbak25 commented Jan 23, 2021

Instead of writing the build cache once at the end of the build, a separate thread now updates the DB as compilation results come in.
As a result, when a build is interrupted and started again, the compiler doesn't recompile previously compiled modules.

I've tested the behavior using Ctrl+C and killall -9 purs - with Ctrl+C the PR works as intended; with killall -9 purs it's possible that the DB becomes corrupted due to a partial write, but in practice I've rarely triggered that race. Fixing the latter case would require either:

  1. A proper DB
  2. Writing to cache-db.json.tmp and then issuing "mv cache-db.json.tmp cache-db.json" (sketched below)

For now I haven't implemented a fix for the possible partial write on SIGKILL, as I want to discuss the direction with the maintainers first.
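For reference, option 2 is only a few lines; a minimal sketch in Haskell, assuming the temp file lands on the same filesystem (where renameFile is atomic) and using illustrative names rather than the compiler's actual code:

```haskell
-- A minimal sketch of option 2: write to a temp file, then rename it over
-- the old cache. Assumes the .tmp file is on the same filesystem, where
-- renameFile is atomic. Names are illustrative, not the compiler's code.
import qualified Data.ByteString.Lazy as BL
import System.Directory (renameFile)

writeCacheDbAtomically :: FilePath -> BL.ByteString -> IO ()
writeCacheDbAtomically path encodedDb = do
  let tmp = path ++ ".tmp"
  BL.writeFile tmp encodedDb  -- a SIGKILL here leaves only the .tmp behind
  renameFile tmp path         -- readers see either the old file or the new one
```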

Fixes #3995

@rhendric
Member

I don't love the idea of rewriting the entire cache file every time a build task completes. It probably doesn't practically matter for projects that aren't very large, but introducing a quadratic complexity paid by everyone to cover a sad-path case doesn't sit right with me.

Without a way to recover from OOMs, though, eager writing of some sort does seem like the only way to achieve incremental resiliency. And that probably means that the cache database would need to be redesigned to something that can be appended to. I don't personally know what else consumes that cache file at the moment—is it depended upon by any external projects, or can we redesign it at will?

@natefaubion
Contributor

I don't personally know what else consumes that cache file at the moment—is it depended upon by any external projects, or can we redesign it at will?

The cache db is specifically an internal-only artifact.

@rhendric
Member

Oh good. In that case, I would propose refactoring it to be either an append-only stream of JSON objects (with amortized consolidation) or a directory with one file per module. (Either solution would also give a way to recover from partial-write corruption without worrying about temp files.)

@gorbak25
Author

gorbak25 commented Jan 23, 2021

Oh good. In that case, I would propose refactoring it to be either an append-only stream of JSON objects (with amortized consolidation) or a directory with one file per module. (Either solution would also give a way to recover from partial-write corruption without worrying about temp files.)

I personally like the idea of an append-only JSON file which gets consolidated at the beginning/end of the compilation - this also means that it would be backwards compatible with existing build caches. @rhendric How about the following backwards-compatible design (see the sketch after this list):

  1. Every JSON object in the cache is separated by a unique marker (newline/NUL byte or something similar)
  2. Read the cache file and split by the marker - try deserializing every object and then consolidate the map
  3. As compilation results come in, append to the file either {action: "delete", key: "..."} or {action: "update", key: "...", value: "..."}
  4. When the compilation ends, consolidate the map and write the result to cache-db.json.tmp
  5. Issue "mv cache-db.json.tmp cache-db.json" to replace the old file with the consolidated one

Steps 4 and 5 ensure that the DB size stays bounded. A sudden SIGKILL anywhere in steps 1-4 leaves the cache in a recoverable state (in the case of SIGKILL we consolidate the map before starting the next compilation).
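A minimal sketch of steps 2 and 3 in Haskell, assuming aeson and one JSON object per line; CacheAction and the exact encoding are illustrative, not a finished design:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- A minimal sketch of the append-only log (steps 2 and 3), assuming aeson
-- and newline-separated entries. CacheAction and the encoding are
-- illustrative, not a finished design.
import Data.Aeson (Value, decode, encode, object, (.:), (.=))
import Data.Aeson.Types (parseMaybe)
import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Data.Map as M
import Data.Maybe (mapMaybe)

data CacheAction = Update String Value | Delete String

-- Step 3: append one entry per compiled module as results come in.
appendEntry :: FilePath -> CacheAction -> IO ()
appendEntry path entry = BL.appendFile path (encodeEntry entry <> "\n")

encodeEntry :: CacheAction -> BL.ByteString
encodeEntry (Update k v) =
  encode (object ["action" .= ("update" :: String), "key" .= k, "value" .= v])
encodeEntry (Delete k) =
  encode (object ["action" .= ("delete" :: String), "key" .= k])

-- Step 2: replay the log; later entries win, and lines that fail to parse
-- (e.g. a torn final write after SIGKILL) are simply dropped.
consolidate :: BL.ByteString -> M.Map String Value
consolidate = foldl apply M.empty . mapMaybe decodeEntry . BL.lines
  where
    apply m (Update k v) = M.insert k v m
    apply m (Delete k)   = M.delete k m

decodeEntry :: BL.ByteString -> Maybe CacheAction
decodeEntry line = decode line >>= parseMaybe entryParser
  where
    entryParser o = do
      action <- o .: "action"
      k <- o .: "key"
      case (action :: String) of
        "update" -> Update k <$> o .: "value"
        "delete" -> pure (Delete k)
        _        -> fail "unknown action"
```

Dropping unparseable lines during consolidation is exactly what makes a torn write after SIGKILL recoverable.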

@f-f
Member

f-f commented Jan 23, 2021

Every JSON object in the cache is separated by a unique marker (newline/NUL byte or something similar)

@gorbak25 there is already a widespread convention for appendable JSON (JSON Lines, i.e. newline-delimited JSON) - if we were to follow that we should use newlines rather than NUL bytes or anything else, which would also make the file nicer to debug manually (see the example below)
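For illustration, a hypothetical cache log in that convention under gorbak25's scheme, with elided values:

```
{"action":"update","key":"Data.Maybe","value":"..."}
{"action":"update","key":"Data.Either","value":"..."}
{"action":"delete","key":"Data.Old"}
```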

@rhendric
Member

@gorbak25 Sounds broadly good!

@kritzcreek
Member

I like the atomicity we get from building up a temp file and copying it over the original cache-db.json.

I really don't want multiple files, as that's a nightmare when you're dealing with parallel processes. Think of an IDE server as well as a compilation process. In the current model no two processes can open the cache-db file in write mode at the same time, so the filesystem gives us some protection.
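For what it's worth, that single-writer discipline can also be made explicit; a minimal sketch, assuming the filelock package and an illustrative lock-file path:

```haskell
-- A minimal sketch of an explicit single-writer lock around the cache DB,
-- assuming the "filelock" package. The paths are illustrative.
import System.FileLock (SharedExclusive (Exclusive), withFileLock)

cacheDbPath :: FilePath
cacheDbPath = "output/cache-db.json"

-- A second writer blocks here instead of clobbering the file.
withCacheDbWriteLock :: IO a -> IO a
withCacheDbWriteLock action =
  withFileLock (cacheDbPath ++ ".lock") Exclusive (\_ -> action)
```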

We need to keep this part of the codebase as simple as possible, because it's dealing with cache invalidation, concurrency, and file systems at the same time, so we really have no room for accidental complexity.

@gorbak25 Regarding your original problem.

  • If the problem is the compiler compiling too many modules at the same time, you can limit the number of threads by passing appropriate RTS arguments to purs, like so: purs compile ..... +RTS -N1 -RTS
  • If you can't compile your whole project from scratch, you can pass incrementally larger sets of files to the purs compiler:
~\code\pscid [master ≡]> npx purs compile ".\.spago\prelude\v4.1.1\src\**\*.purs"
Compiling Type.Data.RowList
Compiling Type.Data.Row
Compiling Record.Unsafe
Compiling Data.Symbol
Compiling Data.NaturalTransformation
Compiling Data.Boolean
Compiling Control.Semigroupoid
Compiling Data.Show
Compiling Control.Category
Compiling Data.Void
Compiling Data.Unit
Compiling Data.Semiring
Compiling Data.Semigroup
Compiling Data.Ring
Compiling Data.BooleanAlgebra
Compiling Data.Eq
Compiling Data.CommutativeRing
Compiling Data.Ordering
Compiling Data.EuclideanRing
Compiling Data.Ord.Unsafe
Compiling Data.Ord
Compiling Data.DivisionRing
Compiling Data.Field
Compiling Data.Bounded
Compiling Data.Function
Compiling Data.Monoid
Compiling Data.Functor
Compiling Control.Apply
Compiling Control.Applicative
Compiling Control.Bind
Compiling Control.Monad
Compiling Prelude
Compiling Data.Semigroup.First
Compiling Data.Monoid.Additive
Compiling Data.Monoid.Disj
Compiling Data.Monoid.Conj
Compiling Data.Monoid.Endo
Compiling Data.Monoid.Multiplicative
Compiling Data.Semigroup.Last
Compiling Data.Monoid.Dual
~\code\pscid [master ≡]> npx purs compile ".\.spago\prelude\v4.1.1\src\**\*.purs" ".\.spago\newtype\v3.0.0\src\**\*.purs"
Compiling Data.Newtype

See how the second invocation doesn't recompile the prelude modules?

@rhendric
Member

rhendric commented Feb 6, 2021

@gorbak25, just wanted to make sure you saw #4010, if you're still planning on changing the logical cache file format.

@hdgarrood
Contributor

Oh oops yes, I was intending to comment here after opening that because this occurred to me while reading this thread, but I must have got distracted. Thanks @rhendric :)

@gorbak25
Author

gorbak25 commented Feb 8, 2021

@rhendric Yes I saw it :) Now I'm waiting for some free time to finish this PR :)

@hdgarrood
Contributor

I just had another idea. The obvious place to store the input timestamps and hashes is in the externs files themselves. We only have a separate cache-db.json file so that we can update the last-seen modification times of the input source files, if need be, without having to write to the externs files, because writing to an externs file will automatically trigger rebuilds of all downstream modules.

HOWEVER, we also plan to implement “cutoff” in the not too distant future. Cutoff opportunities occur when a module is rebuilt but its interface is unchanged after rebuilding. In this case, we don’t need to trigger rebuilds of downstream modules if those modules themselves haven’t changed.

The cache-db.json file is only necessary because we don’t have cutoff. If we did, we could safely store the last seen timestamps and hashes of input files in their respective externs file (by not considering this information to be part of the module interface), and not need a separate cache-db.json file at all. Then, the problem this PR aims to fix simply doesn’t arise in the first place.
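To make the cutoff idea concrete, a minimal sketch with hypothetical types standing in for the compiler's real ones:

```haskell
-- A minimal sketch of cutoff with hypothetical types. The key move: input
-- timestamps/hashes live in the externs file but are NOT part of the module
-- interface, so updating them never triggers downstream rebuilds.
data Interface = Interface { ifaceHash :: String }  -- what downstream modules see

data Externs = Externs
  { extInterface :: Interface
  , extInputHash :: String  -- cache metadata only, excluded from the interface
  }

-- Rebuild downstream modules only when the interface itself changed.
downstreamNeedsRebuild :: Externs -> Externs -> Bool
downstreamNeedsRebuild old new =
  ifaceHash (extInterface old) /= ifaceHash (extInterface new)
```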

@rhendric
Member

I don't see that causing any immediate practical problems, but I'm a little wary of putting timestamps in extern files. Hashes would be fine, but any time a build output contains a timestamp, I start to worry about whether anyone will want reproducible builds at some point in the future and have to add some janky post-processing to normalize or strip the timestamps out.

@hdgarrood
Contributor

Should we consider externs files to be build products from that perspective? I would have expected not, since they don't affect any of the JS file outputs, and you shouldn't really be looking at them if you aren't the compiler.

@rhendric
Member

Sure, if the externs files are only ever internal to the compiler, there's no reason to think of them as build products. But if #2477 gets addressed at some point, either the externs files, or something that looks an awful lot like externs files, will be included in the published artifact. We could have two different formats, I suppose, but I don't see a reason to assume we would.

@hdgarrood
Contributor

That's a good point. In that case, we might consider dropping the use of source file modified times in recompilation checking entirely; in commercialhaskell/stack#5351, that's what Stack did, and it sounds like for modern hardware the cost of computing the hashes is close to negligible.
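For scale, a minimal sketch of a hash-only up-to-date check, assuming the cryptonite package; names are illustrative:

```haskell
-- A minimal sketch of hash-based (rather than mtime-based) recompilation
-- checking, assuming the cryptonite package. Names are illustrative.
import Crypto.Hash (Digest, SHA256, hashlazy)
import qualified Data.ByteString.Lazy as BL

-- Hash a source file's contents; hashlazy consumes the lazy ByteString
-- chunk by chunk, keeping memory use bounded.
sourceHash :: FilePath -> IO (Digest SHA256)
sourceHash path = hashlazy <$> BL.readFile path

-- Rebuild only when the stored hash no longer matches the file on disk.
isUpToDate :: Digest SHA256 -> FilePath -> IO Bool
isUpToDate storedHash path = (== storedHash) <$> sourceHash path
```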

Successfully merging this pull request may close these issues.

When the compiler gets killed by OOM/SIGTERM the cache-db.json file is not updated