Update the build cache after building every module #3996

Open
wants to merge 1 commit into master
Conversation

@gorbak25

gorbak25 commented Jan 23, 2021

Instead of writing the build cache once at the end of the build, a separate thread now updates the DB as compilation results come in.
As a result, when a build is interrupted and started again, the compiler doesn't recompile previously compiled modules.

I've tested the behavior using Ctrl+C and killall -9 purs - with Ctrl+C the PR works as intended; with killall -9 purs it's possible that the DB becomes corrupted due to a partial write, but in practice I've rarely triggered that race. Fixing the latter case would require either:

  1. A proper DB
  2. Writing to cache-db.json.tmp and then issuing "mv cache-db.json.tmp cache-db.json" (sketched below)

For now I haven't implemented a fix for the possible partial write on SIGKILL, as I want to discuss the direction with the maintainers first.
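For reference, option 2 is only a few lines; a minimal sketch in Haskell, assuming the temp file lands on the same filesystem (where renameFile is atomic) and using illustrative names rather than the compiler's actual code:

```haskell
-- A minimal sketch of option 2: write to a temp file, then rename it over
-- the old cache. Assumes the .tmp file is on the same filesystem, where
-- renameFile is atomic. Names are illustrative, not the compiler's code.
import qualified Data.ByteString.Lazy as BL
import System.Directory (renameFile)

writeCacheDbAtomically :: FilePath -> BL.ByteString -> IO ()
writeCacheDbAtomically path encodedDb = do
  let tmp = path ++ ".tmp"
  BL.writeFile tmp encodedDb  -- a SIGKILL here leaves only the .tmp behind
  renameFile tmp path         -- readers see either the old file or the new one
```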

Fixes #3995

@rhendric
Member

I don't love the idea of rewriting the entire cache file every time a build task completes. It probably doesn't practically matter for projects that aren't very large, but introducing a quadratic complexity paid by everyone to cover a sad-path case doesn't sit right with me.

Without a way to recover from OOMs, though, eager writing of some sort does seem like the only way to achieve incremental resiliency. And that probably means that the cache database would need to be redesigned to something that can be appended to. I don't personally know what else consumes that cache file at the moment—is it depended upon by any external projects, or can we redesign it at will?

@natefaubion
Contributor

I don't personally know what else consumes that cache file at the moment—is it depended upon by any external projects, or can we redesign it at will?

The cache db is specifically an internal-only artifact.

@rhendric
Member

Oh good. In that case, I would propose refactoring it to be either an append-only stream of JSON objects (with amortized consolidation) or a directory with one file per module. (Either solution would also give a way to recover from partial-write corruption without worrying about temp files.)

@gorbak25
Author

gorbak25 commented Jan 23, 2021

Oh good. In that case, I would propose refactoring it to be either an append-only stream of JSON objects (with amortized consolidation) or a directory with one file per module. (Either solution would also give a way to recover from partial-write corruption without worrying about temp files.)

I personally like the idea of an append-only JSON file which gets consolidated at the beginning/end of the compilation - this also means that it would be backwards compatible with existing build caches. @rhendric How about the following backwards-compatible design (see the sketch after this list):

  1. Every JSON object in the cache is separated by a unique marker (newline/NUL byte or something similar)
  2. Read the cache file and split by the marker - try deserializing every object and then consolidate the map
  3. As compilation results come in, append to the file either {action: "delete", key: "..."} or {action: "update", key: "...", value: "..."}
  4. When the compilation ends, consolidate the map and write the result to cache-db.json.tmp
  5. Issue "mv cache-db.json.tmp cache-db.json" to replace the old file with the consolidated one

Steps 4 and 5 ensure that the DB size stays bounded. A sudden SIGKILL anywhere in steps 1-4 leaves the cache in a recoverable state (in the case of SIGKILL we consolidate the map before starting the next compilation).
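A minimal sketch of steps 2 and 3 in Haskell, assuming aeson and one JSON object per line; CacheAction and the exact encoding are illustrative, not a finished design:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- A minimal sketch of the append-only log (steps 2 and 3), assuming aeson
-- and newline-separated entries. CacheAction and the encoding are
-- illustrative, not a finished design.
import Data.Aeson (Value, decode, encode, object, (.:), (.=))
import Data.Aeson.Types (parseMaybe)
import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Data.Map as M
import Data.Maybe (mapMaybe)

data CacheAction = Update String Value | Delete String

-- Step 3: append one entry per compiled module as results come in.
appendEntry :: FilePath -> CacheAction -> IO ()
appendEntry path entry = BL.appendFile path (encodeEntry entry <> "\n")

encodeEntry :: CacheAction -> BL.ByteString
encodeEntry (Update k v) =
  encode (object ["action" .= ("update" :: String), "key" .= k, "value" .= v])
encodeEntry (Delete k) =
  encode (object ["action" .= ("delete" :: String), "key" .= k])

-- Step 2: replay the log; later entries win, and lines that fail to parse
-- (e.g. a torn final write after SIGKILL) are simply dropped.
consolidate :: BL.ByteString -> M.Map String Value
consolidate = foldl apply M.empty . mapMaybe decodeEntry . BL.lines
  where
    apply m (Update k v) = M.insert k v m
    apply m (Delete k)   = M.delete k m

decodeEntry :: BL.ByteString -> Maybe CacheAction
decodeEntry line = decode line >>= parseMaybe entryParser
  where
    entryParser o = do
      action <- o .: "action"
      k <- o .: "key"
      case (action :: String) of
        "update" -> Update k <$> o .: "value"
        "delete" -> pure (Delete k)
        _        -> fail "unknown action"
```

Dropping unparseable lines during consolidation is exactly what makes a torn write after SIGKILL recoverable.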

@f-f
Member

f-f commented Jan 23, 2021

Every JSON object in the cache is separated by a unique marker (newline/NUL byte or something similar)

@gorbak25 there is already a widespread convention for appendable JSON (JSON Lines, i.e. newline-delimited JSON) - if we were to follow that we should use newlines rather than NUL bytes or anything else, which would also make the file nicer to debug manually (see the example below)
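For illustration, a hypothetical cache log in that convention under gorbak25's scheme, with elided values:

```
{"action":"update","key":"Data.Maybe","value":"..."}
{"action":"update","key":"Data.Either","value":"..."}
{"action":"delete","key":"Data.Old"}
```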

@rhendric
Member

@gorbak25 Sounds broadly good!

@kritzcreek
Member

I like the atomicity we get from building up a temp file and copying it over the original cache-db.json.

I really don't want multiple files, as that's a nightmare when you're dealing with parallel processes. Think of an IDE server as well as a compilation process. In the current model no two processes can open the cache-db file in write mode at the same time, so the filesystem gives us some protection.
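For what it's worth, that single-writer discipline can also be made explicit; a minimal sketch, assuming the filelock package and an illustrative lock-file path:

```haskell
-- A minimal sketch of an explicit single-writer lock around the cache DB,
-- assuming the "filelock" package. The paths are illustrative.
import System.FileLock (SharedExclusive (Exclusive), withFileLock)

cacheDbPath :: FilePath
cacheDbPath = "output/cache-db.json"

-- A second writer blocks here instead of clobbering the file.
withCacheDbWriteLock :: IO a -> IO a
withCacheDbWriteLock action =
  withFileLock (cacheDbPath ++ ".lock") Exclusive (\_ -> action)
```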

We need to keep this part of the codebase as simple as possible, because it's dealing with cache invalidation, concurrency, and file systems at the same time, so we really have no room for accidental complexity.

@gorbak25 Regarding your original problem.

  • If the problem is the compiler compiling too many modules at the same time, you can limit the number of threads by passing appropriate RTS arguments to purs, like so: purs compile ..... +RTS -N1 -RTS
  • If you can't compile your whole project from scratch, you can pass incrementally larger sets of files to the purs compiler:
~\code\pscid [master ≡]> npx purs compile ".\.spago\prelude\v4.1.1\src\**\*.purs"
Compiling Type.Data.RowList
Compiling Type.Data.Row
Compiling Record.Unsafe
Compiling Data.Symbol
Compiling Data.NaturalTransformation
Compiling Data.Boolean
Compiling Control.Semigroupoid
Compiling Data.Show
Compiling Control.Category
Compiling Data.Void
Compiling Data.Unit
Compiling Data.Semiring
Compiling Data.Semigroup
Compiling Data.Ring
Compiling Data.BooleanAlgebra
Compiling Data.Eq
Compiling Data.CommutativeRing
Compiling Data.Ordering
Compiling Data.EuclideanRing
Compiling Data.Ord.Unsafe
Compiling Data.Ord
Compiling Data.DivisionRing
Compiling Data.Field
Compiling Data.Bounded
Compiling Data.Function
Compiling Data.Monoid
Compiling Data.Functor
Compiling Control.Apply
Compiling Control.Applicative
Compiling Control.Bind
Compiling Control.Monad
Compiling Prelude
Compiling Data.Semigroup.First
Compiling Data.Monoid.Additive
Compiling Data.Monoid.Disj
Compiling Data.Monoid.Conj
Compiling Data.Monoid.Endo
Compiling Data.Monoid.Multiplicative
Compiling Data.Semigroup.Last
Compiling Data.Monoid.Dual
~\code\pscid [master ≡]> npx purs compile ".\.spago\prelude\v4.1.1\src\**\*.purs" ".\.spago\newtype\v3.0.0\src\**\*.purs"
Compiling Data.Newtype

See how the second invocation doesn't recompile the prelude modules?

@rhendric
Member

rhendric commented Feb 6, 2021

@gorbak25, just wanted to make sure you saw #4010, if you're still planning on changing the logical cache file format.

@hdgarrood
Contributor

Oh oops yes, I was intending to comment here after opening that because this occurred to me while reading this thread, but I must have got distracted. Thanks @rhendric :)

@gorbak25
Author

gorbak25 commented Feb 8, 2021

@rhendric Yes I saw it :) Now I'm waiting for some free time to finish this PR :)

@hdgarrood
Contributor

I just had another idea. The obvious place to store the input timestamps and hashes is in the externs files themselves. We only have a separate cache-db.json file so that we can update the last-seen modification times of the input source files, if need be, without having to write to the externs files, because writing to an externs file will automatically trigger rebuilds of all downstream modules.

HOWEVER, we also plan to implement “cutoff” in the not too distant future. Cutoff opportunities occur when a module is rebuilt but its interface is unchanged after rebuilding. In this case, we don’t need to trigger rebuilds of downstream modules if those modules themselves haven’t changed.

The cache-db.json file is only necessary because we don’t have cutoff. If we did, we could safely store the last seen timestamps and hashes of input files in their respective externs file (by not considering this information to be part of the module interface), and not need a separate cache-db.json file at all. Then, the problem this PR aims to fix simply doesn’t arise in the first place.
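To make the cutoff idea concrete, a minimal sketch with hypothetical types standing in for the compiler's real ones:

```haskell
-- A minimal sketch of cutoff with hypothetical types. The key move: input
-- timestamps/hashes live in the externs file but are NOT part of the module
-- interface, so updating them never triggers downstream rebuilds.
data Interface = Interface { ifaceHash :: String }  -- what downstream modules see

data Externs = Externs
  { extInterface :: Interface
  , extInputHash :: String  -- cache metadata only, excluded from the interface
  }

-- Rebuild downstream modules only when the interface itself changed.
downstreamNeedsRebuild :: Externs -> Externs -> Bool
downstreamNeedsRebuild old new =
  ifaceHash (extInterface old) /= ifaceHash (extInterface new)
```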

@rhendric
Member

I don't see that causing any immediate practical problems, but I'm a little wary of putting timestamps in extern files. Hashes would be fine, but any time a build output contains a timestamp, I start to worry about whether anyone will want reproducible builds at some point in the future and have to add some janky post-processing to normalize or strip the timestamps out.

@hdgarrood
Contributor

Should we consider externs files to be build products from that perspective? I would have expected not, since they don't affect any of the JS file outputs, and you shouldn't really be looking at them if you aren't the compiler.

@rhendric
Member

Sure, if the externs files are only ever internal to the compiler, there's no reason to think of them as build products. But if #2477 gets addressed at some point, either the externs files, or something that looks an awful lot like externs files, will be included in the published artifact. We could have two different formats, I suppose, but I don't see a reason to assume we would.

@hdgarrood
Contributor

That's a good point. In that case, we might consider dropping the use of source file modified times in recompilation checking entirely; in commercialhaskell/stack#5351, that's what Stack did, and it sounds like for modern hardware the cost of computing the hashes is close to negligible.
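For scale, a minimal sketch of a hash-only up-to-date check, assuming the cryptonite package; names are illustrative:

```haskell
-- A minimal sketch of hash-based (rather than mtime-based) recompilation
-- checking, assuming the cryptonite package. Names are illustrative.
import Crypto.Hash (Digest, SHA256, hashlazy)
import qualified Data.ByteString.Lazy as BL

-- Hash a source file's contents; hashlazy consumes the lazy ByteString
-- chunk by chunk, keeping memory use bounded.
sourceHash :: FilePath -> IO (Digest SHA256)
sourceHash path = hashlazy <$> BL.readFile path

-- Rebuild only when the stored hash no longer matches the file on disk.
isUpToDate :: Digest SHA256 -> FilePath -> IO Bool
isUpToDate storedHash path = (== storedHash) <$> sourceHash path
```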

Successfully merging this pull request may close these issues.

When the compiler gets killed by OOM/SIGTERM the cache-db.json file is not updated