
Fix init soundness #13472

Merged: 51 commits into scala:master on Oct 14, 2021

Conversation

liufengyun (Contributor):

  • Fix init soundness and align with the theory
  • Code refactoring

@liufengyun liufengyun self-assigned this Sep 15, 2021
@liufengyun liufengyun changed the title WIP - Fix init soundness Fix init soundness Sep 20, 2021
@liufengyun liufengyun marked this pull request as ready for review September 22, 2021 21:20
@liufengyun liufengyun assigned olhotak and unassigned liufengyun Sep 22, 2021
olhotak (Contributor) commented Sep 23, 2021:

I have some high-level comments on the design that we should discuss in our next meeting.

olhotak (Contributor) commented Sep 28, 2021:

The heap is still an obstacle to reasoning about the correctness (soundness) of the analysis.

The heap appears to conflate multiple different things. Some of those things are essential to the theory of the analysis and other things are just caches. It is difficult to see that those that are just caches really are used only as caches.

Specifically, as far as I could tell, the heap contains, for each object, three things:

  1. For each field, a bit indicating whether it is already initialized or not. This is essential.
  2. For each field, a cache of the computation of its rhs. This is hopefully used only as a cache.
  3. For each outer, a cache of the eval of that outer. This is hopefully used only as a cache.

I would like to separate the above three things so that it becomes clear how each is being used, and whether they are really being used only in the ways we think they are.

For number 2., do we even need to cache it in the heap? We should already be caching it in the normal cache when we eval that rhs, so also caching it in the heap might be redundant.

There seems to also be some significant difference between heap objects for ThisRef (Hot) and for Warm refs. If there is, we should consider separating those as well.

A separate thing that I don't understand, and specifically I don't understand how it affects soundness, is the isPopulatingParams flag. It seems that the heap is also used for parameters in addition to fields, and that the two behave differently. Perhaps we should also separate parameters from non-parameter fields to clarify how this works.
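The separation asked for above could be sketched in a few lines. This is a purely illustrative model, not the checker's actual data structures: the names and field types are hypothetical stand-ins, and the point is only that the essential initialized-bits can be kept apart from the two pure caches.

```scala
// Hypothetical model of the three roles the heap currently conflates.
case class Objekt(
  initialized: Set[String],      // 1. per-field "already initialized" bit (essential)
  fieldCache: Map[String, Int],  // 2. cached result of each field's rhs (cache only)
  outerCache: Map[String, Int]   // 3. cached eval of each outer (cache only)
)

// If 2. and 3. really are only caches, clearing them must never change
// what the analysis computes, only how fast it computes it.
def dropCaches(o: Objekt): Objekt =
  o.copy(fieldCache = Map.empty, outerCache = Map.empty)
```

With this shape, "is it really used only as a cache?" becomes a checkable property: run the analysis with and without `dropCaches` and compare results.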

liufengyun (Author) commented:

Based on an idea from the online meeting, it seems there is a simple way to reason about the correctness of the heap.

Suppose that in each iteration, we always use an empty heap. This always works (might be a little slower for some tests). The heap does not play a role, as it's not an input to the iteration function. It's easy to justify that this approach is correct as discussed.

Now, when we reach a fixed point after an iteration, the values in the heap must all be fixed points. Then why throw them away, instead of treating them as a stable cache? If an iteration does not reach a fixed point, we simply use the previous fixed-point heap for the next iteration (reversion). The implementation performs exactly this optimization.
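The argument above can be modelled in a few lines. The sketch is a toy, the value and heap types are stand-ins, and the real iteration function is doWork over a cache: only a stable iteration promotes its heap; an unstable one falls back to the last stable heap.

```scala
// Toy model of the reversion argument: iterate `step`, and keep the
// heap produced by an iteration only when that iteration is stable.
def fixpointWithHeap(
    step: (Int, Map[String, Int]) => (Int, Map[String, Int]),
    init: Int
): (Int, Map[String, Int]) =
  var stableHeap = Map.empty[String, Int]  // last heap known to be stable
  var value = init
  var done = false
  while !done do
    val (next, heap) = step(value, stableHeap)
    if next == value then
      stableHeap = heap  // fixed point reached: heap entries are stable, keep them
      done = true
    else
      value = next       // unstable: discard heap, i.e. revert to stableHeap
  (value, stableHeap)
```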

olhotak (Contributor) commented Sep 30, 2021:

> Based on an idea from the online meeting, it seems there is a simple way to reason about the correctness of the heap.
>
> Suppose that in each iteration, we always use an empty heap. This always works (might be a little slower for some tests). The heap does not play a role, as it's not an input to the iteration function. It's easy to justify that this approach is correct as discussed.
>
> Now, when we reach a fixed point after an iteration, the values in the heap must all be fixed points. Then why throw them away, instead of treating them as a stable cache? If an iteration does not reach a fixed point, we simply use the previous fixed-point heap for the next iteration (reversion). The implementation performs exactly this optimization.

The problem is that the computation of what we place in the heap (presumably) depends on the current state of the Cache. In different iterations (and when analyzing the constructors of different classes), the Cache has changed. How do we know that the values that we placed in the heap in previous iterations are the same as we would place in the heap in the current iteration if we were to start with an empty heap?

olhotak (Contributor) commented Sep 30, 2021:

Perhaps it's worthwhile also to think separately about the two outputs of eval: the result Value and the list of errors. Currently, eval is a function from some list of inputs -- too many inputs -- to both a result value and a list of errors. If we could think about it as two functions, inputs1 => value and inputs2 => errors, we might be able to see that inputs1 contains fewer things than inputs2, for example maybe it doesn't contain parts of the current heap, and this would help show the correctness of finding a fixed point of the inputs1 => value function.

Of course in the implementation we probably need to compute both functions together, but maybe even there, there is some way to refactor parts of the code to make it clearer that the value doesn't depend on as many things. But before even looking at changing the code, even just identifying and documenting which things the inputs1 => value function depends on and doesn't depend on (i.e. which things are in inputs1) would already be a good first step.
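One way to document such a factoring is to write the two functions down explicitly. The sketch below is purely illustrative (toy expression, heap, and error types; the real eval works on trees and abstract values): the value part reads strictly fewer inputs than the error part.

```scala
// Toy factoring of eval into inputs1 => value and inputs2 => errors.
case class EvalResult(value: Int, errors: List[String])

// inputs1 => value: enough for the fixed-point computation; note that
// in this toy model it does not consult the heap at all.
def evalValue(expr: Int): Int = expr * 2

// inputs2 => errors: additionally inspects heap state.
def evalErrors(expr: Int, heap: Set[String]): List[String] =
  if heap.contains("uninitialized") then
    List(s"field access before initialization in expr $expr")
  else Nil

// The implementation computes both together, but the dependency
// structure of the two halves stays visible.
def eval(expr: Int, heap: Set[String]): EvalResult =
  EvalResult(evalValue(expr), evalErrors(expr, heap))
```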

liufengyun (Author) commented:

> Perhaps it's worthwhile also to think separately about the two outputs of eval

If I remember correctly from the meeting, eval does not matter, because it's not the iteration function --- the iteration function is doWork.

> The problem is that the computation of what we place in the heap (presumably) depends on the current state of the Cache.

I don't see why it matters, if the things it depends on are already fixed points and have been committed to the global cache. Otherwise, the global cache would be problematic as well.

olhotak (Contributor) commented Oct 4, 2021:

I've tried several attempts to simplify this so we can better reason about its correctness, but I keep getting stuck.

One thing that would help is if it were possible to separate the "populating parameters" part of checking an object from checking the class body. I see in some cases we already populate parameters without checking the class body. I also see that both rely on determining the linearization order. But (correct me if I'm wrong) populating parameters doesn't depend on reading non-parameter fields of the same object, so cannot read an uninitialized field. If that's the case, then it's not strictly necessary to interleave populating parameters and checking the class body of superclasses of a given class; we could populate parameters, and then, separately, check the class body.

Separating them would require splitting out the code that determines the linearization order because that's needed for both.

If we could do this, then we could have a "populate parameters" function whose result would be part of the abstract domain just like the tree->value map currently is. That is, the abstract domain would be two maps, tree->value and ref->parameters/outers (currently Objekt). Then it would be easier to show correctness since we would find a fixed point for this pair of maps. This is important because populating parameters depends on eval and eval depends on populating parameters.
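As a sketch of that proposal (keys are strings here for brevity; in the checker they would be trees and refs, and the object part is what Objekt holds today), the abstract domain becomes a pair of maps, and the fixed point is taken over the pair jointly:

```scala
// Hypothetical two-map abstract domain: tree -> value and
// ref -> parameters/outers.
case class Domain(
  values: Map[String, Int],
  objects: Map[String, List[Int]]
)

// An iteration is stable exactly when the whole pair is unchanged.
// Taking the fixed point over the pair is what makes the mutual
// dependency between populating parameters and eval well-founded.
def isStable(prev: Domain, next: Domain): Boolean = prev == next
```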

liufengyun (Author) commented:

> But (correct me if I'm wrong) populating parameters doesn't depend on reading non-parameter fields of the same object, so cannot read an uninitialized field. If that's the case, then it's not strictly necessary to interleave populating parameters and checking the class body of superclasses of a given class; we could populate parameters, and then, separately, check the class body.

Populating the outers and class parameters may involve evaluating arbitrary code, as constructor arguments can be method calls, which in turn may contain arbitrary code.

> Then it would be easier to show correctness since we would find a fixed point for this pair of maps. This is important because populating parameters depends on eval and eval depends on populating parameters.

Here it seems there is an important difference between our analysis and the Abstract Definitional Interpreters paper: there, due to the nature of whole-program analysis, it seems that eval is the iteration function (strictly speaking, it's not in that setting either).

In our case, local reasoning enables modular checking. The iteration function of the modular check is doWork --- whose implementation uses eval, init and populateParams. For a given class under checking, the input to doWork is Cache.in and the output is Cache.out (suppose we make the input heap empty, so Omega is always empty).

Now, if doWork reaches a fixed point, then all the values in the heap are fixed points too (all values in the heap are in the output cache) --- then why throw the heap away, instead of using it as a global cache? If an iteration does not reach a fixed point, we need to revert the heap to its previous stable state.

if cache.containsObject(this) then this
else this.ensureFresh().populateParams()

def ensureObjectFreshAndPopulated(): Contextual[this.type] =
olhotak (Contributor):
unused, consider removing

*
* Invariant: fields are immutable and only set once
*/
def updateField(field: Symbol, value: Value): Contextual[Unit] = log("set field " + field + " of " + ref + " to " + value) {
olhotak (Contributor):

Suggested change:

    def updateField(field: Symbol, value: Value): Contextual[Unit] = log("set field " + field + " of " + ref + " to " + value) {
      assert(!cache.stableHeapContains(ref))

This helps ensure that the stable heap really remains stable. During an iteration, we should not update fields of objects that are already in the stable heap and therefore supposed to be stable.

I've checked that the tests pass with this.

Note that this requires stableHeapContains method to be added to Cache because heapStable is private.

liufengyun (Author):

It's a very good invariant to check. Your suggested change above already adds the same assertion in Cache.updateObject, so maybe we can skip the duplicate assertion here.

olhotak (Contributor):

It's not intended to be a duplicate. stableHeap is a two-level map: the heap maps a Ref to an Objekt, and the Objekt maps each field (Symbol) to a Value. The two assertions are for the two levels of the map, to ensure that both levels stay stable.

liufengyun (Author):

The objects are immutable data structures; this method will call cache.updateObject(ref, obj2) and thus hit the invariant check there.

olhotak (Contributor):

OK. I missed that Objekt was immutable. I mistakenly thought fields and outers were mutable maps in an Objekt.

- case class Warm(klass: ClassSymbol, outer: Value, ctor: Symbol, args: List[Value]) extends Addr
+ case class Warm(klass: ClassSymbol, outer: Value, ctor: Symbol, args: List[Value]) extends Ref {

/** If a warm value is in the process of populating parameters, class bodies are not executed. */
olhotak (Contributor):

Consider renaming populatingParams to something like dontCheckBodies.

It's not clear what "populating parameters" refers to since we always populate parameters of every class we abstractly interpret. As far as I can tell this is used only when we widen an existing ref to one with Cold outer and need to recompute parameters but don't want to check the bodies because we've already checked them. A comment would help here to explain why/when we do this (or something different if my understanding here is incorrect).

Having a flag like this seems fragile compared to passing a parameter to init, callConstructor, etc., or having one in the implicit context, but both of those are messy. I guess the flag is OK. Is there an issue with recursion? What if during computing a parameter of a Warm we instantiate the same Warm, is it possible that the flag will be incorrect for that other Warm?

olhotak (Contributor):

Consider having two flags or parameters or implicit parameters: populateParams and checkBodies. In init, we go through constructors and do rather separate things in each one we encounter. Having them separated under flags would make it clearer what each part of the init method is doing: some of it is common plumbing/iteration to identify superclasses in the right order, then other parts are specific to one or the other task.

Having two flags could help us avoid wasted work in recomputing parameters in the case explained in your comment in updateField: in that case, we could run init with checkBodies true but populateParams false (since we've already run populateParams before but without checking bodies, and have now discovered that we need to check bodies as well.)

Also consider splitting these two tasks (populate params and check bodies) in init to methods that could be called from init. At the moment, init is large.
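To make the suggestion concrete, here is a hypothetical shape for such an init. The flag names come from the comment above; everything else (string arguments, returning the performed steps) is invented for illustration, as the real init takes a class symbol and a template:

```scala
// Sketch only: two explicit flags instead of one mutable
// populatingParams bit on the Warm value.
def init(cls: String, populateParams: Boolean, checkBodies: Boolean): List[String] =
  val steps = List.newBuilder[String]
  if populateParams then steps += s"populate params/outers of $cls"
  if checkBodies then steps += s"check class body of $cls"
  steps.result()
```

The recomputation case described above would then read as `init(cls, populateParams = false, checkBodies = true)`, making the intent visible at the call site.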

liufengyun (Author):

> Consider having two flags or parameters or implicit parameters: populateParams and checkBodies.

As you commented, that would become messy, because the flags would have to be threaded through every call site. Also, if we have multiple warm values in a context, it will be even more confusing and a source of errors.

> Is there an issue with recursion? What if during computing a parameter of a Warm we instantiate the same Warm, is it possible that the flag will be incorrect for that other Warm?

It could be --- in those cases, the outers and parameters might be set twice; that's one reason we need to weaken the invariant that fields and outers are set only once.

> It's not clear what "populating parameters" refers to since we always populate parameters of every class we abstractly interpret. As far as I can tell this is used only when we widen an existing ref to one with Cold outer and need to recompute parameters but don't want to check the bodies because we've already checked them.

There are other cases: we first use a Warm value from a previous cache, which requires populating the parameters, and later we create an instance of the warm value, which requires initialization checking and thus resets the outers and parameters. The two might be interleaved, as mentioned in the Q/A above.

> Also consider splitting these two tasks (populate params and check bodies) in init to methods that could be called from init.

init follows the semantics of initialization; I don't see an easy way to make it simpler.

* 3. Revert heap if unstable.
*
*/
def prepareForNextIteration(isStable: Boolean)(using State, Context) =
olhotak (Contributor):

I think this would be clearer with two methods rather than one method with a flag. That is, have one method commitToStable and a second method prepareForNextIteration, and then either call both of them in sequence or only prepareForNextIteration.
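A sketch of that split, where only the two method names come from the comment and the state layout is invented for illustration:

```scala
// Hypothetical cache with the commit and reset concerns separated.
class CacheModel:
  private var current = Map.empty[String, Int]
  private var stable = Map.empty[String, Int]

  def update(key: String, value: Int): Unit =
    current += (key -> value)

  // Called only when the iteration is stable: promote results.
  def commitToStable(): Unit =
    stable = stable ++ current

  // Always called between iterations: clear per-iteration state.
  def prepareForNextIteration(): Unit =
    current = Map.empty

  def stableSnapshot: Map[String, Int] = stable
```

A stable iteration calls `commitToStable()` then `prepareForNextIteration()`; an unstable one calls only the latter, so nothing uncommitted ever leaks into the stable state.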

final def work()(using State, Context): Unit =
pendingTasks match
case task :: rest =>
checkedTasks = checkedTasks + task
olhotak (Contributor):

This is the key iteration loop. It would make the overall algorithm more understandable if the code showed more explicitly what an iteration does. The structure I have in mind is something like:

// code to prepare cache and heap for analyzing a new class
def iterate = {
  // code to prepare cache and heap for next iteration
  doTask(task)
  if(hasChanged)
    iterate
  else
    // code to commit cache/heap to stable
}
iterate
pendingTasks = rest
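A runnable rendering of that proposed structure, where doTask is a stub returning whether anything changed and the real task type and cache/heap preparation are elided:

```scala
// Sketch of the proposed loop shape: iterate one task to its own
// fixed point before moving on to the next pending task.
def runTasks(tasks: List[String], doTask: String => Boolean): Set[String] =
  var checked = Set.empty[String]
  def iterate(task: String): Unit =
    // preparing cache and heap for the next iteration would go here
    val hasChanged = doTask(task)
    if hasChanged then iterate(task)
    // else: committing cache/heap to stable would go here
  for task <- tasks do
    checked += task
    iterate(task)
  checked
```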

olhotak (Contributor):

We need to clarify whether a "task" means doing all the iterations for a class or doing just one. Currently it's just one iteration but that doesn't fit with the idea of a worklist since it wouldn't be correct to interleave individual iterations for different classes. So I think a "task" should be all of the analysis of a single class.

We don't really need the worklist anymore but I think we should keep it in case we change the design to need it in the future.

liufengyun (Author):

Yes, we may need it in the future. Currently, we can't remove the worklist, because we only want to traverse the trees once.


var fieldsChanged = true

// class body
tpl.body.foreach {
if !thisV.isWarm || !thisV.asInstanceOf[Warm].isPopulatingParams then tpl.body.foreach {
olhotak (Contributor):

Consider adding isThisRef method to avoid negated !isWarm. See also below on line 1407.

liufengyun and others added 23 commits October 7, 2021 22:56
The problem is that all warm objects are populated with outers and
class parameters in Heap.prepare(). If we call `init` on those
objects, the only-set-once invariant will be violated and crash.
Two motivations: First, it's close to the essence of its
functionality. Second, it's easier to update the cache state in
fixed-point computation.
The following test explodes: tests/init/pos/local-warm5.scala.

Ideally we should use an actual set in RefSet, which we avoid for both performance and determinism reasons. The performance concern might be a premature optimization.
Use correct klass for evaluating rhs of local variables.
Co-authored-by: Ondřej Lhoták <olhotak@uwaterloo.ca>
Co-authored-by: Natsu Kagami <natsukagami@gmail.com>
liufengyun (Author) commented:

The performance impact is checked in #13715 and can be seen here:

https://dotty-bench.epfl.ch/13715/

The last two points correspond to the performance of -Ysafe-init for master and this PR respectively.

@olhotak olhotak merged commit af9594d into scala:master Oct 14, 2021
@liufengyun liufengyun deleted the fix-init-soundness branch February 8, 2022 08:36
@Kordyjan Kordyjan added this to the 3.1.1 milestone Aug 2, 2023