Dataflow: Implement call context grouping to improve performance #16456

aschackmull · 2024-05-08T13:51:21Z

This replaces our call context representation with a set-based representation of the allowed call edges, thus unifying equivalent call contexts. This unification means that we avoid redundant computation when a callable is reachable through several different but equivalent call contexts.

In some cases I've observed 3000 equivalent call contexts, which will now be replaced by a single entity, and in that particular case the total tuple count for the stage was cut in half.
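As an illustration outside of QL, the following is a minimal Python sketch of the grouping idea: contexts are keyed by the set of call edges they allow, so contexts with identical sets collapse into one group. The context names, edge tuples, and the `group_contexts` helper are hypothetical and not taken from the PR; the actual implementation uses the MakeSets module.

```python
from collections import defaultdict

# Hypothetical toy data: each call context maps to the set of call edges
# (call, target) that it allows. All names are illustrative only.
allowed_edges = {
    "ctx_a": {("call1", "Foo.run"), ("call2", "Bar.run")},
    "ctx_b": {("call2", "Bar.run"), ("call1", "Foo.run")},
    "ctx_c": {("call1", "Baz.run")},
}


def group_contexts(allowed):
    """Group call contexts whose allowed-edge sets are identical."""
    groups = defaultdict(list)
    for ctx, edges in allowed.items():
        # A frozenset is hashable and ignores insertion order, so two
        # contexts with the same edges end up under the same key.
        groups[frozenset(edges)].append(ctx)
    return {key: sorted(ctxs) for key, ctxs in groups.items()}


groups = group_contexts(allowed_edges)
for edges, ctxs in groups.items():
    print(sorted(edges), "<-", ctxs)
```

Here `ctx_a` and `ctx_b` share one group, so any computation keyed on the group runs once rather than once per context.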

Commit-by-commit review is encouraged. The first commit introduces the MakeSets primitive, which is then used to collapse local call contexts. A sequence of refactoring commits follows. Finally, the "Dataflow: Switch call context to a set representation" commit contains the implementation of, and the switch to, fully set-based call contexts. This commit also does some reshuffling to ensure proper caching in CachedCallContextSensitivity.

java/ql/lib/semmle/code/java/dataflow/internal/DataFlowPrivate.qll

shared/dataflow/codeql/dataflow/internal/DataFlowImpl.qll

+
+      import CallContextSensitivity<CallContextSensitivityInput>
+      import LocalCallContext
+    }


shared/dataflow/codeql/dataflow/internal/DataFlowImpl.qll

+    }
+
+    private class CallContext = PrunedCallContextSensitivityStage5::Cc;
+


shared/dataflow/codeql/dataflow/internal/DataFlowImplCommon.qll

shared/dataflow/codeql/dataflow/internal/DataFlowImpl.qll

-          predicate reducedViableImplInReturnCand =
-            CachedCallContextSensitivity::reducedViableImplInReturn/2;
-        }
+        predicate reducedViableImplInCallContextCand =


shared/dataflow/codeql/dataflow/internal/DataFlowImpl.qll


-        import CallContextSensitivity<CallContextSensitivityInput>
+        predicate reducedViableImplInReturnCand =


shared/dataflow/codeql/dataflow/internal/DataFlowImpl.qll

-          predicate reducedViableImplInReturnCand =
-            Stage3Param::Level1CallContextInput::reducedViableImplInReturn/2;
-        }
+        predicate reducedViableImplInCallContextCand =


shared/dataflow/codeql/dataflow/internal/DataFlowImpl.qll


-        import CallContextSensitivity<CallContextSensitivityInput>
+        predicate reducedViableImplInReturnCand = Stage3Param::reducedViableImplInReturn/2;


shared/dataflow/codeql/dataflow/internal/DataFlowImpl.qll

+
+      predicate callContextNone = CachedCallContextSensitivity::ccNone/0;
+
+      predicate callContextSomeCall = CachedCallContextSensitivity::ccSomeCall/0;


owen-mc · 2024-05-22T10:26:18Z

go/ql/lib/semmle/go/dataflow/internal/DataFlowPrivate.qll

+  int totalorder() {
+    this =
+      rank[result](DataFlowCallable c, string file, int startline, int startcolumn |
+        c.hasLocationInfo(file, startline, startcolumn, _, _)


Why does the total order for DataFlowCallable take the file into account, but not those for DataFlowCall or NodeRegion?

That's because we're ordering calls and regions within a specific callable body, so the ones we need to compare are always in the same file. But the callables that we need to compare can be spread across multiple files, so they benefit from the additional comparison column.
My hope is that this ordering is fairly temporary, as I'd like to get the MakeSets module turned into a QL primitive - once it's implemented in the evaluator then there's no need for these arbitrary orderings.

That makes sense. In the cases where we are only comparing within the same file, would it make sense to limit the ordering to other instances in the same file? Or do you really want to have a total ordering over all calls, say, even if we only compare ones in the same file?

A best-effort total ordering is simpler than trying to make it specifically per file - in the end it doesn't matter much, we just need some completely arbitrary ordering, and even in the cases where we fail to get this, the MakeSets module provides a fallback solution.
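A small Python sketch of what "best-effort" means here, under stated assumptions: elements are ranked by `(file, startline, startcolumn)`, and two elements with the same location cannot be separated by that key. The callable names and the `best_effort_rank` helper are hypothetical. Letting tied elements share a rank is a choice of this model, not a statement about QL's `rank` aggregate, and how MakeSets falls back is not shown in this thread, so the sketch only detects that the order is not total.

```python
from itertools import groupby

# Hypothetical callables with (file, startline, startcolumn) locations.
callables = {
    "A.f": ("a.go", 10, 1),
    "B.g": ("b.go", 10, 1),  # same line/column as A.f, but a different file
    "A.h": ("a.go", 3, 5),
    "A.k": ("a.go", 3, 5),  # same location as A.h: the key cannot separate them
}


def best_effort_rank(items):
    """Assign 1-based ranks by location; items with equal keys share a rank."""
    ordered = sorted(items.items(), key=lambda kv: kv[1])
    ranks = {}
    grouped = groupby(ordered, key=lambda kv: kv[1])
    for rank, (_, group) in enumerate(grouped, start=1):
        for name, _ in group:
            ranks[name] = rank
    return ranks


ranks = best_effort_rank(callables)
# The order is total only if no two elements share a rank.
is_total = len(set(ranks.values())) == len(ranks)
print(ranks, is_total)
```

Including the file in the key is what separates `A.f` from `B.g`; only `A.h` and `A.k` remain tied.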


hvitved

Overall approach LGTM. A few comments/suggestions.

shared/util/codeql/util/internal/MakeSets.qll

shared/dataflow/codeql/dataflow/internal/DataFlowImpl.qll

shared/dataflow/codeql/dataflow/DataFlow.qll

hvitved · 2024-05-22T12:54:52Z

java/ql/lib/semmle/code/java/dataflow/internal/DataFlowPrivate.qll

+private predicate idOf(BasicBlock x, int y) = equivalenceRelation(id/2)(x, y)
+
+class NodeRegion instanceof BasicBlock {
+  string toString() { result = "NodeRegion" }


Use final extends instead?

That feels slightly wrong to me, since a NodeRegion isn't necessarily a basic block - it could also be defined as e.g. a set of nodes guarded by a given guard. Currently the data flow side of the implementation assumes that NodeRegions are disjoint, but that doesn't have to be the case. It could conceivably be useful to define them differently if we run into guards that guard a huge number of basic blocks.

But it's a purely internal implementation detail, right, so we can always change it as we see fit? I merely suggested using final extends to avoid the dummy toString definitions.

final extends needs a dummy alias instead, so there's not much saved. And it still feels wrong for the stated reason: A basic block is certainly a region of nodes, but a region of nodes is not necessarily a basic block.

hvitved · 2024-05-22T12:58:09Z

go/ql/lib/semmle/go/dataflow/internal/DataFlowPrivate.qll

+class NodeRegion instanceof BasicBlock {
+  string toString() { result = "NodeRegion" }


final extends?

shared/dataflow/codeql/dataflow/internal/DataFlowImplCommon.qll

hvitved · 2024-05-23T11:57:58Z

shared/dataflow/codeql/dataflow/internal/DataFlowImplCommon.qll

@@ -522,14 +502,155 @@ module MakeImplCommon<LocationSig Location, InputSig<Location> Lang> {
      )
    }

+    private module CallSetsInput implements MkSetsInp {


I see that this is needed because DispatchSets does not handle the case when there are no dispatch targets. However, I think it would be more explicit if we had something like:

    private module DispatchSetsInput implements MkSetsInp {
      class Key = TCallEdge;

      class Value = Option<TCallEdge>::Option;

      Value getAValue(TCallEdge ctxEdge) {
        exists(DataFlowCall ctx, DataFlowCallable c, DataFlowCall call, DataFlowCallable tgt |
          ctxEdge = TMkCallEdge(ctx, c) and
          reducedViableImplInCallContext(call, c, ctx)
        |
          result.asSome() = TMkCallEdge(call, tgt) and
          viableImplInCallContextExtIn(call, ctx) = tgt
          or
          not exists(viableImplInCallContextExtIn(call, ctx)) and
          result.isNone()
        )
      }
    }

Something like that could work as well, but I chose the current version to avoid wrapping all of the DispatchSet edges in 'some'-wrappers. Also, it becomes quite tricky to get right if we stuff everything together like that - your version above e.g. fails to specify which calls have no dispatch, so an option-type doesn't carry enough information. For a given call affected by the context, there's either a non-empty set of dispatch edges or an empty set, but that empty set cannot be represented by a none value, because then we don't know what call it refers to.

Right, the Option would have to apply to the Callable in TCallEdge. Let's leave it as-is then. I think this section deserves a comment then about why we need both CallSets and DispatchSets.
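The point about the option type can be sketched in Python with a toy model. It assumes, as my reading of the thread, that one set records the calls affected by a context and the other records the remaining dispatch edges. All names (`reduced_calls`, `viable_targets`, `calls_without_dispatch`) are hypothetical and do not mirror the QL code.

```python
# Hypothetical toy model: for one context edge, which calls have their
# dispatch reduced, and which targets remain viable for each call.
reduced_calls = ["call1", "call2"]
viable_targets = {
    "call1": ["Foo.run"],
    "call2": [],  # no viable target in this context
}

# Two-set representation: the affected calls, plus the dispatch edges.
call_set = frozenset(reduced_calls)
dispatch_set = frozenset(
    (call, tgt) for call in reduced_calls for tgt in viable_targets[call]
)

# Option-style representation: one value per edge, or None when a call
# has no target. The None does not say which call it belongs to.
option_set = frozenset(
    value
    for call in reduced_calls
    for value in ([(call, tgt) for tgt in viable_targets[call]] or [None])
)


def calls_without_dispatch(calls, edges):
    """Calls that are affected by the context but have no dispatch edge."""
    return sorted(calls - {call for call, _ in edges})


print(calls_without_dispatch(call_set, dispatch_set))
print(option_set)
```

With the two sets, the call that has no target can be recovered as the difference between them. In the option-style set the bare `None` records that some call has no target, but not which one.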

shared/dataflow/codeql/dataflow/internal/DataFlowImplCommon.qll

aschackmull · 2024-05-24T09:30:57Z

> Overall approach LGTM. A few comments/suggestions.

All comments should be addressed now.

aschackmull · 2024-05-24T12:01:21Z

More comments addressed. (Including one change that had somehow slipped from the first commit).

shared/dataflow/codeql/dataflow/internal/DataFlowImplCommon.qll

+
+    private module CallSets = MakeSets<CallSetsInput>;
+
+    private module CallSetOption = Option<CallSets::ValueSet>;


shared/dataflow/codeql/dataflow/internal/DataFlowImplCommon.qll

+
+    private module DispatchSets = MakeSets<DispatchSetsInput>;
+
+    private module DispatchSetsOption = Option<DispatchSets::ValueSet>;


hvitved

The latest DCA run for C# seems to be broken. The Ruby DCA run shows a nice speedup on gh/gh.

Perhaps it is worth running a final batch of DCA for all languages before merging?

aschackmull · 2024-05-24T13:00:03Z

> Perhaps it is worth running a final batch of DCA for all languages before merging?

Sure. Running now.

github-actions bot added C# C++ Java Python Go Ruby Swift DataFlow Library labels May 8, 2024

github-advanced-security bot found potential problems May 8, 2024


aschackmull force-pushed the dataflow/callcontext-grouping branch 2 times, most recently from 985b8f0 to 0520839 Compare May 14, 2024 13:06

github-advanced-security bot found potential problems May 14, 2024


aschackmull marked this pull request as ready for review May 15, 2024 14:00

aschackmull requested review from a team as code owners May 15, 2024 14:00

aschackmull added the no-change-note-required This PR does not need a change note label May 15, 2024

aschackmull force-pushed the dataflow/callcontext-grouping branch from 0604d21 to ab1f10a Compare May 22, 2024 08:22

owen-mc reviewed May 22, 2024


aschackmull added 7 commits May 23, 2024 10:21

Shared: Add MakeSets module.

722da70

Dataflow: Introduce NodeRegions for use in isUnreachableInCall.

30fa1c8

Dataflow: Remove duplicate definitions

2b68380

Dataflow: Switch local call contexts to use canonical representative.

4916b7e

Dataflow: Switch column order in viableImplCallContextReducedReverse.

7e19516

Dataflow: Refactor getLocalCc to avoid reference to NodeEx.

6647ac8

Dataflow: Share getCallContextReturn in DataFlowImplCommon::CallContextSensitivity.

b4e3219

aschackmull added 16 commits May 23, 2024 10:21

Dataflow: Share getCallContextCall in DataFlowImplCommon::CallContextSensitivity.

c86f633

Dataflow: Rename prunedViableImplInCallContext to viableImplCallContextReduced

cd6c455

Dataflow: Rename noPrunedViableImplInCallContext to viableImplNotCallContextReduced.

1a3fef3

Dataflow: Rename prunedViableImplInCallContextReverse to viableImplCallContextReducedReverse.

2d1de60

Dataflow: Move viableImplNotCallContextReducedReverse to DataFlowImplCommon::CallContextSensitivity.

a25b9a8

Dataflow: Simplify.

ae733d2

Dataflow: Simplify: remove Level1CallContextInput module

d45a4c2

Dataflow: Move Level1CallContext to DataFlowImplCommon

2d9cb9d

Dataflow: Move two declarations.

5e31117

Dataflow: Make CallContext type private to DataFlowImplCommon.

75809e3

Dataflow: Make two predicates private.

7f2f308

Util: Allow best-effort total orders with a reasonable fallback.

667692a

Dataflow: Switch call context to a set representation.

855ed32

Dataflow: Add totalorder predicates to all languages.

b095bbf

C++/C#/Java: Update expected output.

7456094

C++/Shared: Fix join order issues.

34b82fa

aschackmull force-pushed the dataflow/callcontext-grouping branch from ab1f10a to 34b82fa Compare May 23, 2024 08:24

hvitved reviewed May 23, 2024


Dataflow: Address review comments.

fc89de0

Dataflow: Address review comments (take 2).

ee4f070

github-advanced-security bot found potential problems May 24, 2024


hvitved reviewed May 24, 2024

