Revise how reasoner planner does answer count estimates #6822

krishnangovindraj · 2023-06-08T15:57:31Z

What is the goal of this PR?

This PR introduces a 'connectivity-graph' abstraction to simplify how the reasoner planner's answer count estimator models the effect of constraints in a conjunction on each other.

What are the changes implemented in this PR?

The reasoner planner was originally designed to estimate the answer count of a conjunction of constraints as the greedy-cover of the estimates of the variables by the individual constraints. Later, we would scale down estimates based on the variables it shared with other constraints. e.g.: If a constraint A had an estimate of A.x for a variable $x, And a constraint B had an estimate of B.x, such that B.x < A.x; We would scale the model of A by a factor (B.X/A.x) - applying the selectivity of B to A.

The drawback of the greedy-cover is that the connectivity constraints were local, and would not propagate across multiple resolvables.[1]

In the new implementation, we create a "connectivity-graph" representing the expected answer graph. The graph reflects the structure of the query, with each vertex (or edge) labelled with the expected number of corresponding vertices (or edges) in the answer graph. Further, we do a minimal-spanning-tree style estimate to replace the greedy set-cover - by construction of the connectivity graph, the connectivity constraints apply across multiple resolvables.

Constructing the connectivity graph

The AnswerCountEstimator has a model for each constraint in the conjunction. Each of these can be mapped to a connectivity subgraph, where the model gives us the labels of the vertices and edges. e.g., The model for a has concludable will have an estimate for each variable and for the number of has edges between them. In the corresponding connectivity (sub)graphs, these will be the labels of the vertices and edges respectively.

When we combine two connectivity (sub)graphs, we "scale" down the variable labels on each so they agree on the estimates of common variables. We then propagate this by scaling down the number of edges connected to the variable. If the number of edges is less than the vertex on the other end, we scale it down to the number of edges and propagate it further.

Estimating answer count from the graph

The answer count of a conjunction is the set of all (combinations) of answers to the variables in the conjunction.

If $v(x)$ is the label for the vertex corresponding to $x in the graph, and $e(x, y)$ is the label for the (undirected) edge between $x and $y, We define the directed connectivity from $x to $y as $c(x, y) = e(x, y)/x$.

First, consider the simplified problem of finding the answers to all variables (v/s some subset) for a query with a tree structure.
We could start an any vertex $x and compute the product of connectivities for each edge in the tree rooted at $x (for the direction root -> leaf).

$ answers = v(x) \times \prod_{y\in adj(x)} { c(x,y) \prod_{z \in adj(y)} {c(y,z) * ...} } $

Two complications arise:

When the query is not tree structured, neither is our connectivity graph, and there is not a unique tree. We ideally would do the same for the tree which yields the minimum result for the $answers$ . This corresponds to the minimum weight directed spanning tree. Since we're taking the products of connectivity, the weights of the edge have to be the log of the actual connectivities of the directed edges.
When we are only querying the answer count of a subset of variables, we cannot simply multiply the connectivities because the estimate for how many vertices of variable $x we are connected to might exceed the number of unique vertices possible for $x (i.e. $v(x)$) (see [3]). For this, we have to cap the product we're computing at each vertex so it does not exceed $v(x)$ %

Note that the minimum spanning tree for the set of all variables may not be the same for a subset of the variables. Ideally, we would find a combined solution to both the problems.

[1] if we had 15 each of $x, $y, and $z. One $x is connected to exactly one unique $y through a constraint (x,y) and one $y is connected to one unique $z through ($y,$z). The greedy cover would pick both (x,y) and (y,z) but would have an estimate of 15 from (x,y) and 15 for (y,z), yielding an answer of 225. The correct answer would be to see that each $x is connected to one $y and one $z in turn, for a total answer count of 15.

[2]: When querying the answers to {$x,$y} for a query-graph ($x-$y-$z), We may have one $x, connected to five $y, each connected to one $z. The minimal spanning tree would have a weight of 5, whereas the number of answers cannot exceed 1 (the only pair of $x, $y which exist in the database).

vaticle-bot · 2023-06-08T15:57:34Z

krishnangovindraj

Quick round of self-review of the trivial changes. Self-review of the meat coming soon.

krishnangovindraj · 2023-06-08T17:03:47Z

reasoner/planner/OrderingCoster.java

@@ -178,7 +178,7 @@ Map<Resolvable<?>, Double> resolvableHeuristics(ReasonerPlanner.CallMode callMod
            } else if (res.isNegated()) {
                estimate = iterate(res.asNegated().disjunction().conjunctions()).map(conj -> {
                    Set<Variable> conjVariables = estimateableVariables(conj.pattern().variables());
-                    Set<Variable> commonVars = Collections.intersection(conjVariables, resVars);
+                    Set<Variable> commonVars = Collections.intersection(conjVariables, conjunctionNode.conjunction().pattern().variables());


Bugfix: Can apply to development separately if needed.

krishnangovindraj · 2023-06-08T17:07:20Z

test/benchmark/reasoner/iam/basic/BasicTest.java

@@ -73,7 +73,7 @@ public void testOwnership() {

        benchmark.assertAnswerCountCorrect();
        benchmark.assertRunningTime(500);
-        benchmark.assertCounters(10, NACCESS, NACCESS + 1, NACCESS + 1);
+        benchmark.assertCounters(10, NACCESS, 2, 2);


Improvement, apparently.

krishnangovindraj · 2023-06-08T17:07:31Z

reasoner/planner/ReasonerPlanner.java

+            .forEach(callMode -> plan(callMode));
+    }
+
+    public void planRoot(ResolvableConjunction conjunction) {


New, neater interface imo

krishnangovindraj · 2023-06-08T17:10:20Z

test/benchmark/reasoner/iam/complex/ConjunctionStructureTest.java

@@ -83,7 +83,8 @@ public void testMultipleStartingPoints() {

        benchmark.assertAnswerCountCorrect();
        benchmark.assertRunningTime(1000);
-        benchmark.assertCounters(500, 17, 50, 120);
+        benchmark.assertCountersBelow(500, 14, 100, 140, 1.0);
+        benchmark.assertCountersAbove(12, 48, 80, 1.0); // TODO: Assert lower bound for CompoundStreams


Had to separate the variance was too high to do it in a single number.

The variance is caused indeed by the new scaling, but it seems to be the correct behaviour, since the plans being generated are equal cost as far as the planner can see
(The specific case is the variables involved in the inter-changeable constraints are bound, hence have estimate 1, so even our multiplicative cost cannot differentiate the plans)

krishnangovindraj · 2023-06-08T17:11:15Z

test/integration/reasoner/planner/AnswerCountEstimatorTest.java

@@ -200,7 +197,7 @@ public void test_owns_based_estimate_two_concludables() {
            ResolvableConjunction conjunction = ResolvableConjunction.of(resolvedConjunction("{ $m isa man, has name $n; $p isa! person, has $n; }", transaction.logic()));

            double answers = answerCountEstimator.estimateAnswers(conjunction, getVariablesByName(conjunction.pattern(), set("m", "p")));
-            assertEquals(4.0, answers);
+            assertEquals(2.0, answers);


All the fixes here are just to reflect the better scaling.

krishnangovindraj

Self review AnswerCountEstimator.

krishnangovindraj · 2023-06-09T10:22:41Z