-
Notifications
You must be signed in to change notification settings - Fork 336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revise how reasoner planner does answer count estimates #6822
base: development
Are you sure you want to change the base?
Revise how reasoner planner does answer count estimates #6822
Conversation
PR Review ChecklistDo not edit the content of this comment. The PR reviewer should simply update this comment by ticking each review item below, as they get completed. Trivial Change
Code
Architecture
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick round of self-review of the trivial changes. Self-review of the meat coming soon.
@@ -178,7 +178,7 @@ Map<Resolvable<?>, Double> resolvableHeuristics(ReasonerPlanner.CallMode callMod | |||
} else if (res.isNegated()) { | |||
estimate = iterate(res.asNegated().disjunction().conjunctions()).map(conj -> { | |||
Set<Variable> conjVariables = estimateableVariables(conj.pattern().variables()); | |||
Set<Variable> commonVars = Collections.intersection(conjVariables, resVars); | |||
Set<Variable> commonVars = Collections.intersection(conjVariables, conjunctionNode.conjunction().pattern().variables()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bugfix: Can apply to development separately if needed.
@@ -73,7 +73,7 @@ public void testOwnership() { | |||
|
|||
benchmark.assertAnswerCountCorrect(); | |||
benchmark.assertRunningTime(500); | |||
benchmark.assertCounters(10, NACCESS, NACCESS + 1, NACCESS + 1); | |||
benchmark.assertCounters(10, NACCESS, 2, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improvement, apparently.
.forEach(callMode -> plan(callMode)); | ||
} | ||
|
||
public void planRoot(ResolvableConjunction conjunction) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New, neater interface imo
@@ -83,7 +83,8 @@ public void testMultipleStartingPoints() { | |||
|
|||
benchmark.assertAnswerCountCorrect(); | |||
benchmark.assertRunningTime(1000); | |||
benchmark.assertCounters(500, 17, 50, 120); | |||
benchmark.assertCountersBelow(500, 14, 100, 140, 1.0); | |||
benchmark.assertCountersAbove(12, 48, 80, 1.0); // TODO: Assert lower bound for CompoundStreams |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to separate the variance was too high to do it in a single number.
The variance is caused indeed by the new scaling, but it seems to be the correct behaviour, since the plans being generated are equal cost as far as the planner can see
(The specific case is the variables involved in the inter-changeable constraints are bound, hence have estimate 1, so even our multiplicative cost cannot differentiate the plans)
@@ -200,7 +197,7 @@ public void test_owns_based_estimate_two_concludables() { | |||
ResolvableConjunction conjunction = ResolvableConjunction.of(resolvedConjunction("{ $m isa man, has name $n; $p isa! person, has $n; }", transaction.logic())); | |||
|
|||
double answers = answerCountEstimator.estimateAnswers(conjunction, getVariablesByName(conjunction.pattern(), set("m", "p"))); | |||
assertEquals(4.0, answers); | |||
assertEquals(2.0, answers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the fixes here are just to reflect the better scaling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self review AnswerCountEstimator.
// We need to consider improved connectivity as a cause for propagation too. | ||
improvedVariableEstimates.merge(vertexEstimate.variable, | ||
Math.min(minVariableEstimate.getOrDefault(vertexEstimate.variable, Double.MAX_VALUE), vertexEstimate.estimate), | ||
Math::min); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When adding the subgraph corresponding to the modelsForResolvable
, collect any variables whose answer set sizes have reduced, and propagate the effects along the connectivity edges to other variables.
this.rolePlayerTypes.put(player.player(), getRoleType(player)); | ||
}); | ||
} | ||
private static class EdgeEstimate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Class for edges in the connectivity-graph.
Keeps an edge count, and node-estimates which can be scaled down when combining subgraphs and later used to compute the answer count estimate
nEdges = fromEstimate() != 0 ? (newFromEstimate / fromEstimate() * nEdges) : 0; | ||
nodeEstimates[idx] = newFromEstimate; | ||
nodeEstimates[other] = Math.min(nodeEstimates[other], nEdges); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we scale down the estimate for one of the vertices, we proportionally scale down the number of edges.
fringe.addAll(outgoingEdges.get(nextVar)); | ||
} | ||
return tree; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Starts at an arbitrary vertex, greedily adds the lowest connectivity edge to compute a tree which gives us a low answer count estimate.
- It may not find the tree which gives the minimal answer count
- The answer count is minimised for the set of all variables, which may not translate to a minimal answer count for the queried variables.
|
||
// Returns connectivity, upper-bound for connectivity | ||
// Something with the adaptive minVariableEstimate and the way we upper bound is being non-deterministic. | ||
private Pair<Double, Double> effectiveConnectivityFromMdst(Set<Variable> queryVariables, Map<Variable, Set<LocalModel.EdgeEstimate.Directed>> tree, Variable root, double inputConnectivity, LocalModel.EdgeEstimate.Directed incomingEdge) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the directed most of the (unprojected) tree, computes its weight considering upper-bounds resulting from projecting out unqueried variables .
nPowerN *= toCorrect.size(); | ||
} | ||
connectivity *= (permutations / nPowerN); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction for duplicated role-players
return localModelFactory.modelForIs(asThingConstraint.asIs(), correspondingConcludable); | ||
} else throw TypeDBException.of(ILLEGAL_STATE); | ||
} else if (constraint.isValue()) { | ||
return new LocalModel(new HashSet<>(localModelFactory.variableModelsBasedOnType(new HashSet<>(constraint.asValue().variables()), null)), set()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the remaining predicates to model
575e665
to
aed08ef
Compare
4ba4bf0
to
f7740e7
Compare
cc25b3f
to
7ffe482
Compare
441aef5
to
eefed55
Compare
What is the goal of this PR?
This PR introduces a 'connectivity-graph' abstraction to simplify how the reasoner planner's answer count estimator models the effect of constraints in a conjunction on each other.
What are the changes implemented in this PR?
The reasoner planner was originally designed to estimate the answer count of a conjunction of constraints as the greedy-cover of the estimates of the variables by the individual constraints. Later, we would scale down estimates based on the variables it shared with other constraints. e.g.: If a constraint A had an estimate of A.x for a variable
$x
, And a constraint B had an estimate of B.x, such that B.x < A.x; We would scale the model of A by a factor (B.X/A.x) - applying the selectivity of B to A.The drawback of the greedy-cover is that the connectivity constraints were local, and would not propagate across multiple resolvables.[1]
In the new implementation, we create a "connectivity-graph" representing the expected answer graph. The graph reflects the structure of the query, with each vertex (or edge) labelled with the expected number of corresponding vertices (or edges) in the answer graph. Further, we do a minimal-spanning-tree style estimate to replace the greedy set-cover - by construction of the connectivity graph, the connectivity constraints apply across multiple resolvables.
Constructing the connectivity graph
The AnswerCountEstimator has a model for each constraint in the conjunction. Each of these can be mapped to a connectivity subgraph, where the model gives us the labels of the vertices and edges. e.g., The model for a
has
concludable will have an estimate for each variable and for the number ofhas
edges between them. In the corresponding connectivity (sub)graphs, these will be the labels of the vertices and edges respectively.When we combine two connectivity (sub)graphs, we "scale" down the variable labels on each so they agree on the estimates of common variables. We then propagate this by scaling down the number of edges connected to the variable. If the number of edges is less than the vertex on the other end, we scale it down to the number of edges and propagate it further.
Estimating answer count from the graph
The answer count of a conjunction is the set of all (combinations) of answers to the variables in the conjunction.
If$v(x)$ is the label for the vertex corresponding to $e(x, y)$ is the label for the (undirected) edge between $c(x, y) = e(x, y)/x$ .
$x
in the graph, and$x
and$y
, We define the directed connectivity from$x
to$y
asFirst, consider the simplified problem of finding the answers to all variables (v/s some subset) for a query with a tree structure.
We could start an any vertex
$x
and compute the product of connectivities for each edge in the tree rooted at$x
(for the direction root -> leaf).$ answers = v(x) \times \prod_{y\in adj(x)} { c(x,y) \prod_{z \in adj(y)} {c(y,z) * ...} } $
Two complications arise:
$x
we are connected to might exceed the number of unique vertices possible for$x
(i.e. $v(x)$) (see [3]). For this, we have to cap the product we're computing at each vertex so it does not exceedNote that the minimum spanning tree for the set of all variables may not be the same for a subset of the variables. Ideally, we would find a combined solution to both the problems.
[1] if we had 15 each of $x, $y, and $z. One $x is connected to exactly one unique $y through a constraint (x,y) and one $y is connected to one unique $z through ($y,$z). The greedy cover would pick both (x,y) and (y,z) but would have an estimate of 15 from (x,y) and 15 for (y,z), yielding an answer of 225. The correct answer would be to see that each $x is connected to one $y and one $z in turn, for a total answer count of 15.
[2]: When querying the answers to {$x,$y} for a query-graph ($x-$y-$z), We may have one $x, connected to five $y, each connected to one $z. The minimal spanning tree would have a weight of 5, whereas the number of answers cannot exceed 1 (the only pair of $x, $y which exist in the database).