Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise how reasoner planner does answer count estimates #6822

Draft
wants to merge 13 commits into
base: development
Choose a base branch
from

Conversation

krishnangovindraj
Copy link
Member

@krishnangovindraj krishnangovindraj commented Jun 8, 2023

What is the goal of this PR?

This PR introduces a 'connectivity-graph' abstraction to simplify how the reasoner planner's answer count estimator models the effect of constraints in a conjunction on each other.

What are the changes implemented in this PR?

The reasoner planner was originally designed to estimate the answer count of a conjunction of constraints as the greedy-cover of the estimates of the variables by the individual constraints. Later, we would scale down estimates based on the variables it shared with other constraints. e.g.: If a constraint A had an estimate of A.x for a variable $x, And a constraint B had an estimate of B.x, such that B.x < A.x; We would scale the model of A by a factor (B.X/A.x) - applying the selectivity of B to A.

The drawback of the greedy-cover is that the connectivity constraints were local, and would not propagate across multiple resolvables.[1]

In the new implementation, we create a "connectivity-graph" representing the expected answer graph. The graph reflects the structure of the query, with each vertex (or edge) labelled with the expected number of corresponding vertices (or edges) in the answer graph. Further, we do a minimal-spanning-tree style estimate to replace the greedy set-cover - by construction of the connectivity graph, the connectivity constraints apply across multiple resolvables.

Constructing the connectivity graph

The AnswerCountEstimator has a model for each constraint in the conjunction. Each of these can be mapped to a connectivity subgraph, where the model gives us the labels of the vertices and edges. e.g., The model for a has concludable will have an estimate for each variable and for the number of has edges between them. In the corresponding connectivity (sub)graphs, these will be the labels of the vertices and edges respectively.

When we combine two connectivity (sub)graphs, we "scale" down the variable labels on each so they agree on the estimates of common variables. We then propagate this by scaling down the number of edges connected to the variable. If the number of edges is less than the vertex on the other end, we scale it down to the number of edges and propagate it further.

Estimating answer count from the graph

The answer count of a conjunction is the set of all (combinations) of answers to the variables in the conjunction.

If $v(x)$ is the label for the vertex corresponding to $x in the graph, and $e(x, y)$ is the label for the (undirected) edge between $x and $y, We define the directed connectivity from $x to $y as $c(x, y) = e(x, y)/x$.

First, consider the simplified problem of finding the answers to all variables (v/s some subset) for a query with a tree structure.
We could start an any vertex $x and compute the product of connectivities for each edge in the tree rooted at $x (for the direction root -> leaf).

$ answers = v(x) \times \prod_{y\in adj(x)} { c(x,y) \prod_{z \in adj(y)} {c(y,z) * ...} } $

Two complications arise:

  1. When the query is not tree structured, neither is our connectivity graph, and there is not a unique tree. We ideally would do the same for the tree which yields the minimum result for the $answers$ . This corresponds to the minimum weight directed spanning tree. Since we're taking the products of connectivity, the weights of the edge have to be the log of the actual connectivities of the directed edges.
  2. When we are only querying the answer count of a subset of variables, we cannot simply multiply the connectivities because the estimate for how many vertices of variable $x we are connected to might exceed the number of unique vertices possible for $x (i.e. $v(x)$) (see [3]). For this, we have to cap the product we're computing at each vertex so it does not exceed $v(x)$ %

Note that the minimum spanning tree for the set of all variables may not be the same for a subset of the variables. Ideally, we would find a combined solution to both the problems.

[1] if we had 15 each of $x, $y, and $z. One $x is connected to exactly one unique $y through a constraint (x,y) and one $y is connected to one unique $z through ($y,$z). The greedy cover would pick both (x,y) and (y,z) but would have an estimate of 15 from (x,y) and 15 for (y,z), yielding an answer of 225. The correct answer would be to see that each $x is connected to one $y and one $z in turn, for a total answer count of 15.

[2]: When querying the answers to {$x,$y} for a query-graph ($x-$y-$z), We may have one $x, connected to five $y, each connected to one $z. The minimal spanning tree would have a weight of 5, whereas the number of answers cannot exceed 1 (the only pair of $x, $y which exist in the database).

@vaticle-bot
Copy link
Member

PR Review Checklist

Do not edit the content of this comment. The PR reviewer should simply update this comment by ticking each review item below, as they get completed.


Trivial Change

  • This change is trivial and does not require a code or architecture review.

Code

  • Packages, classes, and methods have a single domain of responsibility.
  • Packages, classes, and methods are grouped into cohesive and consistent domain model.
  • The code is canonical and the minimum required to achieve the goal.
  • Modules, libraries, and APIs are easy to use, robust (foolproof and not errorprone), and tested.
  • Logic and naming has clear narrative that communicates the accurate intent and responsibility of each module (e.g. method, class, etc.).
  • The code is algorithmically efficient and scalable for the whole application.

Architecture

  • Any required refactoring is completed, and the architecture does not introduce technical debt incidentally.
  • Any required build and release automations are updated and/or implemented.
  • Any new components follows a consistent style with respect to the pre-existing codebase.
  • The architecture intuitively reflects the application domain, and is easy to understand.
  • The architecture has a well-defined hierarchy of encapsulated components.
  • The architecture is extensible and scalable.

Copy link
Member Author

@krishnangovindraj krishnangovindraj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick round of self-review of the trivial changes. Self-review of the meat coming soon.

@@ -178,7 +178,7 @@ Map<Resolvable<?>, Double> resolvableHeuristics(ReasonerPlanner.CallMode callMod
} else if (res.isNegated()) {
estimate = iterate(res.asNegated().disjunction().conjunctions()).map(conj -> {
Set<Variable> conjVariables = estimateableVariables(conj.pattern().variables());
Set<Variable> commonVars = Collections.intersection(conjVariables, resVars);
Set<Variable> commonVars = Collections.intersection(conjVariables, conjunctionNode.conjunction().pattern().variables());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bugfix: Can apply to development separately if needed.

@@ -73,7 +73,7 @@ public void testOwnership() {

benchmark.assertAnswerCountCorrect();
benchmark.assertRunningTime(500);
benchmark.assertCounters(10, NACCESS, NACCESS + 1, NACCESS + 1);
benchmark.assertCounters(10, NACCESS, 2, 2);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improvement, apparently.

.forEach(callMode -> plan(callMode));
}

public void planRoot(ResolvableConjunction conjunction) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New, neater interface imo

@@ -83,7 +83,8 @@ public void testMultipleStartingPoints() {

benchmark.assertAnswerCountCorrect();
benchmark.assertRunningTime(1000);
benchmark.assertCounters(500, 17, 50, 120);
benchmark.assertCountersBelow(500, 14, 100, 140, 1.0);
benchmark.assertCountersAbove(12, 48, 80, 1.0); // TODO: Assert lower bound for CompoundStreams
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to separate the variance was too high to do it in a single number.

The variance is caused indeed by the new scaling, but it seems to be the correct behaviour, since the plans being generated are equal cost as far as the planner can see
(The specific case is the variables involved in the inter-changeable constraints are bound, hence have estimate 1, so even our multiplicative cost cannot differentiate the plans)

@@ -200,7 +197,7 @@ public void test_owns_based_estimate_two_concludables() {
ResolvableConjunction conjunction = ResolvableConjunction.of(resolvedConjunction("{ $m isa man, has name $n; $p isa! person, has $n; }", transaction.logic()));

double answers = answerCountEstimator.estimateAnswers(conjunction, getVariablesByName(conjunction.pattern(), set("m", "p")));
assertEquals(4.0, answers);
assertEquals(2.0, answers);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the fixes here are just to reflect the better scaling.

Copy link
Member Author

@krishnangovindraj krishnangovindraj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self review AnswerCountEstimator.

// We need to consider improved connectivity as a cause for propagation too.
improvedVariableEstimates.merge(vertexEstimate.variable,
Math.min(minVariableEstimate.getOrDefault(vertexEstimate.variable, Double.MAX_VALUE), vertexEstimate.estimate),
Math::min);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When adding the subgraph corresponding to the modelsForResolvable, collect any variables whose answer set sizes have reduced, and propagate the effects along the connectivity edges to other variables.

this.rolePlayerTypes.put(player.player(), getRoleType(player));
});
}
private static class EdgeEstimate {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class for edges in the connectivity-graph.
Keeps an edge count, and node-estimates which can be scaled down when combining subgraphs and later used to compute the answer count estimate

nEdges = fromEstimate() != 0 ? (newFromEstimate / fromEstimate() * nEdges) : 0;
nodeEstimates[idx] = newFromEstimate;
nodeEstimates[other] = Math.min(nodeEstimates[other], nEdges);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we scale down the estimate for one of the vertices, we proportionally scale down the number of edges.

fringe.addAll(outgoingEdges.get(nextVar));
}
return tree;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starts at an arbitrary vertex, greedily adds the lowest connectivity edge to compute a tree which gives us a low answer count estimate.

  1. It may not find the tree which gives the minimal answer count
  2. The answer count is minimised for the set of all variables, which may not translate to a minimal answer count for the queried variables.


// Returns connectivity, upper-bound for connectivity
// Something with the adaptive minVariableEstimate and the way we upper bound is being non-deterministic.
private Pair<Double, Double> effectiveConnectivityFromMdst(Set<Variable> queryVariables, Map<Variable, Set<LocalModel.EdgeEstimate.Directed>> tree, Variable root, double inputConnectivity, LocalModel.EdgeEstimate.Directed incomingEdge) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the directed most of the (unprojected) tree, computes its weight considering upper-bounds resulting from projecting out unqueried variables .

nPowerN *= toCorrect.size();
}
connectivity *= (permutations / nPowerN);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction for duplicated role-players

return localModelFactory.modelForIs(asThingConstraint.asIs(), correspondingConcludable);
} else throw TypeDBException.of(ILLEGAL_STATE);
} else if (constraint.isValue()) {
return new LocalModel(new HashSet<>(localModelFactory.variableModelsBasedOnType(new HashSet<>(constraint.asValue().variables()), null)), set());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the remaining predicates to model

@flyingsilverfin flyingsilverfin added this to the 2.19.0 milestone Jun 16, 2023
@flyingsilverfin flyingsilverfin marked this pull request as draft June 21, 2023 12:02
@krishnangovindraj krishnangovindraj force-pushed the answer-count-estimator/bounded-mst-estimate branch from 575e665 to aed08ef Compare July 7, 2023 13:12
@krishnangovindraj krishnangovindraj marked this pull request as ready for review July 7, 2023 15:00
@krishnangovindraj krishnangovindraj marked this pull request as draft July 7, 2023 15:00
@flyingsilverfin flyingsilverfin modified the milestones: 2.19.0, 2.20.0 Jul 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants