[BUG] Bug in truth_space_table_from_labels_column / Calculations with Blocking Rules in Splink #2059
Step by step. Consider data like:

Run a predict(), appending a cluster = cluster blocking rule. This gives us all rows recovered by blocking, plus all additional positive rows not recovered by blocking. A lot of rows are missed here, but all of the missing ones are correct: they are true negatives. They are not recovered by blocking, and they should be scored as 0. We now have a table like:
Then add c_P and c_N, where c_P = "this row is clerically labelled as a positive" (clerical_positive), and c_N is the clerical-negative counterpart.
We can now do something like:
This says: "for records scored at the truth threshold (Splink score threshold) of -2.28757, two of our comparisons are in fact matches". Note that there are no clerical negatives here. This is the fundamental problem. We need to add to:
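The per-threshold counts described above can be sketched in plain pandas. This is illustrative only: the column names (match_weight, c_P, c_N) and the data are assumptions mirroring the description, not Splink internals.

```python
import pandas as pd

# Hypothetical scored comparisons: match_weight is the Splink score,
# c_P / c_N are the clerical positive / negative indicators described above.
scored = pd.DataFrame({
    "match_weight": [-2.28757, -2.28757, 5.1, 8.3],
    "c_P": [1, 1, 1, 1],  # every surviving row is clerically positive...
    "c_N": [0, 0, 0, 0],  # ...so no clerical negatives exist at any threshold
})

threshold = -2.28757
above = scored["match_weight"] >= threshold
tp = int((above & (scored["c_P"] == 1)).sum())
fp = int((above & (scored["c_N"] == 1)).sum())
fn = int((~above & (scored["c_P"] == 1)).sum())
tn = int((~above & (scored["c_N"] == 1)).sum())
print(tp, fp, fn, tn)  # tn and fp are always 0: no clerical negatives remain
```

Because c_N is zero everywhere, true negatives can never be counted from this table, which is exactly the "fundamental problem" above.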
__splink__labels_with_pos_neg_grouped_with_stats_adj is being computed wrongly: we end up with no true negatives and lots of false negatives. We need to properly understand and define this table to work out the correct adjustments.
There's a bug in truth_space_table_from_labels_column that means the count of true negatives is incorrect, causing knock-on problems with various accuracy statistics.

Source of the bug: this is used as the basis for all the calculations of TP, FP, etc., so the total number of rows ('labels') is based not on all possible combinations, but on all the comparisons made.
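To see the scale of the discrepancy, compare all possible pairs against the pairs a blocking rule actually generates. The comparison count below is made up purely for illustration; only the pair-count arithmetic is exact.

```python
n_records = 1000
total_pairs = n_records * (n_records - 1) // 2
print(total_pairs)  # 499500 possible pairwise comparisons

# Blocking typically generates only a small fraction of these (figure made up):
n_compared = 5000

# Every never-compared pair was excluded by blocking; per the report these
# should count as true negatives, but totals based on comparisons omit them.
missing_from_totals = total_pairs - n_compared
print(missing_from_totals)  # 494500 pairs absent from the label totals
```

Basing the denominator on comparisons made rather than all combinations silently drops almost half a million true negatives in this example.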
I think the challenges are:
It looks like there's a bug in the calculation: it should override match_probability, setting it to 0 where found_by_blocking_rules = False, but it doesn't, meaning that it's overestimating the TP count.
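A sketch of the override described above, applied to a hypothetical dataframe of labelled comparisons. The column names are taken from the report's description; the dataframe itself is not a real Splink intermediate table.

```python
import pandas as pd

# Hypothetical labelled-comparison table with the columns the report mentions.
df = pd.DataFrame({
    "match_probability": [0.95, 0.80, 0.60],
    "found_by_blocking_rules": [True, False, True],
})

# Proposed fix: a pair not recovered by blocking would have scored 0 in a
# real predict() run, so force its match_probability to 0 before computing
# TP/FP/TN/FN.
df.loc[~df["found_by_blocking_rules"], "match_probability"] = 0.0
print(df["match_probability"].tolist())  # [0.95, 0.0, 0.6]
```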
Also note that you don't have the same problems with linker.truth_space_table_from_labels_table, because we can simply compute the score for each labelled pair and use that as the basis for TP, TN, FP, FN.

In practical applications, having fully labelled data is rare, which is probably why no one has worried too much about this before!
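For contrast, the labels-table path can be sketched directly: with fully labelled pairs, each pair's score and clerical label classify it unambiguously, so no denominator adjustment is needed. The threshold and data here are illustrative.

```python
import pandas as pd

# Fully labelled pairs: every pair has both a score and a clerical label.
pairs = pd.DataFrame({
    "match_probability": [0.99, 0.40, 0.05, 0.90],
    "clerical_match_score": [1.0, 1.0, 0.0, 0.0],
})

threshold = 0.5
pred_match = pairs["match_probability"] >= threshold
is_match = pairs["clerical_match_score"] == 1.0

tp = int((pred_match & is_match).sum())    # predicted match, truly a match
fn = int((~pred_match & is_match).sum())   # predicted non-match, truly a match
fp = int((pred_match & ~is_match).sum())   # predicted match, truly a non-match
tn = int((~pred_match & ~is_match).sum())  # predicted non-match, truly a non-match
print(tp, fn, fp, tn)  # 1 1 1 1
```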
Example of labels column
Example of labels table
```python
import pandas as pd

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import exact_match
from splink.duckdb.linker import DuckDBLinker
pd.options.display.max_columns = 1000
df = splink_datasets.fake_1000.head(6)
df
settings = {
"probability_two_random_records_match": 0.01,
"link_type": "dedupe_only",
"blocking_rules_to_generate_predictions": [
block_on(["first_name"])
],
"comparisons": [
exact_match("first_name"),
exact_match("surname"),
exact_match("dob"),
],
"retain_intermediate_calculation_columns": True,
"additional_columns_to_retain": ["cluster"],
}
linker = DuckDBLinker(df, settings)
linker.predict().as_pandas_dataframe()
labels = [
{"unique_id_l": 0, "unique_id_r": 1, "clerical_match_score": 1.0},
{"unique_id_l": 1, "unique_id_r": 2, "clerical_match_score": 1.0},
]
lab = linker.register_labels_table(labels)
linker.truth_space_table_from_labels_table(lab).as_pandas_dataframe()
```