
[BUG] Bug in truth_space_table_from_labels_column/Calculations with Blocking Rules in Splink #2059

Open
RobinL opened this issue Mar 14, 2024 · 2 comments
Labels: bug (Something isn't working)

RobinL commented Mar 14, 2024

There's a bug in truth_space_table_from_labels_column that means the count of true negatives is incorrect, causing knock-on problems with various accuracy statistics.

Source of the bug:

  • Splink creates all comparisons according to the blocking rules specified
  • Splink adds to this list all remaining true matches that were not covered by the blocking rules
  • For all these rows, a match probability is computed, and a clerical match score is assigned (0 or 1, according to the labels)
  • It then compares the clerical match score (=1.0) to the estimated match_probability

This is used as the basis for all the calculations of TP, FP, etc., so the total number of rows ('labels') is based not on all possible combinations, but on all the comparisons made.

I think the challenges are:

  • How to treat links which the model would score correctly but which aren't retrieved by the blocking rules. That is, are we measuring the performance of the scoring algorithm, or of the combined blocking-and-scoring pipeline?
  • Some links not covered by the blocking rules could be false positives, but we can't compute them all, so we only score links covered by the blocking rules.

It looks like there's a bug in the calculation: it should override match_probability and set it to 0 when found_by_blocking_rules = False, but it doesn't, meaning it overestimates the TP count.
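A minimal pandas sketch of the fix described above, using a stand-in DataFrame rather than Splink's internal table (the column names are taken from the worked example in this issue):

```python
import pandas as pd

# Stand-in for the scored-comparisons table (columns as named in this issue).
scores = pd.DataFrame(
    {
        "found_by_blocking_rules": [True, False, True],
        "match_probability": [0.170001, 0.981285, 0.867624],
    }
)

# Proposed fix: a pair the blocking rules never generate cannot be predicted
# as a match, so force its effective score to 0 before computing TP/FP stats.
scores["match_probability"] = scores["match_probability"].where(
    scores["found_by_blocking_rules"], 0.0
)

print(scores["match_probability"].tolist())  # [0.170001, 0.0, 0.867624]
```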

Also note that linker.truth_space_table_from_labels_table doesn't suffer from the same problem, because there we can simply compute a score for each labelled pair and use those scores directly as the basis for TP, TN, FP and FN.
In practical applications, having fully labelled data is rare, which is probably why no one has worried too much about this before!

Example of labels column:

```python
import pandas as pd

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import exact_match
from splink.duckdb.linker import DuckDBLinker

pd.options.display.max_columns = 1000
df = splink_datasets.fake_1000.head(6)

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"])
    ],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
        exact_match("dob"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
}

linker = DuckDBLinker(df, settings)
linker.debug_mode = True
linker.truth_space_table_from_labels_column("cluster").as_pandas_dataframe()
```
Example of labels table:

```python
import pandas as pd

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import exact_match
from splink.duckdb.linker import DuckDBLinker

pd.options.display.max_columns = 1000
df = splink_datasets.fake_1000.head(6)
df

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"])
    ],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
        exact_match("dob"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
}

linker = DuckDBLinker(df, settings)
linker.predict().as_pandas_dataframe()

labels = [
    {"unique_id_l": 0, "unique_id_r": 1, "clerical_match_score": 1.0},
    {"unique_id_l": 1, "unique_id_r": 2, "clerical_match_score": 1.0},
]
lab = linker.register_labels_table(labels)
linker.truth_space_table_from_labels_table(lab).as_pandas_dataframe()
```
RobinL changed the title from "Bug in truth_space_table_from_labels_column/Calculations with Blocking Rules in Splink" to "[BUG] Bug in truth_space_table_from_labels_column/Calculations with Blocking Rules in Splink" on Apr 23, 2024
RobinL added the bug (Something isn't working) label on Apr 23, 2024
RobinL commented Apr 23, 2024

Step by step:

Consider data like:

| unique_id | first_name | surname | dob | city | email | cluster |
|---|---|---|---|---|---|---|
| 0 | Robert | Alan | 1971-06-24 | | robert255@smith.net | 0 |
| 1 | Robert | Allen | 1971-05-24 | | roberta25@smith.net | 0 |
| 2 | Gobert | Allen | 1971-06-24 | London | roberta25@smith.net | 0 |
| 3 | Robert | Alen | 1971-06-24 | Lonon | | 0 |
| 4 | Grace | | 1997-04-26 | Hull | grace.kelly52@jones.com | 1 |
| 5 | Grace | Kelly | 1991-04-26 | | grace.kelly52@jones.com | 1 |

Run a predict(), appending a cluster=cluster blocking rule

This gives us all rows recovered by blocking, plus all additional positive rows not recovered by blocking.

There are a lot of rows missed here. **All the missing pairs are correct true negatives**: they're not recovered by blocking, and they should be scored as 0.

Now have a table like:

| clerical_match_score | found_by_blocking_rules | match_weight | match_probability | unique_id_l | unique_id_r | first_name_l | first_name_r | cluster_l | cluster_r | match_key |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | True | -2.28757 | 0.170001 | 0 | 1 | Robert | Robert | 0 | 0 | 0 |
| 1 | False | -2.28757 | 0.170001 | 0 | 2 | Robert | Gobert | 0 | 0 | 1 |
| 1 | True | 2.71243 | 0.867624 | 1 | 3 | Robert | Robert | 0 | 0 | 0 |
| 1 | False | 5.71243 | 0.981285 | 2 | 3 | Gobert | Robert | 0 | 0 | 1 |
| 1 | True | 9.71243 | 0.998809 | 4 | 5 | Grace | Grace | 1 | 1 | 0 |
| 1 | True | 9.71243 | 0.998809 | 0 | 3 | Robert | Robert | 0 | 0 | 0 |
| 1 | False | 12.7124 | 0.999851 | 1 | 2 | Robert | Gobert | 0 | 0 | 1 |

Then add c_P and c_N, where c_P = clerical_positive ("this row is clerically labelled as a positive") and c_N = clerical_negative.

| c_P | c_N | clerical_match_score | found_by_blocking_rules | match_weight | unique_id_l | unique_id_r | match_key | truth_threshold |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | True | -2.28757 | 0 | 1 | 0 | -2.28757 |
| 1 | 0 | 1 | False | -2.28757 | 0 | 2 | 1 | -2.28757 |
| 1 | 0 | 1 | True | 2.71243 | 1 | 3 | 0 | 2.71243 |
| 1 | 0 | 1 | False | 5.71243 | 2 | 3 | 1 | 5.71243 |
| 1 | 0 | 1 | True | 9.71243 | 4 | 5 | 0 | 9.71243 |
| 1 | 0 | 1 | True | 9.71243 | 0 | 3 | 0 | 9.71243 |
| 1 | 0 | 1 | False | 12.7124 | 1 | 2 | 1 | 12.7124 |

Can now do something like:

| truth_threshold | num_records_in_row | clerical_positive | clerical_negative |
|---|---|---|---|
| -2.28757 | 2 | 2 | 0 |
| 2.71243 | 1 | 1 | 0 |
| 5.71243 | 1 | 1 | 0 |
| 9.71243 | 2 | 2 | 0 |
| 12.7124 | 1 | 1 | 0 |
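The aggregation above can be sketched in pandas; the data mirrors the worked example, and the output column names are the ones used in this comment:

```python
import pandas as pd

# One row per labelled comparison, with its threshold and c_P / c_N flags.
df = pd.DataFrame(
    {
        "truth_threshold": [-2.28757, -2.28757, 2.71243, 5.71243, 9.71243, 9.71243, 12.7124],
        "c_P": [1, 1, 1, 1, 1, 1, 1],
        "c_N": [0, 0, 0, 0, 0, 0, 0],
    }
)

# Group by threshold: count rows, and sum the positive/negative indicators.
grouped = (
    df.groupby("truth_threshold")
    .agg(
        num_records_in_row=("c_P", "size"),
        clerical_positive=("c_P", "sum"),
        clerical_negative=("c_N", "sum"),
    )
    .reset_index()
)
```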

Which says: "Of the comparisons scored at the truth threshold (Splink score threshold) of -2.28757, two are in fact matches."

Note there are no clerical negatives here. This is the fundamental problem.

!! We need to add to clerical_negative the (total number of comparisons - total clerical positive labels) !!
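Concretely, for the 6-record worked example the size of that adjustment would be (a sketch; the numbers come from the tables in this comment):

```python
# For a dedupe of n records, the total number of candidate pairs is n*(n-1)/2.
n_records = 6
total_possible_comparisons = n_records * (n_records - 1) // 2  # 15 pairs

# In the worked example, all 7 scored comparisons are clerical positives.
total_clerical_positives = 7

# Every remaining pair is a true non-match, so the clerical negative count
# should be topped up by this amount.
extra_clerical_negatives = total_possible_comparisons - total_clerical_positives
print(extra_clerical_negatives)  # 8
```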

| truth_threshold | cum_clerical_P | cum_clerical_N | total_clerical_P | total_clerical_N | row_count | N_labels | P_labels |
|---|---|---|---|---|---|---|---|
| -2.28757 | 7 | 0 | 7 | 0 | 7 | 0 | 7 |
| 2.71243 | 5 | 0 | 7 | 0 | 7 | 2 | 5 |
| 5.71243 | 4 | 0 | 7 | 0 | 7 | 3 | 4 |
| 9.71243 | 3 | 0 | 7 | 0 | 7 | 4 | 3 |
| 12.7124 | 1 | 0 | 7 | 0 | 7 | 6 | 1 |
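With the corrected negative counts, the confusion matrix at a given threshold would look like this (an illustrative sketch using the worked example's figures at threshold 2.71243):

```python
# 6 records give 15 candidate pairs; 7 are clerical positives, so the other
# 8 pairs (never generated by blocking) are the missing clerical negatives.
total_pairs = 15
clerical_positives = 7
clerical_negatives = total_pairs - clerical_positives  # 8, currently counted as 0

tp = 5                        # P_labels at this threshold, all true matches
fn = clerical_positives - tp  # 2 true matches scored below the threshold
fp = 0                        # no scored comparison is a clerical negative here
tn = clerical_negatives - fp  # 8 unscored pairs, implicitly scored at 0

assert tp + fn + fp + tn == total_pairs
```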

RobinL commented Apr 23, 2024

__splink__labels_with_pos_neg_grouped_with_stats_adj is being computed incorrectly: we end up with no true negatives and lots of false negatives, which is wrong.

To work out the correct adjustments, we need to properly understand and define the columns aliased as:

  • cum_clerical_P
  • cum_clerical_N
  • total_clerical_P
  • total_clerical_N
  • row_count
  • N_labels
  • P_labels

| Original Name | Descriptive Name |
|---|---|
| cum_clerical_P | cumulative_clerical_positives_at_or_above_threshold |
| cum_clerical_N | cumulative_clerical_negatives_below_threshold |
| total_clerical_P | total_clerical_positives |
| total_clerical_N | total_clerical_negatives |
| row_count | total_clerical_labels |
| N_labels | num_labels_scored_below_threshold |
| P_labels | num_labels_scored_at_or_above_threshold |
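The renaming above, expressed as a pandas column rename (the names are taken verbatim from this comment; `stats` is a hypothetical stand-in for the intermediate stats table):

```python
import pandas as pd

rename_map = {
    "cum_clerical_P": "cumulative_clerical_positives_at_or_above_threshold",
    "cum_clerical_N": "cumulative_clerical_negatives_below_threshold",
    "total_clerical_P": "total_clerical_positives",
    "total_clerical_N": "total_clerical_negatives",
    "row_count": "total_clerical_labels",
    "N_labels": "num_labels_scored_below_threshold",
    "P_labels": "num_labels_scored_at_or_above_threshold",
}

# Apply the descriptive names to an (empty) stand-in stats table.
stats = pd.DataFrame(columns=list(rename_map)).rename(columns=rename_map)
```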
