
[BUG] Bug in truth_space_table_from_labels_column/Calculations with Blocking Rules in Splink #2059

Open
RobinL opened this issue Mar 14, 2024 · 2 comments
Labels: bug (Something isn't working)

RobinL commented Mar 14, 2024

There's a bug in truth_space_table_from_labels_column that means the count of true negatives is incorrect, causing knock-on problems with various accuracy statistics.

Source of the bug:

  • Splink creates all comparisons according to the blocking rules specified
  • Splink adds to this list all remaining true matches that were not covered by the blocking rules
  • For all these rows, a match probability is computed, and a clerical match score is assigned (0 or 1, according to the labels)
  • It then compares the clerical match score (=1.0) to the estimated match_probability

This is used as the basis for all the calculations of TP, FP, etc., so the total number of rows ('labels') is based not on all possible combinations, but on all the comparisons made.

I think the challenges are:

  • How to treat links which the model would score correctly but which aren't retrieved by the blocking rules. That is, are we measuring the performance of the scoring algorithm, or of the combined blocking-and-scoring pipeline?
  • Some links not covered by the blocking rules could be false positives, but we can't compute them all, so we only score links covered by the blocking rules.

It looks like there's a bug in the calculation: it should override match_probability and set it to 0 when found_by_blocking_rules = False, but it doesn't, meaning it overestimates the TP count.
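A minimal pandas sketch of the fix described above, using a stand-in DataFrame rather than Splink's internal table (the column names are taken from the worked example in this issue):

```python
import pandas as pd

# Stand-in for the scored-comparisons table (columns as named in this issue).
scores = pd.DataFrame(
    {
        "found_by_blocking_rules": [True, False, True],
        "match_probability": [0.170001, 0.981285, 0.867624],
    }
)

# Proposed fix: a pair the blocking rules never generate cannot be predicted
# as a match, so force its effective score to 0 before computing TP/FP stats.
scores["match_probability"] = scores["match_probability"].where(
    scores["found_by_blocking_rules"], 0.0
)

print(scores["match_probability"].tolist())  # [0.170001, 0.0, 0.867624]
```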

Also note that linker.truth_space_table_from_labels_table doesn't suffer from the same problem, because there we can simply compute a score for each labelled pair and use those scores directly as the basis for TP, TN, FP and FN.
In practical applications, having fully labelled data is rare, which is probably why no one has worried too much about this before!

Example of labels column:

```python
import pandas as pd

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import exact_match
from splink.duckdb.linker import DuckDBLinker

pd.options.display.max_columns = 1000
df = splink_datasets.fake_1000.head(6)

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"])
    ],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
        exact_match("dob"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
}

linker = DuckDBLinker(df, settings)
linker.debug_mode = True
linker.truth_space_table_from_labels_column("cluster").as_pandas_dataframe()
```
Example of labels table:

```python
import pandas as pd

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import exact_match
from splink.duckdb.linker import DuckDBLinker

pd.options.display.max_columns = 1000
df = splink_datasets.fake_1000.head(6)
df

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"])
    ],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
        exact_match("dob"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
}

linker = DuckDBLinker(df, settings)
linker.predict().as_pandas_dataframe()

labels = [
    {"unique_id_l": 0, "unique_id_r": 1, "clerical_match_score": 1.0},
    {"unique_id_l": 1, "unique_id_r": 2, "clerical_match_score": 1.0},
]
lab = linker.register_labels_table(labels)
linker.truth_space_table_from_labels_table(lab).as_pandas_dataframe()
```
RobinL changed the title from "Bug in truth_space_table_from_labels_column/Calculations with Blocking Rules in Splink" to "[BUG] Bug in truth_space_table_from_labels_column/Calculations with Blocking Rules in Splink" on Apr 23, 2024
RobinL added the bug (Something isn't working) label on Apr 23, 2024
RobinL commented Apr 23, 2024

Step by step:

Consider data like:

| unique_id | first_name | surname | dob | city | email | cluster |
|---|---|---|---|---|---|---|
| 0 | Robert | Alan | 1971-06-24 | | robert255@smith.net | 0 |
| 1 | Robert | Allen | 1971-05-24 | | roberta25@smith.net | 0 |
| 2 | Gobert | Allen | 1971-06-24 | London | roberta25@smith.net | 0 |
| 3 | Robert | Alen | 1971-06-24 | Lonon | | 0 |
| 4 | Grace | | 1997-04-26 | Hull | grace.kelly52@jones.com | 1 |
| 5 | Grace | Kelly | 1991-04-26 | | grace.kelly52@jones.com | 1 |

Run a predict(), appending a cluster=cluster blocking rule

This gives us all rows recovered by blocking, plus all additional positive rows not recovered by blocking.

There are a lot of rows missed here. **All the missing pairs are correct true negatives**: they're not recovered by blocking, and they should be scored as 0.

Now have a table like:

| clerical_match_score | found_by_blocking_rules | match_weight | match_probability | unique_id_l | unique_id_r | first_name_l | first_name_r | cluster_l | cluster_r | match_key |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | True | -2.28757 | 0.170001 | 0 | 1 | Robert | Robert | 0 | 0 | 0 |
| 1 | False | -2.28757 | 0.170001 | 0 | 2 | Robert | Gobert | 0 | 0 | 1 |
| 1 | True | 2.71243 | 0.867624 | 1 | 3 | Robert | Robert | 0 | 0 | 0 |
| 1 | False | 5.71243 | 0.981285 | 2 | 3 | Gobert | Robert | 0 | 0 | 1 |
| 1 | True | 9.71243 | 0.998809 | 4 | 5 | Grace | Grace | 1 | 1 | 0 |
| 1 | True | 9.71243 | 0.998809 | 0 | 3 | Robert | Robert | 0 | 0 | 0 |
| 1 | False | 12.7124 | 0.999851 | 1 | 2 | Robert | Gobert | 0 | 0 | 1 |

Then add c_P and c_N, where c_P = clerical_positive ("this row is clerically labelled as a positive") and c_N = clerical_negative.

| c_P | c_N | clerical_match_score | found_by_blocking_rules | match_weight | unique_id_l | unique_id_r | match_key | truth_threshold |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | True | -2.28757 | 0 | 1 | 0 | -2.28757 |
| 1 | 0 | 1 | False | -2.28757 | 0 | 2 | 1 | -2.28757 |
| 1 | 0 | 1 | True | 2.71243 | 1 | 3 | 0 | 2.71243 |
| 1 | 0 | 1 | False | 5.71243 | 2 | 3 | 1 | 5.71243 |
| 1 | 0 | 1 | True | 9.71243 | 4 | 5 | 0 | 9.71243 |
| 1 | 0 | 1 | True | 9.71243 | 0 | 3 | 0 | 9.71243 |
| 1 | 0 | 1 | False | 12.7124 | 1 | 2 | 1 | 12.7124 |

Can now do something like:

| truth_threshold | num_records_in_row | clerical_positive | clerical_negative |
|---|---|---|---|
| -2.28757 | 2 | 2 | 0 |
| 2.71243 | 1 | 1 | 0 |
| 5.71243 | 1 | 1 | 0 |
| 9.71243 | 2 | 2 | 0 |
| 12.7124 | 1 | 1 | 0 |
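The aggregation above can be sketched in pandas; the data mirrors the worked example, and the output column names are the ones used in this comment:

```python
import pandas as pd

# One row per labelled comparison, with its threshold and c_P / c_N flags.
df = pd.DataFrame(
    {
        "truth_threshold": [-2.28757, -2.28757, 2.71243, 5.71243, 9.71243, 9.71243, 12.7124],
        "c_P": [1, 1, 1, 1, 1, 1, 1],
        "c_N": [0, 0, 0, 0, 0, 0, 0],
    }
)

# Group by threshold: count rows, and sum the positive/negative indicators.
grouped = (
    df.groupby("truth_threshold")
    .agg(
        num_records_in_row=("c_P", "size"),
        clerical_positive=("c_P", "sum"),
        clerical_negative=("c_N", "sum"),
    )
    .reset_index()
)
```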

Which says: "Of the comparisons scored at the truth threshold (Splink score threshold) of -2.28757, two are in fact matches."

Note there are no clerical negatives here. This is the fundamental problem.

!! We need to add to clerical_negative the (total number of comparisons - total clerical positive labels) !!
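Concretely, for the 6-record worked example the size of that adjustment would be (a sketch; the numbers come from the tables in this comment):

```python
# For a dedupe of n records, the total number of candidate pairs is n*(n-1)/2.
n_records = 6
total_possible_comparisons = n_records * (n_records - 1) // 2  # 15 pairs

# In the worked example, all 7 scored comparisons are clerical positives.
total_clerical_positives = 7

# Every remaining pair is a true non-match, so the clerical negative count
# should be topped up by this amount.
extra_clerical_negatives = total_possible_comparisons - total_clerical_positives
print(extra_clerical_negatives)  # 8
```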

| truth_threshold | cum_clerical_P | cum_clerical_N | total_clerical_P | total_clerical_N | row_count | N_labels | P_labels |
|---|---|---|---|---|---|---|---|
| -2.28757 | 7 | 0 | 7 | 0 | 7 | 0 | 7 |
| 2.71243 | 5 | 0 | 7 | 0 | 7 | 2 | 5 |
| 5.71243 | 4 | 0 | 7 | 0 | 7 | 3 | 4 |
| 9.71243 | 3 | 0 | 7 | 0 | 7 | 4 | 3 |
| 12.7124 | 1 | 0 | 7 | 0 | 7 | 6 | 1 |
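With the corrected negative counts, the confusion matrix at a given threshold would look like this (an illustrative sketch using the worked example's figures at threshold 2.71243):

```python
# 6 records give 15 candidate pairs; 7 are clerical positives, so the other
# 8 pairs (never generated by blocking) are the missing clerical negatives.
total_pairs = 15
clerical_positives = 7
clerical_negatives = total_pairs - clerical_positives  # 8, currently counted as 0

tp = 5                        # P_labels at this threshold, all true matches
fn = clerical_positives - tp  # 2 true matches scored below the threshold
fp = 0                        # no scored comparison is a clerical negative here
tn = clerical_negatives - fp  # 8 unscored pairs, implicitly scored at 0

assert tp + fn + fp + tn == total_pairs
```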

RobinL commented Apr 23, 2024

__splink__labels_with_pos_neg_grouped_with_stats_adj is being computed incorrectly: we end up with no true negatives and lots of false negatives, which is wrong.

To work out the correct adjustments, we need to properly understand and define the columns aliased as:

  • cum_clerical_P
  • cum_clerical_N
  • total_clerical_P
  • total_clerical_N
  • row_count
  • N_labels
  • P_labels

| Original Name | Descriptive Name |
|---|---|
| cum_clerical_P | cumulative_clerical_positives_at_or_above_threshold |
| cum_clerical_N | cumulative_clerical_negatives_below_threshold |
| total_clerical_P | total_clerical_positives |
| total_clerical_N | total_clerical_negatives |
| row_count | total_clerical_labels |
| N_labels | num_labels_scored_below_threshold |
| P_labels | num_labels_scored_at_or_above_threshold |
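The renaming above, expressed as a pandas column rename (the names are taken verbatim from this comment; `stats` is a hypothetical stand-in for the intermediate stats table):

```python
import pandas as pd

rename_map = {
    "cum_clerical_P": "cumulative_clerical_positives_at_or_above_threshold",
    "cum_clerical_N": "cumulative_clerical_negatives_below_threshold",
    "total_clerical_P": "total_clerical_positives",
    "total_clerical_N": "total_clerical_negatives",
    "row_count": "total_clerical_labels",
    "N_labels": "num_labels_scored_below_threshold",
    "P_labels": "num_labels_scored_at_or_above_threshold",
}

# Apply the descriptive names to an (empty) stand-in stats table.
stats = pd.DataFrame(columns=list(rename_map)).rename(columns=rename_map)
```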
