Measurement and Fairness: Next Steps for Fairlearn #707
-
Thanks for starting this discussion! One immediate comment: when it comes to "what format would work best for this", I think 'format' should be plural. We can present the same information in multiple ways and reach different audiences. Some people might like a section of the user guide that condenses the paper. Others might respond well to a short animated video (under five minutes, say; not that I'm volunteering to do the animations).
-
Yes, great and necessary paper and discussion! Thanks for starting this, Hilde, and for your important questions about how to make the information accessible and implementable. As a psychologist, I will say it's tricky and can be resource intensive, even though my field is so concerned with these principles and metrics.

While the social sciences are very mindful of construct reliability and validity in measurement development, a lot of that work has been "checked and balanced" by reviewers who assess study designs for funding and by reviewers on the journal publication end, so there are a lot of guardrails keeping us "honest" and rigorous. Closely related, we tend to have the luxury of designing studies and collecting data in a way that will let us answer the primary questions relating to validity and reliability. So this is my long-winded way of stating the obvious: it may be very challenging (or impossible) for data scientists to check all the boxes.

One way to generate and present different strategies might be to start with how much control, if any, the data scientist has in dictating the features that are being collected (new, ongoing data collection vs. an existing dataset).

Thinking of a typical approach in psychology, we usually start by conducting a literature search to see what is out there and whether there are already validated measures we can use in data collection. If there aren't any, we develop a new measure (e.g., a set of survey questions, or a task and coding scheme for behaviors). The literature review is still critical for identifying potential signals for attitudes or behaviors that might matter, or have evidence, for a given construct and should therefore be assessed by the new measure; this is usually what establishes the rationale for face and content validity.

When there is no established measure to compare your new measure against (or you don't have the data on it) and you can't establish criterion validity (e.g., convergent, concurrent, discriminant), you could at least conduct a confirmatory factor analysis for validity and compute Cronbach's alpha for internal consistency (reliability) of survey scale items, for example. These techniques, along with the literature review, could also be applied to an existing dataset where you think certain features or question items might hang together to represent a larger construct.

When we develop an observational measure (e.g., a human codes the behaviors and strategies a teacher uses in her instructional approach, or a human scores student performance across tasks), we also establish interrater reliability using Cohen's kappa or ICCs, so at least two people code or rate an overlapping sample of the participants' behaviors or performance. Ironically, even in some highly controlled studies the outcome label relies on only one person's judgment, like a psychologist diagnosing a mental health disorder, and low reliability and measurement error can exist even when an expert is involved. I think I recently read an ML imaging study that used a consensus rating from three radiologists to determine the outcome label for each patient... what a perfect world.

Examining test-retest reliability can be low-hanging fruit when you are designing the data collection process and it's not burdensome to users or participants to complete the same measure multiple times. In that scenario, you just need to determine what measurement intervals are meaningful for establishing stability/reliability.
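To make two of those checks concrete, here is a minimal, purely illustrative Python sketch with simulated data (the scale items, rater codes, and sample sizes are all made up): Cronbach's alpha for the internal consistency of a set of survey items, and Cohen's kappa for interrater agreement. A proper confirmatory factor analysis would need a dedicated factor-analysis/SEM package and isn't shown here.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score


def cronbach_alpha(items: pd.DataFrame) -> float:
    """Internal consistency of a set of scale items (rows = respondents, columns = items)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)


# Simulated survey scale: five items intended to tap a single latent construct.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
items = pd.DataFrame(
    {f"item_{i}": latent + rng.normal(scale=0.8, size=200) for i in range(1, 6)}
)
print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")

# Interrater reliability: two simulated raters coding the same 100 observations
# into three behavior categories; kappa corrects raw agreement for chance.
rater_a = rng.integers(0, 3, size=100)
rater_b = np.where(rng.random(100) < 0.8, rater_a, rng.integers(0, 3, size=100))
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```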
Obviously, test-retest is not possible if you are working with an existing dataset that didn't collect the features at more than one time point, or didn't do so at an interval that is useful for establishing meaningful stability/reliability.

In general, the minimum bar we try to meet in psychology when coming up with measures for explaining behavior is face and content validity, along with some construct reliability metric as appropriate; and I think these are testable even when you are stuck with an existing dataset. However, when we are working with clinical populations and trying to create measures to assess risk, diagnosis, or treatment response, the bar goes up quite a bit out of the gate, and at least some of the criterion validity metrics need to be addressed against existing "gold-standard" measures. We usually do get to design these clinical studies from scratch, so, while it's resource intensive, we get to check most, if not all, of the boxes.

I'm sure I have forgotten something really important; I'm just roughly laying out how the process sometimes plays out so people can think more about how it could be useful for data scientists. I think guides with different scenarios, and videos, could definitely help; it's also something I think we were trying to point out in some of the case scenarios we had worked on, like the Pymetrics hiring process.

Another thought: having representative data is one fundamental, early consideration in the ML pipeline that, of course, can also torpedo your measurement model and/or lead to very misleading inferences. Even when the data is representative, there is the possibility that a construct is not stable or uniform across everyone; the factor structure of your construct might be different for different groups. The features that comprise constructs like trust or quality of life might differ by cultural group, or, as another example, the factor structure of aggression might differ by sex. I can imagine that a complex construct like fairness may be similar. But that is a lot to worry about and its own rabbit hole. We've tried to be more mindful of that possibility in psychology when developing instruments and measures, but it is still lacking in the literature; it's hard work, especially because you can't examine it closely without large, representative samples.
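On that last point, a full measurement-invariance test (e.g., a multi-group confirmatory factor analysis) is beyond a quick check, but as a rough first look one could at least compare internal consistency within each group. A hypothetical sketch with simulated data (the group labels and items are invented, and this is only a crude proxy, not a substitute for an invariance test):

```python
import numpy as np
import pandas as pd


def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))


rng = np.random.default_rng(1)
n = 400
group = pd.Series(rng.choice(["group_a", "group_b"], size=n), name="group")
latent = rng.normal(size=n)

# Four items track the construct for everyone; item_5 is, by construction,
# unrelated to it for group_b, mimicking a construct that behaves differently
# across groups.
items = pd.DataFrame(
    {f"item_{i}": latent + rng.normal(scale=0.8, size=n) for i in range(1, 5)}
)
items["item_5"] = np.where(
    group == "group_a", latent + rng.normal(scale=0.8, size=n), rng.normal(size=n)
)

# Internal consistency computed separately per group; a noticeably lower alpha
# in one group is a hint (not proof) that the items don't hang together the
# same way for everyone.
for name, subset in items.groupby(group):
    print(f"{name}: alpha = {cronbach_alpha(subset):.2f}")
```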
-
Hi everybody!
I really enjoyed today's discussion on the measurement and fairness paper.
Perhaps we can already use this discussion to collect some of our thoughts about the paper and its implications for fairlearn before next week's meeting.
Some of my thoughts:
Because people involved in ML development typically come from a variety of backgrounds, one of my priorities would be to make any materials/tools as accessible as possible. So most of the questions I'm thinking about are related to ensuring the learning materials are suitable for the audience.