Measurement and Fairness: Next Steps for Fairlearn #707
-
Thanks for starting this discussion! One immediate comment: when it comes to "what format would work best for this", I think 'format' should be plural. We can present the same information in multiple ways and reach different audiences. Some people might like a section of the user guide that condenses the paper. Others might respond well to a short animated video (under five minutes, say; not that I'm volunteering to do the animations).
-
Yes, great and necessary paper and discussion! Thanks for starting this, Hilde, and for your important questions about how to make the information accessible and implementable. As a psychologist, I will say it's tricky and can be resource intensive, even though my field is so concerned with these principles and metrics.

While the social sciences are very mindful of construct reliability and validity in measurement development, a lot of that work has been "checked and balanced" by reviewers who assess study designs for funding and by reviewers on the journal publication end, so there are a lot of guardrails keeping us "honest" and rigorous. Closely related, we tend to have the luxury of designing studies and collecting data in a way that will let us answer the primary questions relating to validity and reliability. So this is my long-winded way of stating the obvious: it may be very challenging (or impossible) for data scientists to check all the boxes.

One way to generate and present different strategies might be to start with how much control, if any, the data scientist has in dictating the features that are being collected (new, ongoing data collection vs. an existing dataset).

Thinking of a typical approach in psychology, we usually start by conducting a literature search to see what is out there and whether there are already validated measures we can use in data collection. If there aren't any, we develop a new measure (e.g., a set of survey questions, or a task and coding scheme for behaviors). The literature review is still critical for identifying potential signals for attitudes or behaviors that might matter, or have evidence, for a given construct and should therefore be assessed by the new measure; this is usually what establishes the rationale for face and content validity.

When there is no established measure to compare your new measure against (or you don't have the data on it) and you can't establish criterion validity (e.g., convergent, concurrent, discriminant), you could at least conduct a confirmatory factor analysis for validity and compute Cronbach's alpha for internal consistency (reliability) of survey scale items, for example. These techniques, along with the literature review, could also be applied to an existing dataset where you think certain features or question items might hang together to represent a larger construct.

When we develop an observational measure (e.g., a human codes the behaviors and strategies a teacher uses in her instructional approach, or a human scores student performance across tasks), we also establish interrater reliability using Cohen's kappa or ICCs, so at least two people code or rate an overlapping sample of the participants' behaviors or performance. Ironically, even in some highly controlled studies the outcome label relies on only one person's judgment, like a psychologist diagnosing a mental health disorder, and low reliability and measurement error can exist even when an expert is involved. I think I recently read an ML imaging study that used a consensus rating from three radiologists to determine the outcome label for each patient... what a perfect world.

Examining test-retest reliability can be low-hanging fruit when you are designing the data collection process and it's not burdensome to users or participants to complete the same measure multiple times. In that scenario, you just need to determine what measurement intervals are meaningful for establishing stability/reliability.
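To make two of those checks concrete, here is a minimal, purely illustrative Python sketch with simulated data (the scale items, rater codes, and sample sizes are all made up): Cronbach's alpha for the internal consistency of a set of survey items, and Cohen's kappa for interrater agreement. A proper confirmatory factor analysis would need a dedicated factor-analysis/SEM package and isn't shown here.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score


def cronbach_alpha(items: pd.DataFrame) -> float:
    """Internal consistency of a set of scale items (rows = respondents, columns = items)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)


# Simulated survey scale: five items intended to tap a single latent construct.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
items = pd.DataFrame(
    {f"item_{i}": latent + rng.normal(scale=0.8, size=200) for i in range(1, 6)}
)
print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")

# Interrater reliability: two simulated raters coding the same 100 observations
# into three behavior categories; kappa corrects raw agreement for chance.
rater_a = rng.integers(0, 3, size=100)
rater_b = np.where(rng.random(100) < 0.8, rater_a, rng.integers(0, 3, size=100))
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```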
Obviously, test-retest is not possible if you are working with an existing dataset that didn't collect the features at more than one time point, or didn't do so at an interval that is useful for establishing meaningful stability/reliability.

In general, the minimum bar we try to meet in psychology when coming up with measures for explaining behavior is face and content validity, along with some construct reliability metric as appropriate; and I think these are testable even when you are stuck with an existing dataset. However, when we are working with clinical populations and trying to create measures to assess risk, diagnosis, or treatment response, the bar goes up quite a bit out of the gate, and at least some of the criterion validity metrics need to be addressed against existing "gold-standard" measures. We usually do get to design these clinical studies from scratch, so, while it's resource intensive, we get to check most, if not all, of the boxes.

I'm sure I have forgotten something really important; I'm just roughly laying out how the process sometimes plays out so people can think more about how it could be useful for data scientists. I think guides with different scenarios, and videos, could definitely help; it's also something I think we were trying to point out in some of the case scenarios we had worked on, like the Pymetrics hiring process.

Another thought: having representative data is one fundamental, early consideration in the ML pipeline that, of course, can also torpedo your measurement model and/or lead to very misleading inferences. Even when the data is representative, there is the possibility that a construct is not stable or uniform across everyone; the factor structure of your construct might be different for different groups. The features that comprise constructs like trust or quality of life might differ by cultural group, or, as another example, the factor structure of aggression might differ by sex. I can imagine that a complex construct like fairness may be similar. But that is a lot to worry about and its own rabbit hole. We've tried to be more mindful of that possibility in psychology when developing instruments and measures, but it is still lacking in the literature; it's hard work, especially because you can't examine it closely without large, representative samples.
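On that last point, a full measurement-invariance test (e.g., a multi-group confirmatory factor analysis) is beyond a quick check, but as a rough first look one could at least compare internal consistency within each group. A hypothetical sketch with simulated data (the group labels and items are invented, and this is only a crude proxy, not a substitute for an invariance test):

```python
import numpy as np
import pandas as pd


def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))


rng = np.random.default_rng(1)
n = 400
group = pd.Series(rng.choice(["group_a", "group_b"], size=n), name="group")
latent = rng.normal(size=n)

# Four items track the construct for everyone; item_5 is, by construction,
# unrelated to it for group_b, mimicking a construct that behaves differently
# across groups.
items = pd.DataFrame(
    {f"item_{i}": latent + rng.normal(scale=0.8, size=n) for i in range(1, 5)}
)
items["item_5"] = np.where(
    group == "group_a", latent + rng.normal(scale=0.8, size=n), rng.normal(size=n)
)

# Internal consistency computed separately per group; a noticeably lower alpha
# in one group is a hint (not proof) that the items don't hang together the
# same way for everyone.
for name, subset in items.groupby(group):
    print(f"{name}: alpha = {cronbach_alpha(subset):.2f}")
```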
-
Hi everybody!
I really enjoyed today's discussion on the measurement and fairness paper.
Perhaps we can already use this discussion to collect some of our thoughts about the paper and its implications for fairlearn before next week's meeting.
Some of my thoughts:
Because people involved in ML development typically come from a variety of backgrounds, one of my priorities would be to make any materials/tools as accessible as possible. So most of the questions I'm thinking about are related to ensuring the learning materials are suitable for the audience.