Fair Unsupervised Learning (Clustering) Exploration #710

lurosenb · 2021-03-03T14:30:45Z


Hi everyone!

I’m relatively new to the fairlearn community, but I’m excited to start contributing!

I’ve been working on moving an effort forward to introduce some unsupervised learning algorithms to the fairlearn ecosystem, specifically fair clustering algorithms. For an awesome overview of the space, see Roman's review.

I put together a basic implementation/sample notebook for one of the fair clustering algorithms (fairlet k-means) on my fairlearn fork that replicates a well-known paper (the paper, my fork). However, after reaching out to Roman, Miro, and Hanna, it seems like it may be better to develop an incubation project (separate from fairlearn/fairlearn, perhaps fairlearn/fair-clustering or something) that centers on a specific application scenario for fair clustering.

There are more than a few potential applications for fair clustering algorithms, although pinpointing exact use cases can be tricky. To move this work forward initially, perhaps we could focus on opinion leaders and market segmentation, using samples drawn from open source social media, advertising, and review-based datasets (e.g. open source social graphs, reddit/twitter data, and online reviews of films, music, etc.).

In these scenarios, features of interest can often be protected classes like gender, race or income. Market segmentation might mean targeted advertising that gives a specific group unfair exposure to products and deals, or otherwise more attention as consumers. Identifying opinion leaders without balancing for the protected classes mentioned also appears fraught, since it can give some individuals undue influence.

Determining a compelling application for the many existing methods of fair clustering presents a significant opportunity (and challenge). I could not find an instance of someone deploying one of these fair clustering approaches in a real-world-ish context (though please let me know if I missed something.)

I had a discussion this past week with Ana Stoica, one of Augustin Chaintreau’s students, who is looking into fair clustering in her own right. She has a background in graph theory and social networks, and showed me some interesting work on segmentation problems (Segmentation problems), fair spectral clustering (Guarantees for Spectral Clustering with Fairness Constraints) and even pointed me to yet another (very recent) paper on fair k-means clustering (Socially Fair k-Means Clustering). She’s also very excited about potentially contributing to a fair clustering effort through fairlearn, which is great.

I’d love to hear thoughts from the community on the above proposal! Specifically:

Should we have an incubator project fairlearn/fairlearn-clustering?
What requirements would there be for such a project? (e.g. documentation, at least a single proper use case, maintainers)

Thanks so much everyone!

romanlutz · 2021-03-03T20:07:04Z

Maintainer

Thanks for sharing @lurosenb ! I personally am in favor of having this slightly separate to start with rather than in this repo, just to give it freedom to develop. If we add it here it's going to be part of the fairlearn package and we have to focus a lot more on APIs and documentation rather than figuring out the direction. If it's relevant for Fairlearn users we can always move it into the repo at a later point when it stabilizes, of course.

As you probably read from my review linked above in @lurosenb 's post I have some interest in this topic as well. Additionally, I've chatted with @matthklein in the past about such a project. Perhaps he's also interested (no pressure!).

The only two things that are really important to me here are
a) that we make sure it's oriented around real world use cases (rather than abstract and theoretical) and
b) that we have people willing to maintain it.

Your proposal indicates that you've thought about a) quite a bit, which is great!
For b) I'm willing to sign up as one of the maintainers, but there should be at least one or two more. @lurosenb is it fair to assume that you might be open to that?

I definitely would love to hear what others think as well, of course. I think this sort of thing requires steering committee approval @MiroDudik @hildeweerts @adrinjalali but all opinions are welcome, of course!

8 replies

hildeweerts Mar 4, 2021
Maintainer

I really like the enthusiasm @lurosenb!

I am not familiar with fair clustering, but it seems like something that makes perfect sense as an addition to Fairlearn. Like Adrin I also don't worry too much about the API.

My main concern is with connecting these techniques/metrics to real-world harms. The goal of clustering is usually much less well-defined (or at least more variable) than classification, which makes it even more tricky to make this connection explicit compared to classification.

I'd like us to be very careful about adding techniques simply because they exist. Not to be a party pooper, but just because something was peer-reviewed does not mean it is going to be useful to practitioners. I actually think that in this case it might be better to start with documentation/user guide rather than code. This will allow us to make it explicit when/how a technique or metric should be used and consequently whether it should be added to Fairlearn at all.

From Roman's review I think the demographic parity equivalent for clustering has the clearest connection to real world harms. I am a bit more wary of techniques that focus on "clustering cost". Clustering is notoriously difficult to evaluate. Although there exist formal metrics, these often do not correspond to real-world outcomes in the way that classification metrics do.
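To make the demographic parity notion concrete, here is a minimal Python sketch. It assumes the "balance" measure used in the fairlet line of work (for two groups, the smallest minority-to-majority ratio over all clusters), which may not be the exact formulation the review has in mind. The function name cluster_balance and the toy data are hypothetical and not part of Fairlearn.

```python
from collections import Counter, defaultdict


def cluster_balance(labels, groups):
    """Balance of a clustering for exactly two sensitive groups.

    For each cluster, take min(n_a / n_b, n_b / n_a); the clustering's
    balance is the minimum of that ratio over all clusters. A value of 1.0
    means every cluster is split evenly; 0.0 means some cluster contains
    only one group.
    """
    group_values = sorted(set(groups))
    if len(group_values) != 2:
        raise ValueError("this sketch handles exactly two groups")
    a, b = group_values

    # Count group membership within each cluster.
    counts = defaultdict(Counter)
    for label, group in zip(labels, groups):
        counts[label][group] += 1

    balances = []
    for per_cluster in counts.values():
        n_a, n_b = per_cluster[a], per_cluster[b]
        if n_a == 0 or n_b == 0:
            balances.append(0.0)  # cluster is fully segregated
        else:
            balances.append(min(n_a / n_b, n_b / n_a))
    return min(balances)


# Toy example (made-up data): cluster 0 is evenly split,
# cluster 1 has one "f" and three "m".
labels = [0, 0, 0, 0, 1, 1, 1, 1]
groups = ["f", "m", "f", "m", "f", "m", "m", "m"]
print(cluster_balance(labels, groups))
```

A number like this only says how mixed the clusters are. Whether a low value corresponds to an actual harm still depends on what the clusters are used for.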

TLDR: if we're going to do it, we're going to do it right 💪

lurosenb Mar 4, 2021
Author

Thanks so much everyone for the feedback (and openness)!

General consensus appears to be that it would make more sense for something like a well-formed, thought out feature for "fair clustering" to live in fairlearn/fairlearn (under a module labeled "experimental" during its development). Great to have reached some clarity there! Now it seems like the discussion has shifted to what a well-formed, thought out fair clustering offering even looks like.

I think Hilde brought up some really important concerns with introducing the work as is (thanks @hildeweerts!). Specifically, I appreciate her point highlighting the need to specify how and when a technique should be used. I would certainly want to include a careful consideration of practitioner use cases (caveats and all).

Perhaps introducing a method focused on demographic parity (like the modification presented in Fair Clustering via Equitable Group Representations from Roman's review) along with the fairlet method in my fork makes sense? That way, the sample motivating scenario can include a couple of options for clustering, with a balanced discussion (no pun intended) of their pros and cons.

Thoughts on this? Other approaches/paths forward?

(also @romanlutz to answer your early inquiry, yes, I'd be very happy to assist in maintaining this feature if added : ))

romanlutz Mar 5, 2021
Maintainer

[sorry, wrote this before seeing Lucas' response above]

Let me see whether I can summarize (including what's been shared on today's community call):

A separate repo is for things like typescript visualizations or Fairlearn in R or something that really doesn't make sense in the fairlearn repo.
There is no need to have a separate project. Staying within fairlearn/fairlearn is fine. We haven't quite addressed the module question yet, but it's probably worth considering the same name that sklearn uses, which would be cluster.
We don't need a fully worked out use case notebook, but there should be a clearly defined use case where the technique addresses real world harms. On the flipside, we want to avoid adding techniques without any connection to the real world (even if they were peer reviewed).

Is this accurate @hildeweerts @adrinjalali @MiroDudik ?

hildeweerts Mar 5, 2021
Maintainer

In case it wasn't clear: ❤️ means YES! :D

I'd be happy to be involved in any discussions on the connection between the method/underlying notion of fairness/real-world harms.

MiroDudik Mar 5, 2021
Maintainer

YES @romanlutz . That sounds perfect!
