Awesome PrivEx: Privacy-Preserving Explainable AI (PPXAI)

A collection of academic articles, published methodologies, and datasets on the subject of Privacy-Preserving Explainable AI.

A sortable version is available here: https://awesome-privex.github.io/

Please read and cite our paper: arXiv:2404.00673

Nguyen, T.T., Huynh, T.T., Ren, Z., Nguyen, T.T., Nguyen, P.L., Yin, H. and Nguyen, Q.V.H., 2024. A Survey of Privacy-Preserving Model Explanations: Privacy Risks, Attacks, and Countermeasures. arXiv preprint arXiv:2404.00673.

Citation

@article{nguyen2024survey,
  title={A Survey of Privacy-Preserving Model Explanations: Privacy Risks, Attacks, and Countermeasures},
  author={Nguyen, Thanh Tam and Huynh, Thanh Trung and Ren, Zhao and Nguyen, Thanh Toan and Nguyen, Phi Le and Yin, Hongzhi and Nguyen, Quoc Viet Hung},
  journal={arXiv preprint arXiv:2404.00673},
  year={2024}
}

Existing Surveys

| Paper Title | Venue | Year |
| --- | --- | --- |
| Adversarial attacks and defenses in explainable artificial intelligence: A survey | Information Fusion | 2024 |
| A Survey of Privacy Attacks in Machine Learning | CSUR | 2023 |
| SoK: Taming the Triangle -- On the Interplays between Fairness, Interpretability and Privacy in Machine Learning | arXiv | 2023 |
| When Machine Learning Meets Privacy: A Survey and Outlook | CSUR | 2021 |
| Explaining Explanations: An Overview of Interpretability of Machine Learning | DSAA | 2018 |
| A Survey of Methods for Explaining Black Box Models | CSUR | 2018 |
| Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning | CCS | 2018 |

Taxonomy

[Taxonomy figure]


Approaches

| Title | Year | Venue | Target Explanations | Attacks | Defenses | Code |
| --- | --- | --- | --- | --- | --- | --- |
| Please Tell Me More: Privacy Impact of Explainability through the Lens of Membership Inference Attack | 2024 | SP | Feature-based | Membership Inference | Differential Privacy, Privacy-Preserving Models, DP-SGD | - |
| On the Privacy Risks of Algorithmic Recourse | 2023 | AISTATS | Counterfactual | Membership Inference | Differential Privacy | - |
| The Privacy Issue of Counterfactual Explanations: Explanation Linkage Attacks | 2023 | TIST | Counterfactual | Linkage | Anonymisation | - |
| Feature-based Learning for Diverse and Privacy-Preserving Counterfactual Explanations | 2023 | KDD | Counterfactual | - | Perturbation | [Code] |
| Private Graph Extraction via Feature Explanations | 2023 | PETS | Feature-based | Graph Extraction | Perturbation | [Code] |
| Privacy-Preserving Algorithmic Recourse | 2023 | ICAIF | Counterfactual | - | Differential Privacy | - |
| Accurate, Explainable, and Private Models: Providing Recourse While Minimizing Training Data Leakage | 2023 | ICML-Workshop | Counterfactual | Membership Inference | Differential Privacy | - |
| Probabilistic Dataset Reconstruction from Interpretable Models | 2023 | arXiv | Interpretable Surrogates | Data Reconstruction | - | [Code] |
| DeepFixCX: Explainable privacy-preserving image compression for medical image analysis | 2023 | WIREs-DMKD | Case-based | Identity recognition | Anonymisation | [Code] |
| XorSHAP: Privacy-Preserving Explainable AI for Decision Tree Models | 2023 | Preprint | Shapley | - | Multi-party Computation | - |
| DP-XAI | 2023 | Github | ALE plot | - | Differential Privacy | [Code] |
| Inferring Sensitive Attributes from Model Explanations | 2022 | CIKM | Gradient-based, Perturbation-based | Attribute Inference | - | [Code] |
| Model explanations with differential privacy | 2022 | FAccT | Feature-based | - | Differential Privacy | - |
| DualCF: Efficient Model Extraction Attack from Counterfactual Explanations | 2022 | FAccT | Counterfactual | Model Extraction | - | - |
| Feature Inference Attack on Shapley Values | 2022 | CCS | Shapley | Attribute/Feature Inference | Low-dimensional | - |
| Evaluating the privacy exposure of interpretable global explainers, Privacy Risk of Global Explainers | 2022 | CogMI | Interpretable Surrogates | Membership Inference | - | - |
| Privacy-Preserving Case-Based Explanations: Enabling Visual Interpretability by Protecting Privacy | 2022 | IEEE Access | Example-based | - | Anonymisation | - |
| On the amplification of security and privacy risks by post-hoc explanations in machine learning models | 2022 | arXiv | Feature-based | Membership Inference | - | - |
| Differentially Private Counterfactuals via Functional Mechanism | 2022 | arXiv | Counterfactual | - | Differential Privacy | - |
| Differentially Private Shapley Values for Data Evaluation | 2022 | arXiv | Shapley | - | Differential Privacy | [Code] |
| Exploiting Explanations for Model Inversion Attacks | 2021 | ICCV | Gradient-based, Interpretable Surrogates | Model Inversion | - | - |
| On the Privacy Risks of Model Explanations | 2021 | AIES | Feature-based, Shapley, Counterfactual | Membership Inference | - | - |
| Adversarial XAI Methods in Cybersecurity | 2021 | TIFS | Counterfactual | Membership Inference | - | - |
| MEGEX: Data-Free Model Extraction Attack against Gradient-Based Explainable AI | 2021 | arXiv | Gradient-based | Model Extraction | - | [Code] |
| Robust Counterfactual Explanations for Privacy-Preserving SVM, Robust Explanations for Private Support Vector Machines | 2021 | ICML-Workshop | Counterfactual | - | Private SVM | [Code] |
| When Differential Privacy Meets Interpretability: A Case Study | 2021 | RCV-CVPR | Interpretable Models | - | Differential Privacy | - |
| Differentially Private Quantiles | 2021 | ICML | Quantiles | - | Differential Privacy | [Code] |
| FOX: Fooling with Explanations: Privacy Protection with Adversarial Reactions in Social Media | 2021 | PST | - | Attribute Inference | Privacy-Protecting Explanation | - |
| Privacy-preserving generative adversarial network for case-based explainability in medical image analysis | 2021 | IEEE Access | Example-based | - | Generative Anonymisation | - |
| Interpretable and Differentially Private Predictions | 2020 | AAAI | Locally linear maps | - | Differential Privacy | [Code] |
| Model extraction from counterfactual explanations | 2020 | arXiv | Counterfactual | Model Extraction | - | [Code] |
| Model Reconstruction from Model Explanations | 2019 | FAT* | Gradient-based | Model Reconstruction, Model Extraction | - | - |
| Interpret Federated Learning with Shapley Values | 2019 | __ | Shapley | - | Federated | [Code] |
| Collaborative Explanation of Deep Models with Limited Interaction for Trade Secret and Privacy Preservation | 2019 | WWW | Feature-based | - | Collaborative rule-based model | - |
| Model inversion attacks that exploit confidence information and basic countermeasures | 2015 | CCS | Confidence scores | Reconstruction, Model Inversion | - | - |
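Many of the attacks listed above share a common recipe: compute a statistic over the released explanation and threshold it to decide something private (e.g., training-set membership). The snippet below is a minimal, purely illustrative sketch of that pattern; it is not the method of any specific paper in the table. The variance statistic, the threshold calibration, the function names, and the synthetic attributions are all assumptions made for illustration.

```python
import numpy as np


def explanation_variance(attributions: np.ndarray) -> np.ndarray:
    """Per-sample variance of feature attributions (one plausible attack statistic)."""
    return attributions.var(axis=1)


def membership_inference_attack(attr_candidates: np.ndarray,
                                attr_reference: np.ndarray,
                                quantile: float = 0.5) -> np.ndarray:
    """Toy threshold attack (illustrative only): flag a candidate as a training
    member when the variance of its explanation falls below a threshold
    calibrated on explanations of known non-members.

    attr_candidates: (n_candidates, n_features) attributions for the target points
    attr_reference:  (n_reference, n_features) attributions for known non-members
    """
    threshold = np.quantile(explanation_variance(attr_reference), quantile)
    return explanation_variance(attr_candidates) < threshold  # True = predicted member


# Synthetic attributions, assuming members receive lower-variance explanations
# than non-members (an assumption for demonstration, not a general fact).
rng = np.random.default_rng(0)
members = rng.normal(0.0, 0.5, size=(100, 20))
non_members = rng.normal(0.0, 1.0, size=(100, 20))
predictions = membership_inference_attack(np.vstack([members, non_members]),
                                           attr_reference=non_members)
print("Predicted member rate:", predictions.mean())
```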

Datasets

Type: Image

| Dataset | #Items | Disk Size | Downstream Explanations | #Papers Used |
| --- | --- | --- | --- | --- |
| MNIST | 70K | 11MB | Counterfactuals, Gradient | 4 |
| CIFAR | 60K | 163MB | Gradient | 4 |
| SVHN | 600K | 400MB+ | Gradient | 1 |
| Food101 | 100K+ | 10GB | Case-based | 1 |
| Flowers102 | 8K+ | 300MB+ | Case-based | 1 |
| Cervical | 8K+ | 46GB+ | Case-based, Interpretable Models | 1 |
| CheXpert | 220K+ | GBs | Black-box | 1 |
| Facial Expression | 12K+ | 63MB | Gradient | 1 |
| Celeb | 200K | GBs | Counterfactuals, Shapley, Gradient, Perturbation | 1 |

Type: Tabular

| Dataset | #Items | Disk Size | Downstream Explanations | #Papers Used |
| --- | --- | --- | --- | --- |
| Adult | 48K+ | 10MB | Counterfactuals, Shapley | 10+ |
| COMPAS | 7K+ | 25MB | Counterfactuals, Shapley | 2 |
| FICO | 10K+ | ≤ 1MB | Counterfactuals, Shapley | 4 |
| Boston Housing | 500+ | ≤ 1MB | Counterfactuals, Shapley | 1 |
| German Credit | 1K | ≤ 1MB | Counterfactuals, Shapley | 4 |
| Student Admission | 500 | ≤ 1MB | Counterfactuals, Shapley, Gradient, Perturbation | 1 |
| Student Performance | 10K | ≤ 1MB | Counterfactuals, Shapley | 1 |
| GMSC | 150K+ | 15MB | Interpretable models, Counterfactuals | 2 |
| Diabetes | 100K+ | 20MB | Feature-based | 5 |
| Breast Cancer | 569 | < 1MB | Feature-based | 1 |

Type: Graph

| Dataset | #Items | Disk Size | Downstream Explanations | #Papers Used |
| --- | --- | --- | --- | --- |
| Cora | 2K+ | 4.5MB | Feature-based | 1 |
| Bitcoin | 30K | ≤ 1MB | Counterfactuals | 1 |
| CIC-IDS2017 | 2.8M+ | 500MB | Black-box | 1 |

Type: Text

| Dataset | #Items | Disk Size | Downstream Explanations | #Papers Used |
| --- | --- | --- | --- | --- |
| IMDB Review | 50K | 66MB | Black-box | 1 |

Evaluation Metrics

| Category | Evaluation Metric | Formula/Description | Usage |
| --- | --- | --- | --- |
| Explanation Utility | Counterfactual validity | $\text{Pureness} = \frac{\text{no. value combinations with desired outcome}}{\text{no. value combinations}}$ | Assesses the range of attribute values within k-anonymous counterfactual instances, considering all attributes, including those beyond quasi-identifiers. |
| Explanation Utility | Classification metric | $CM = \frac{\sum_{i=1}^{N} \text{penalty}(tuple_i)}{N}$ | Assesses equivalence classes within anonymised datasets, focusing on class-label uniformity. |
| Explanation Utility | Faithfulness (RDT-Fidelity) | $\mathcal{F}(\mathcal{E}_X)$ (see our paper) | Reflects how often the model's predictions remain unchanged despite perturbations to the input, which suggests that the explanation effectively captures the reasoning behind the model's predictions. |
| Explanation Utility | Sparsity | $H(p) = -\sum_{f \in M} p(f) \log p(f)$ | A complete and faithful explanation should inherently be sparse, focusing only on the small subset of features most predictive of the model's decision. |
| Information Loss | Normalised Certainty Penalty (NCP) | $\text{NCP}(G) = \sum_{i=1}^{d} w_i \cdot \text{NCP}_{A_i}(G)$ | Higher NCP values indicate a greater degree of generalisation and more information loss; this helps assess the balance between data privacy and utility. |
| Information Loss | Discernibility | $C_{DM}(g, k) = \sum_{\forall E\ \text{s.t.}\ \vert E\vert \geq k} \vert E\vert^2 + \sum_{\forall E\ \text{s.t.}\ \vert E\vert < k} \vert D\vert\,\vert E\vert$ | Measures the penalties on tuples in a dataset after k-anonymisation, reflecting how indistinguishable they are post-anonymisation. |
| Information Loss | Approximation Loss | $\mathcal{E}(\hat{\phi}, \mathcal{Z}, f(X)) \triangleq \mathbb{E} [\mathcal{L}(\hat{\phi}, \mathcal{Z}, f(X)) - \mathcal{L}(\phi^*, \mathcal{Z}, f(X))]$ | Measures the error caused by the randomness added to minimise privacy loss, as the expected deviation of the randomised explanation from the best local approximation. |
| Information Loss | Explanation Intersection | The percentage of bits in the original explanation that are retained in the privatised explanation after applying differential privacy. | The higher the better, but due to the privacy-utility trade-off this metric should not reach 100%. |
| Privacy Degree | k-anonymity | A person's information is indistinguishable from that of at least k-1 other individuals. | Refers to the number of individuals in the training dataset to whom a given explanation could potentially be linked. |
| Privacy Degree | Information Leakage | $\Pr\big[\hat{\phi}(\mathbf{z}_i, X, f_D(X)) : \forall i = 1..k\big] \leq e^{\hat{\varepsilon}} \cdot \Pr\big[\hat{\phi}(\mathbf{z}_i, X, f'_D(X)) : \forall i = 1..k\big] + \hat{\delta}$ | If an adversary can access model explanations, they gain no additional information that could help them infer something about the training data beyond what could be learned from the model predictions alone. |
| Privacy Degree | Privacy Budget | The total privacy budget for all queries is fixed at $(\varepsilon, \delta)$. | The explanation algorithm must not exceed the overall budget across all queries; a stricter per-query requirement $(\varepsilon_{min}, \delta_{min})$ is also set. |
| Attack Success | Precision/Recall/F1 | $Prec = \frac{TP}{TP+FP}$, $Rec = \frac{TP}{TP+FN}$, $F1 = 2 \times \frac{Prec \times Rec}{Prec + Rec}$ | Evaluate an attack's effectiveness in correctly and completely identifying the properties it is designed to infer. |
| Attack Success | Balanced Accuracy | $BA = \frac{TPR + TNR}{2}$ | Measures the accuracy of an attack (e.g., membership prediction in membership inference attacks) on a balanced dataset of members and non-members. |
| Attack Success | ROC/AUC | The ROC curve plots the true positive rate against the false positive rate at various threshold settings. | An AUC near 1 indicates a highly successful privacy attack, while an AUC close to 0.5 suggests performance no better than random guessing. |
| Attack Success | TPR at Low FPR | Report the TPR at a fixed, low FPR (e.g., 0.1%). | If an attack can pinpoint even a minuscule fraction of the training dataset with high precision, the attack should be deemed effective. |
| Attack Success | Mean Absolute Error (MAE) | $\ell_1 (\hat{x}, x) = \frac{1}{mn} \sum_{j=1}^{m} \sum_{i=1}^{n} \vert \hat{x}_i^j - x_i^j \vert$ | Gives an overview of how accurately an attack reconstructs private inputs by averaging the absolute differences across all samples and features. |
| Attack Success | Success Rate (SR) | $SR = \frac{\vert \hat{X}_{val} \neq \perp \vert}{mn}$ | The ratio of successfully reconstructed features to the total number of features across all samples. |
| Attack Success | Model Agreement | $\text{Agreement} = \frac{1}{n} \sum_{i=1}^{n} 1_{f_\theta(x_i) = h_\phi(x_i)}$ | Higher agreement indicates that the substitute model is more similar to the original model; when comparing two model extraction methods with equal agreement, the one with the lower standard deviation is preferred. |
| Attack Success | Average Uncertainty Reduction | $Dist(\mathcal{D}^M, \mathcal{D}^{Orig}) = \frac{1}{n \cdot d} \sum_{i=1}^{n} \sum_{k=1}^{d} \frac{H(\mathcal{D}^M_{i,k})}{H(\mathcal{D}_{i,k})}$ | Measures how accurate a data reconstruction attack is via the reduction in uncertainty across all features of all samples in the dataset. |
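As a concrete reference for a few of the formulas above, the snippet below is a minimal numpy sketch of the sparsity entropy $H(p)$, balanced accuracy, TPR at a fixed low FPR, and the reconstruction MAE. Function and variable names are illustrative and not taken from any paper's released code; exact definitions may differ slightly between papers.

```python
import numpy as np


def sparsity_entropy(attributions: np.ndarray) -> float:
    """H(p) = -sum_f p(f) log p(f), with p obtained by normalising |attributions|."""
    p = np.abs(attributions) / np.abs(attributions).sum()
    p = p[p > 0]  # treat 0 * log 0 as 0
    return float(-(p * np.log(p)).sum())


def balanced_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """BA = (TPR + TNR) / 2 for a binary attack (1 = member, 0 = non-member)."""
    tpr = (y_pred[y_true == 1] == 1).mean()
    tnr = (y_pred[y_true == 0] == 0).mean()
    return float((tpr + tnr) / 2)


def tpr_at_fpr(y_true: np.ndarray, scores: np.ndarray, target_fpr: float = 0.001) -> float:
    """TPR when the threshold is set so that at most `target_fpr` of non-members are flagged."""
    threshold = np.quantile(scores[y_true == 0], 1 - target_fpr)
    return float((scores[y_true == 1] > threshold).mean())


def reconstruction_mae(x_hat: np.ndarray, x: np.ndarray) -> float:
    """l1 error averaged over all m samples and n features."""
    return float(np.abs(x_hat - x).mean())
```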

Disclaimer

Feel free to contact us if you have any queries or exciting news. We also welcome all researchers to contribute to this repository and further the knowledge of this field.

If you have other related references, please feel free to create a GitHub issue with the paper information. We will gladly update the repository according to your suggestions. (You can also create pull requests, but it might take some time for us to do the merge.)
