Entropy-targeted active learning for bias mitigation in materials data.

Henrium/ET-AL

Entropy-Targeted Active Learning

This repository contains an implementation of entropy-targeted active learning (ET-AL) for materials data bias mitigation, associated with our paper.

ET-AL algorithm

Copyright

This code is open-sourced under the MIT license. Feel free to use all or portions for your research or related projects so long as you provide the following citation information:

Hengrui Zhang, Wei (Wayne) Chen, James M. Rondinelli, and Wei Chen, ET-AL: Entropy-targeted active learning for bias mitigation in materials data, Applied Physics Reviews 10, 021403 (2023).

@article{zhang2023etal,
    author = {Zhang, Hengrui and Chen, Wei Wayne and Rondinelli, James M. and Chen, Wei},
    title = {ET-AL: entropy-targeted active learning for bias mitigation in materials data},
    journal = {Applied Physics Reviews},
    volume = {10},
    number = {2},
    pages = {021403},
    year = {2023},
    doi = {10.1063/5.0138913},
    url = {https://doi.org/10.1063/5.0138913}
}

Descriptions

etal_main.py implements the ET-AL algorithm and demonstrates it on the Jarvis CFID dataset.

ML_comparison.ipynb compares several ML models on different training sets.

plot_data.ipynb is used for creating relevant plots for visualization.

datasets/ provides data required for reproducing the results in our paper.

results/ contains data generated in the ET-AL demonstration on the Jarvis CFID dataset.

utils/ contains tools for data pre-processing:

  • Jarvis_data.ipynb retrieves and cleans the Jarvis CFID data and generates graph embeddings.
  • Jarvis_featurize.ipynb generates physical descriptors for the Jarvis CFID data.
  • compound_featurizer.py is an automated tool for generating physical descriptors.
  • cgcnn/ contains the CGCNN model used for graph embeddings.

Usage

Set up environment

Navigate to the code directory and create the environment:

conda env create -f environment.yml

Then activate the new environment:

conda activate gp-torch

Data preparation

Organize the dataset in a DataFrame and update the data paths in etal_main.py. For demonstration purposes, a dataset derived from the Jarvis CFID data is provided in datasets/: the crystal structures and properties are in data_cleaned.pkl, and the graph embeddings are in cgcnn_embeddings.pkl.

Note: Git LFS is required to download data_cleaned.pkl properly. If you do not have Git LFS, please download the file manually.
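When a repository is cloned without Git LFS, LFS-tracked files are replaced by small text pointer files, and loading such a file as a pickle fails with an unhelpful error. The following is a minimal sketch of a pre-flight check, not part of this repository: the helper names `is_lfs_pointer` and `check_dataset_file` are hypothetical. It relies only on the fact that Git LFS pointer files start with a fixed `version` line.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like a Git LFS pointer rather than real data."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def check_dataset_file(path):
    """Return `path` as a Path, or raise if it is missing or is an LFS pointer.

    Hypothetical helper for illustration; etal_main.py does not call it.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found")
    if is_lfs_pointer(path):
        raise RuntimeError(
            f"{path} is a Git LFS pointer, not the actual data; "
            "run `git lfs pull` or download the file manually"
        )
    return path
```

For example, calling `check_dataset_file("datasets/data_cleaned.pkl")` before loading the data would surface a missing LFS download early.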

Run code

  1. Set the experimental parameters in etal_main.py: n_iter is the maximum number of ET-AL iterations, n_test is the number of data points held out as the test set, and n_unlabeled is the number of data points treated as unlabeled. To change how the unlabeled data are selected, edit the corresponding part of etal_main.py.

  2. Run ET-AL model:

python etal_main.py
  3. Run ML_comparison.ipynb to compare ML models on training sets generated by ET-AL sampling and by random sampling.

  4. Use plot_data.ipynb to visualize the results and reproduce the plots in the paper.
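The n_test and n_unlabeled parameters from the first step describe a three-way partition of the dataset into test, unlabeled, and initially labeled points. The sketch below illustrates such a partition under the assumption of a uniform random split. The function `split_indices` and its `seed` argument are hypothetical, and the selection logic actually used in etal_main.py may differ.

```python
import random


def split_indices(n_total, n_test, n_unlabeled, seed=0):
    """Randomly partition range(n_total) into labeled, unlabeled, and test indices.

    Assumption: a uniform random split; this is an illustration, not the
    selection rule implemented in etal_main.py.
    """
    if n_test < 0 or n_unlabeled < 0 or n_test + n_unlabeled > n_total:
        raise ValueError("need n_test, n_unlabeled >= 0 and n_test + n_unlabeled <= n_total")
    rng = random.Random(seed)  # local RNG so the split is reproducible
    indices = list(range(n_total))
    rng.shuffle(indices)
    test = indices[:n_test]
    unlabeled = indices[n_test:n_test + n_unlabeled]
    labeled = indices[n_test + n_unlabeled:]  # everything left starts as labeled
    return labeled, unlabeled, test
```

The returned index lists are disjoint and together cover every data point, so they can be used to slice a DataFrame by position. Fixing the seed keeps the split identical across runs, which helps when comparing ET-AL sampling against random sampling.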
