CangyuanLi/race_ml

race_ml

License: MIT | Checked with mypy | Code style: black | Imports: isort

This repository contains the code (and selected data) necessary to train the model used by the pyethnicity package. The model draws its names, zip codes, and self-reported race from novel L2 voter registration data, combined across all 50 states. This repository is still in active development.

I train a bidirectional LSTM to learn the association between name and race, then add location features using naive Bayes. I also use the L2 data to expand the datasets behind Bayesian Improved Surname Geocoding (BISG) and Bayesian Improved First Name Surname Geocoding (BIFSG), and combine the outputs of those models with the LSTM. The resulting ensemble achieves up to 36.8% higher F1 scores than the next-best-performing model.
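To make the naive Bayes and ensembling steps concrete, here is a minimal sketch, not the repository's actual code. It assumes name and location are conditionally independent given race, so that P(race | name, location) is proportional to P(race | name) * P(location | race), which is the same kind of update BISG applies. The function names, the four race categories, the unweighted averaging rule, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict

Probs = Dict[str, float]


def normalize(scores: Probs) -> Probs:
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {race: value / total for race, value in scores.items()}


def naive_bayes_update(p_race_given_name: Probs, p_location_given_race: Probs) -> Probs:
    """Combine a name-based prediction with a location likelihood.

    Assumes name and location are independent given race, so the posterior
    is proportional to P(race | name) * P(location | race).
    """
    unnormalized = {
        race: p_race_given_name[race] * p_location_given_race[race]
        for race in p_race_given_name
    }
    return normalize(unnormalized)


def average_ensemble(*predictions: Probs) -> Probs:
    """Unweighted mean of several probability vectors (an assumed combination rule)."""
    races = predictions[0].keys()
    mean = {race: sum(p[race] for p in predictions) / len(predictions) for race in races}
    return normalize(mean)


# Made-up numbers: a name model's output and the share of each group living in a zip code.
name_probs = {"asian": 0.10, "black": 0.20, "hispanic": 0.10, "white": 0.60}
zip_likelihood = {"asian": 0.001, "black": 0.004, "hispanic": 0.001, "white": 0.001}

posterior = naive_bayes_update(name_probs, zip_likelihood)

# A second, equally made-up prediction standing in for a BISG-style model.
other_model = {"asian": 0.05, "black": 0.45, "hispanic": 0.10, "white": 0.40}
ensemble = average_ensemble(posterior, other_model)
```

In this toy case the location likelihood shifts the most probable category from "white" (name only) to "black" (name plus zip code), which is the effect the location features are meant to capture. How the outputs are actually weighted in the ensemble is not stated here, so the simple average above is only a placeholder.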

Please see

pyethnicity

rethnicity

ethnicolr

About

Predict race from name and location (model development)
