made-with-python MIT license Inline docs Known Vulnerabilities

US Census Data - Creating and Deploying a Classifier Pipeline as Web Service

This is the third project of the Udacity MLOps Engineer Nanodegree, called Deploying a Scalable Pipeline in Production. Its instructions are available in Udacity's repository.

We develop a classification model artifact for production on publicly available US Census Bureau data and, as the business goal, monitor the model performance on various data slices.

Regarding the data science goals for this classification task, we start with the ETL (Extract, Transform, Load) transformer pipeline, including EDA (Exploratory Data Analysis) activities, diagrams and reports, followed by the ML (Machine Learning) pipeline for the investigated prediction model, in our case a binary XGBoost classifier. The estimator is selected via cross-validation with early stopping during the training phase. The best estimator according to the evaluation metrics is selected as the deployment artifact, together with the associated column transformer used for data preprocessing.
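The following is only a minimal sketch of this selection step, assuming typical scikit-learn and XGBoost usage; the file name, target encoding and parameter ranges are illustrative, not the project's actual values:

    # Hedged sketch: column transformer plus XGBoost selection via cross-validation with
    # early stopping; file name, target encoding and parameter ranges are assumptions.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import RandomizedSearchCV, train_test_split
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from xgboost import XGBClassifier

    df = pd.read_csv("data/census_cleaned.csv")              # assumed file name
    y = (df.pop("salary") == ">50K").astype(int)             # assumed binary target encoding
    cat_cols = df.select_dtypes(include="object").columns.tolist()
    num_cols = df.select_dtypes(include="number").columns.tolist()

    preprocessor = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])

    X_train, X_valid, y_train, y_valid = train_test_split(df, y, stratify=y, random_state=42)
    X_train_t = preprocessor.fit_transform(X_train)          # transformer is stored with the model
    X_valid_t = preprocessor.transform(X_valid)

    # Early stopping needs an evaluation set; the parameter ranges would come from config.yml.
    search = RandomizedSearchCV(
        XGBClassifier(eval_metric="logloss", early_stopping_rounds=10),
        param_distributions={"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1, 0.3]},
        n_iter=5, cv=3, scoring="f1",
    )
    search.fit(X_train_t, y_train, eval_set=[(X_valid_t, y_valid)], verbose=False)
    best_estimator = search.best_estimator_                  # selected as deployment artifact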

General information about the deployed XGBoost classifier, the data used, the training conditions and the evaluation results can be found in the Model Card description.

Regarding software engineering principles, besides documentation, logging and Python style, we create unit tests. Slice validation and the tests are incorporated into a CI/CD framework using GitHub Actions. Then, the model is deployed as a web service using the FastAPI web framework and Render.

The unit tests are written with pytest for the GET and POST prediction requests of the FastAPI component as well as for the mentioned data and model task parts. All unit test results are reported in associated HTML files in the tests directory.
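Such an API test can look roughly like the following sketch using FastAPI's TestClient; the import path, endpoint name, payload fields and response keys are assumptions, not the project's exact API:

    # Hedged sketch of API unit tests; import path, route, payload and response keys are assumptions.
    from fastapi.testclient import TestClient

    from src.main import app    # assumed import path of the FastAPI app

    client = TestClient(app)

    def test_get_root():
        response = client.get("/")
        assert response.status_code == 200

    def test_post_prediction():
        payload = {
            "age": 39,
            "workclass": "State-gov",
            "education": "Bachelors",
            "hours-per-week": 40,    # hyphenated names are handled via pydantic aliases
        }
        response = client.post("/prediction", json=payload)    # assumed endpoint name
        assert response.status_code == 200
        assert "prediction" in response.json()                  # assumed response key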

All project-relevant configuration values, including the model hyperparameter ranges for cross-validation, are handled via a specific config.yml configuration file.
For versioning, git and dvc are chosen, together with their respective ignore files. If a remote storage like AWS S3 or Azure shall be used in the future, dvc[all] for the selected dvc version is already installed via the requirements.txt file, along with its specific configuration. For now, only a dvc 'local' remote is set.
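Reading such a configuration file can look roughly like this sketch (assuming PyYAML; the keys shown are illustrative, not the actual config.yml content):

    # Hedged sketch: reading project settings and hyperparameter ranges from config.yml
    # with PyYAML; the keys shown are illustrative assumptions.
    import yaml

    with open("config.yml", "r", encoding="utf-8") as config_file:
        config = yaml.safe_load(config_file)

    data_path = config["data"]["cleaned_csv"]              # e.g. path to the cleaned dataset
    param_ranges = config["model"]["xgb_param_ranges"]     # ranges used for cross-validation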

Environment Set up

  • Working in a command line environment is recommended for ease of use with git and dvc. For this project implementation, Windows with WSL2 and Ubuntu (Linux) was chosen.

  • We expect that you have at least Python 3.10.9 installed (e.g. via conda) and that you have forked this project repository and cloned it locally to work on your own copy. In your root directory path/to/census-project, create a new virtual environment depending on the selected OS, activate it, and use the supplied requirements.txt file to install the needed libraries, e.g. via

      pip install -r requirements/requirements.txt
    

or use

  conda create -n [envname] "python=3.10.9" scikit-learn pandas numpy pytest jupyter jupyterlab fastapi uvicorn ... <the-needed-library-list> -c conda-forge

Project Structure

  • The main source files are stored in the src subdirectory and the test scripts in the tests subdirectory of the project root. The FastAPI RESTful web application is started via the main.py file stored in the src directory; the associated schemas and request example data are part of the src/app subdirectory. All administrative asset files, like plots, screenshots, configuration and logs, as well as the model and dataset files, are stored in their own directories in parallel to the source code.

  • The general project structure looks like:
    (proj3 structure screenshot)

  • In our GitHub repository, an automatic Actions workflow is set up to check, amongst others, dependencies, linting and unit testing (github action screenshot).


Data

  • The downloaded raw census.csv file is preprocessed and stored as a new .csv file. Both files are committed and versioned with dvc; a hedged preprocessing sketch follows at the end of this section.
  • Some exploratory data analysis is implemented and visualised. The results are stored as .png plot or screenshot files in the associated directories.

Examples are shown below, covering amongst others the distributions of hours-per-week, salary, capital-gain and education by a few feature attributes like age, sex or race. As investigated, the data show some bias towards men (twice as many as women) and white people. Furthermore, it is interesting that, regarding capital gain, women much more often reach a considerably higher value for fewer working hours compared to men. In general, people work more than 40 hours per week if they are between 25 and 60 years old.

Several other insights are visualised and stored as .png files. So, have a look there if you are interested in further analysis.

Example plots: feature distributions by sex, hours-per-week by age boxplots, salary distribution by age and sex, capital-gain distribution by hours-per-week and sex, sex counts, people count per education level, and education level grouping by age and race.
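The mentioned preprocessing step is not spelled out in this README; the following is only a rough sketch of typical cleaning steps for this kind of census data, so file names and steps are assumptions that may differ from the project's actual implementation:

    # Hedged sketch of typical census.csv cleaning; the actual project steps may differ.
    import pandas as pd

    df = pd.read_csv("data/census.csv", skipinitialspace=True)
    df.columns = df.columns.str.strip()                            # remove stray spaces in headers
    df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
    df = df.replace("?", pd.NA).dropna().drop_duplicates()         # drop unknown and duplicate rows
    df.to_csv("data/census_cleaned.csv", index=False)              # afterwards versioned with dvc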


Model

  • As the machine learning model that trains on the cleaned data, an XGBoost classifier is selected, and the best found and evaluated estimator is stored as a pickle file (...artifact.pkl) in the associated model directory.

  • Additionally, a function exists that outputs the performance of the model on slices of the categorical features. The performance evaluation metrics of such categorical census feature slices are stored in a slice_output.txt file; a hedged sketch of such a slice evaluation follows after this list. As an example, a metric block looks like:

      workclass - Private:
      Precision: 0.83, Recall: 0.66, Fbeta: 0.73
      Confusion Matrix: 
      [[2907  119]
      [ 297  572]]  
      
      workclass - Self-emp-not-inc:
      Precision: 0.83, Recall: 0.57, Fbeta: 0.68
      Confusion Matrix: 
      [[358  16]
      [ 58  77]]
      
      ...
    
  • As mentioned, the Model Card documents our insights about the binary classification estimator, including evaluation diagrams and general metrics.
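The mentioned slice evaluation can be sketched roughly as follows; function and variable names are illustrative assumptions, not the project's actual implementation:

    # Hedged sketch of the per-slice performance output; names are illustrative assumptions.
    from sklearn.metrics import confusion_matrix, fbeta_score, precision_score, recall_score

    def slice_performance(df, y_true, y_pred, feature, out_path="slice_output.txt"):
        """Append precision, recall, F-beta and confusion matrix for every value of one categorical feature."""
        with open(out_path, "a", encoding="utf-8") as f:
            for value in sorted(df[feature].unique()):
                mask = (df[feature] == value).to_numpy()
                precision = precision_score(y_true[mask], y_pred[mask], zero_division=0)
                recall = recall_score(y_true[mask], y_pred[mask], zero_division=0)
                fbeta = fbeta_score(y_true[mask], y_pred[mask], beta=1, zero_division=0)
                cm = confusion_matrix(y_true[mask], y_pred[mask])
                f.write(f"{feature} - {value}:\n")
                f.write(f"Precision: {precision:.2f}, Recall: {recall:.2f}, Fbeta: {fbeta:.2f}\n")
                f.write(f"Confusion Matrix: \n{cm}\n\n")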


API Creation

  • As the web framework to create a RESTful API, FastAPI is chosen for the app implementation. A pydantic BaseModel instance handles the POST body, e.g. dealing with hyphens in data feature names, which are not allowed in Python identifiers; see the hedged schema sketch at the end of this section.

  • As the high-performance ASGI server, uvicorn is selected. The FastAPI web app uvicorn server can be started in the project's root directory via the CLI python command:

      python ./src/main.py
    

There in "main" it calls

  uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)

Remember, this code is for development purposes; in production the reload option shall be set to False or not be used at all. In other words, the start command, e.g. on our Render deployment web service (see below), is:
uvicorn src.main:app --host 0.0.0.0 --port 8000

  • So, locally we open the implemented web application in the browser via

    http://127.0.0.1:8000/docs
    or
    http://localhost:8000/docs
    

As an example, regarding the use case of a person earning <=50K as income, you are going to get the following UIs:

fastapi income negative

fastapi income negative response
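The mentioned pydantic schema handling of hyphenated census feature names can be sketched roughly as follows; the field set is shortened and the route name is an assumption:

    # Hedged sketch: pydantic schema mapping hyphenated census feature names to valid Python identifiers.
    from fastapi import FastAPI
    from pydantic import BaseModel, Field

    app = FastAPI()

    class CensusItem(BaseModel):
        # shortened field set; aliases let the request JSON keep the original hyphenated names
        age: int
        workclass: str
        education_num: int = Field(alias="education-num")
        hours_per_week: int = Field(alias="hours-per-week")

    @app.post("/prediction")          # assumed route name
    async def predict(item: CensusItem):
        # the real app would preprocess the item and run the stored model artifact here
        return {"prediction": "<=50K"}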


API Deployment

  • For our web service deployment, we use Render with a free account. From the Render.com landing page, click the "Get Started" button to open the sign-up page. You can create an account by linking your GitHub, GitLab, or Google account, or by providing your email and password. Then, the Render account must be connected with our GitHub account so that the Render services can be used. Keep in mind that shell access and jobs are not supported for free instance types. As stated by FastAPI's creator tiangolo: "For a web API, it normally involves putting it in a remote machine, with a server program that provides good performance, stability, etc, so that your users can access the application efficiently and without interruptions or problems." But using a free account, the service is limited.

  • Our new application is deployed from our public GitHub repository by creating a new Web Service for this specific project's GitHub URL.

render web service


  • Because the default Render Python version is 3.7 and this version has issues with dvc, the environment variable PYTHON_VERSION has to be configured to version 3.10.9.
  • After selection, Render starts its advanced deployment configuration; some parameters are already set, some have to be set manually and appropriately. Render guides you through with easy-to-handle UIs.

render deploy settings


  • That's it. Implement coding changes, push to the GitHub repository, and the app will automatically redeploy each time, but it will only deploy if your continuous integration action passes.

render app deployed


  • Using the automatically created Render census-project app link in the browser,
    https://census-project-xki0.onrender.com
    
    we get the welcome page message

render app welcome


  • Have in mind: if you rely on your CI/CD to fail before fixing an issue, it slows down your deployment. Fix issues early, e.g. by running an ensemble linter like flake8 locally before committing changes.
  • For checking the Render deployment, a Python file exists that uses the httpx module to do one GET and one POST request on the live Render web service and prints the results; a hedged sketch of this follows below.

On the Render web service site after deployment (render web service live screenshot), and as the result of the httpx test script for GET and POST (render web service test and script result screenshots).
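Such a live check can be sketched roughly as follows; the endpoint name and payload fields are assumptions:

    # Hedged sketch of a live-service check with httpx; endpoint name and payload are assumptions.
    import httpx

    BASE_URL = "https://census-project-xki0.onrender.com"

    payload = {
        "age": 39,
        "workclass": "State-gov",
        "education": "Bachelors",
        "hours-per-week": 40,
    }

    get_response = httpx.get(f"{BASE_URL}/", timeout=30.0)
    print("GET:", get_response.status_code, get_response.json())

    post_response = httpx.post(f"{BASE_URL}/prediction", json=payload, timeout=30.0)
    print("POST:", post_response.status_code, post_response.json())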


License

The code of this project is released under the MIT license.
