Creating a bot that predicts Rossmann future sales

Note: The business problem is fictitious, although both the company and the data are real.

The in-depth Python code explanation is available in this Jupyter Notebook.

1. Abstract

This Data Science project was developed with Rossmann data available on Kaggle in order to predict sales of the next six weeks for each store and determine the best resource allocation for each store renovation.

An XGBoost machine learning model was trained to make the sales predictions, reaching a MAPE (mean absolute percentage error) of 14% and predicting total sales of €283.7M over the following six weeks.

The architecture of the project is shown in the image below:

The solution was deployed at Heroku Cloud and the sales forecasts can be accessed through a Telegram bot available here.

2. Data Overview

The data was collected from Kaggle. The dataset contains historical sales data for 1,115 Rossmann stores. The initial feature descriptions are listed below:

Feature	Definition
Id	an Id that represents a (Store, Date) pair within the dataset.
Store	a unique Id for each store.
Sales	the turnover for any given day.
DayOfWeek	day of the week on which the sale was made (e.g. DayOfWeek=1 -> Monday, DayOfWeek=2 -> Tuesday, etc).
Date	date on which the sale was made.
Customers	the number of customers on a given day.
Open	an indicator for whether the store was open: 0 = closed, 1 = open.
StateHoliday	indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None.
SchoolHoliday	indicates if the (Store, Date) was affected by the closure of public schools.
StoreType	differentiates between 4 different store models: a, b, c, d.
Assortment	describes an assortment level: a = basic, b = extra, c = extended.
CompetitionDistance	distance in meters to the nearest competitor store.
CompetitionOpenSince(Month/Year)	gives the approximate year and month of the time the nearest competitor was opened.
Promo	indicates whether a store is running a promo on that day.
Promo2	Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating.
Promo2Since(Year/Week)	describes the year and calendar week when the store started participating in Promo2.
PromoInterval	describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store.

3. Assumptions

The Customers column was dropped, because the number of customers six weeks into the future is not known at prediction time.
The NaNs in CompetitionDistance were replaced by 3 times the maximum CompetitionDistance in the dataset, on the assumption that a missing value means the nearest competitor is too far away to matter, i.e. there is effectively no competition.
Some new features were created to better describe the problem:

New Feature	Definition
day/week_of_year/year_week/month/year	day/week_of_year/year_week/month/year extracted from 'date' column.
day/day_of_week/week_of_year/month(sin/cos)	sin/cos component of each period, to capture their cyclical behavior.
competition_time_month	number of months since the nearest competitor opened.
promo_time_week	number of weeks since the promotion became active.
state_holiday(christmas/easter_holiday/public_holiday/regular_day)	indicates whether the sale was made on Christmas, Easter, a public holiday or a regular day.
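
Two of the items above lend themselves to a short sketch: filling missing CompetitionDistance values with three times the observed maximum, and encoding a periodic feature as a sin/cos pair. The helper names (`fill_competition_distance`, `cyclical_encode`) and the sample values are hypothetical, and plain Python is used only to keep the example dependency-free; the notebook may implement these steps differently.

```python
import math

def fill_competition_distance(distances):
    """Replace missing distances (None or NaN) with 3x the largest observed one."""
    observed = [d for d in distances if d is not None and not math.isnan(d)]
    fill_value = 3 * max(observed)
    return [fill_value if d is None or math.isnan(d) else d for d in distances]

def cyclical_encode(value, period):
    """Map a periodic value (e.g. month 1..12) to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# Hypothetical sample values, not taken from the Rossmann dataset.
filled = fill_competition_distance([1270.0, float("nan"), 570.0, None])

december = cyclical_encode(12, 12)
january = cyclical_encode(1, 12)
july = cyclical_encode(7, 12)
```

With this encoding December and January end up close to each other on the circle, whereas the raw month numbers 12 and 1 are as far apart as possible.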

4. Solution Plan

4.1. How was the problem solved?

To predict sales values for each store (six weeks in advance) a Machine Learning model was applied. To achieve that, the following steps were performed:

Understanding the Business Problem : Understanding why Rossmann's CEO requested this task, and planning the solution.
Collecting Data : Collecting Rossmann store and sales data from Kaggle.
Data Cleaning : Renaming columns, changing data types and filling NaN's.
Feature Engineering : Creating new features from the original ones, so that those could be used in the ML model.
Exploratory Data Analysis (EDA) : Exploring the data in order to gain business knowledge, look for useful business insights and find important features for the ML model.
Data Preparation : Applying Normalization and Rescaling Techniques to the data, as well as Encoding Methods and a Response Variable Transformation.
Feature Selection : Selecting the best features to use in the ML model by applying the Boruta Algorithm.
Machine Learning Modeling : Training Regression Algorithms with time series cross-validation. The best model was selected to be improved via Hyperparameter Tuning.
Model Evaluation : Evaluating the model using three metrics: MAE, MAPE and RMSE.
Financial Results : Translating the ML model's statistical performance to financial and business performance.
Model Deployment (Telegram Bot) : Implementing a Telegram Bot that returns the sales prediction for any available store number. This is the project's Data Science Product, and it can be accessed from anywhere.
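
The Data Preparation step mentions a response variable transformation without naming it. A minimal sketch, assuming a log transform (`log1p`), which is a common choice for right-skewed sales figures; this is an assumption and the sample values are made up.

```python
import math

# Hypothetical daily sales values; log1p is an ASSUMED transformation.
sales = [0.0, 5263.0, 6064.0, 8314.0, 13995.0]

# log(1 + s) compresses large values and is defined at s = 0.
transformed = [math.log1p(s) for s in sales]

# Predictions made on the transformed scale are mapped back with the inverse.
recovered = [math.expm1(t) for t in transformed]
```

Whatever transformation is used, the error metrics have to be computed after mapping the predictions back to the original scale, otherwise they are not expressed in currency.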

4.2. Tools and techniques used:

Python 3.9.13, Pandas, Matplotlib, Seaborn and Sklearn.
Jupyter Notebook and VSCode.
Flask and Python APIs.
Ngrok and Telegram Bot.
Git and Github.
Exploratory Data Analysis (EDA).
Techniques for Feature Selection.
Regression Algorithms (Linear and Lasso Regression; Random Forest and XGBoost Regressors).
Cross-Validation Methods, Hyperparameter Optimization and Algorithms Performance Metrics (RMSE, MAE and MAPE).

5. Machine Learning Models

This was the central part of the project, since the ML model is what produces the sales predictions for each store. An average model was used as a baseline, and four models were trained using time series cross-validation:

Linear Regression
Lasso Regression (Regularized Linear Regression)
Random Forest Regressor
XGBoost Regressor
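
A minimal sketch of the time series cross-validation idea, assuming an expanding training window and validation blocks of six weeks each, to match the forecast horizon. The function names, the number of folds and the example date are assumptions, not the notebook's actual code.

```python
from datetime import date, timedelta

def time_series_folds(last_date, n_folds, horizon_weeks=6):
    """Yield (valid_start, valid_end) pairs, oldest fold first.

    valid_end is exclusive. Training data for a fold is everything
    strictly before valid_start, so the training window grows per fold.
    """
    horizon = timedelta(weeks=horizon_weeks)
    end = last_date + timedelta(days=1)  # exclusive upper bound of the data
    for k in range(n_folds, 0, -1):
        valid_start = end - k * horizon
        yield valid_start, valid_start + horizon

def split_rows(rows, valid_start, valid_end):
    """Split (day, value) rows into training and validation lists for one fold."""
    train = [r for r in rows if r[0] < valid_start]
    valid = [r for r in rows if valid_start <= r[0] < valid_end]
    return train, valid

# Example: five folds ending on an illustrative last training date.
folds = list(time_series_folds(date(2015, 7, 31), n_folds=5))

# Hypothetical rows: one observation per day over the last 300 days.
rows = [(date(2015, 7, 31) - timedelta(days=i), float(i)) for i in range(300)]
train, valid = split_rows(rows, *folds[-1])
```

Each fold validates on a block that lies strictly after its training data, so the model is never scored on dates it has already seen.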

The baseline model performance is displayed below:

Model Name	MAE	MAPE	RMSE
Average Model	1354.80	0.2064	1835.14
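
A minimal sketch of an average baseline and the three error metrics. It assumes the baseline predicts each store's mean historical sales, which the text does not spell out; the data and function names are illustrative.

```python
import math
from collections import defaultdict

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction.

    Undefined when a true value is 0, so such rows must be removed first.
    """
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def average_model(train_rows, test_stores):
    """Predict, for each test row, the mean of that store's training sales."""
    totals = defaultdict(lambda: [0.0, 0])  # store -> [sum of sales, row count]
    for store, sales in train_rows:
        totals[store][0] += sales
        totals[store][1] += 1
    return [totals[s][0] / totals[s][1] for s in test_stores]

# Hypothetical (store, sales) pairs, not taken from the dataset.
train_rows = [(1, 100.0), (1, 200.0), (2, 400.0), (2, 600.0)]
test_stores = [1, 2]
y_true = [120.0, 500.0]
y_pred = average_model(train_rows, test_stores)

baseline = {"MAE": mae(y_true, y_pred), "MAPE": mape(y_true, y_pred), "RMSE": rmse(y_true, y_pred)}
```

RMSE penalizes large errors more heavily than MAE, which is why it is the larger of the two in every table of this section.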

The initial cross-validation performance of the four algorithms is displayed below:

Model Name	MAE	MAPE	RMSE
Random Forest Regressor	1104.87 +/- 209.75	0.16 +/- 0.03	1530.38 +/- 273.38
XGBoost Regressor	1179.33 +/- 111.96	0.17 +/- 0.01	1639.3 +/- 148.34
Linear Regression	2079.0 +/- 280.91	0.3 +/- 0.01	2955.05 +/- 426.56
Lasso Regression	2090.34 +/- 307.94	0.3 +/- 0.01	2995.12 +/- 458.89

Both Linear Regression and Lasso Regression perform worse than the simple Average Model. This suggests nonlinear behavior in the dataset, which motivates the use of more complex models such as Random Forest and XGBoost.

The XGBoost model was chosen for Hyperparameter Tuning. Although Random Forest has slightly better metrics, XGBoost was preferred because it is much faster to train and tune.

After tuning XGBoost's hyperparameters with Random Search, the model performance improved:

Model Name	MAE	MAPE	RMSE
XGBoost Regressor	949.88	0.1436	1336.92
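
A minimal sketch of Random Search: draw a fixed number of random hyperparameter combinations and keep the one with the lowest error. The parameter grid and the `toy_evaluate` function are placeholders. In the project, the evaluation step would train XGBoost with the sampled parameters and return a cross-validated error, which is not reproduced here.

```python
import random

def random_search(param_grid, evaluate, n_iter, seed=42):
    """Return (best_params, best_score) over n_iter random draws from param_grid."""
    rng = random.Random(seed)  # fixed seed keeps the search reproducible
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid: names mirror common XGBoost parameters, values are illustrative.
param_grid = {
    "n_estimators": [1500, 1700, 2500, 3000],
    "eta": [0.01, 0.03],
    "max_depth": [3, 5, 9],
}

def toy_evaluate(params):
    # Stand-in for a cross-validated error; NOT a real model.
    return abs(params["max_depth"] - 5) + params["eta"]

best_params, best_score = random_search(param_grid, toy_evaluate, n_iter=10)
```

Unlike an exhaustive grid search, the cost is fixed by `n_iter` rather than by the size of the grid, which matters when every evaluation means a full cross-validated training run.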

5.1. Brief Financial Results:

The two tables below summarize the financial results given by the XGBoost model.

MAE and MAPE are two useful metrics for evaluating the financial performance of this solution. The table below shows them for a few stores:

Store	Predictions (€)	Worst Scenario (€)	Best Scenario (€)	MAE (€)	MAPE
1	164,545.94	150,086.63	179,005.24	14,459.31	0.09
2	178,759.59	151,883.56	205,635.62	26,876.03	0.15
3	266,517.19	231,827.11	301,207.26	34,690.07	0.13
4	340,026.47	303,667.24	376,385.70	36,359.22	0.10
5	170,492.62	132,908.07	208,077.14	37,584.53	0.22
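
The scenario columns in the table above are consistent with taking the prediction minus the MAE as the worst case and the prediction plus the MAE as the best case. A minimal sketch under that assumption, using the first two rows of the table; the function name is hypothetical.

```python
def scenarios(prediction, mae):
    """Return (worst, best) as prediction -/+ MAE, rounded to cents."""
    return round(prediction - mae, 2), round(prediction + mae, 2)

# (prediction, MAE) in EUR for stores 1 and 2, copied from the table above.
table = {1: (164545.94, 14459.31), 2: (178759.59, 26876.03)}

results = {store: scenarios(pred, mae) for store, (pred, mae) in table.items()}
```

Recomputed values can differ from the table by a cent, because the table's inputs are themselves rounded.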

According to this model, the sales sum for all stores over the next six weeks is:

Scenario	Total Sales over the Next 6 Weeks (€)
Prediction	283,742,272.00
Worst Scenario	244,033,471.48
Best Scenario	323,451,121.16

6. Model Deployment

As previously mentioned, the complete financial results can be consulted through the Telegram Bot. The idea is to make any store's sales prediction easy to access: it can be checked from anywhere and from any device with an internet connection. The bot returns the sales prediction over the next six weeks for any available store. All you have to do is send it the store number in the format "/store_number" (e.g. /12, /23, /41, etc). If the store number does not exist, the message "Store not available" is returned, and if you send text that isn't a number, the bot asks you to enter a valid store id.
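
The message handling described above can be sketched as a small parsing function. The function name, the `available_stores` argument and the wording of the invalid-input reply are assumptions; only the "Store not available" text comes from the description.

```python
def handle_message(text, available_stores):
    """Return ("ok", store_id) or ("error", reply_text) for an incoming message."""
    text = text.strip()
    command = text[1:] if text.startswith("/") else ""
    # Accept only plain ASCII digits after the leading slash.
    if not (command.isascii() and command.isdigit()):
        return "error", "Please enter a valid store id"  # wording is assumed
    store_id = int(command)
    if store_id not in available_stores:
        return "error", "Store not available"
    return "ok", store_id

available = {12, 23, 41}  # hypothetical subset of store numbers

valid_reply = handle_message("/12", available)
missing_reply = handle_message("/99", available)
invalid_reply = handle_message("/abc", available)
```

Keeping the parsing separate from the Telegram and prediction calls makes it easy to test without a network connection.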

The link to chat with the Rossmann Bot is

Because the bot is deployed on a free cloud tier (Render), it can take a few minutes to respond to the first request. Subsequent requests should be answered instantly.

7. Conclusion

In this project the main objective was accomplished:

A model that provides good sales predictions for each store over the next six weeks was successfully trained and deployed in a Telegram Bot. This fulfills the CEO's requirement: it is now possible to determine the best resource allocation for each store renovation.

Contact

igorviniciusgpereira@gmail.com
