Hi, good to see you. My name is Mateus, and this repository is a study of sales prediction for the Rossmann drug store chain.
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.
Most of the fields are self-explanatory. The following are descriptions for those that aren't.
Id - an Id that represents a (Store, Date) pair within the test set
Store - a unique Id for each store
Sales - the turnover for any given day (this is what you are predicting)
Customers - the number of customers on a given day
Open - an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
StoreType - differentiates between 4 different store models: a, b, c, d
Assortment - describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
Promo - indicates whether a store is running a promo on that day
Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
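As an illustration of how the Promo2 and PromoInterval fields described above might be combined, here is a minimal sketch that checks whether a date falls in a month where a Promo2 round starts. The function name `is_promo2_start_month` and the `MONTHS` list are hypothetical, and the three-letter month abbreviations are assumed from the "Feb,May,Aug,Nov" example; the real data may spell some months differently.

```python
from datetime import date

# Assumption: months are abbreviated as in the example "Feb,May,Aug,Nov".
# The actual dataset may use other spellings (for example "Sept").
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def is_promo2_start_month(day: date, promo2: int, promo_interval: str) -> bool:
    """Return True if `day` falls in a month where a Promo2 round starts."""
    # A store that does not participate in Promo2 never has a start month.
    if promo2 != 1 or not promo_interval:
        return False
    start_months = [m.strip() for m in promo_interval.split(",")]
    return MONTHS[day.month - 1] in start_months


print(is_promo2_start_month(date(2015, 8, 3), 1, "Feb,May,Aug,Nov"))
print(is_promo2_start_month(date(2015, 7, 3), 1, "Feb,May,Aug,Nov"))
```

A flag like this could serve as a derived feature, but whether the project builds it this way is not stated here.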
This project follows the CRISP-DM methodology, a cyclic process that can be repeated for n iterations and is divided into 6 steps:
- Business understanding: how the business works and what the business problem is.
- Data understanding: which data will be used, how they are collected, and an initial exploration.
- Data preparation: the data are converted, features are extracted, and other procedures that bring the project closer to the solution are carried out.
- Data modeling: the data are modeled and explored.
- Model evaluation: the models are applied and their results evaluated.
- Deployment: the model is put into production.
In this project, we compare several machine learning models and choose the one that best fits the data for production use.
We use Boruta to select the most relevant features for training the model (Section 6.3). Boruta returned the following features as the most important: Store, Promo, Store Type, Assortment, Competition Distance, Competition Open Since Month, Competition Open Since Year, Promo2, Promo2 Since Week, Promo2 Since Year, Competition Time Month, Promo Time Week.
Four machine learning models were compared to see which would best fit the business problem in this first iteration. The models were:
- Random Forest Regressor
- Linear Regression
- Lasso
- XGBoost Regressor
The models had the following performances:
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
Random Forest Regressor | 837.68 +/- 218.12 | 0.12 +/- 0.02 | 1256.33 +/- 318.28 |
Linear Regression | 2081.73 +/- 295.63 | 0.3 +/- 0.02 | 2952.52 +/- 468.37 |
Lasso | 2116.38 +/- 341.5 | 0.29 +/- 0.01 | 3057.75 +/- 504.26 |
XGBoost Regressor | 7047.94 +/- 587.59 | 0.95 +/- 0.0 | 7714.01 +/- 688.65 |
Mean Absolute Error (MAE): the average absolute difference between the actual values and the values predicted by the model, in the same unit as Sales.
Mean Absolute Percentage Error (MAPE): similar to MAE, but relative to the actual value. It shows the model's average prediction error as a fraction of actual sales (0.12 means 12%).
Root Mean Squared Error (RMSE): the square root of the mean of the squared errors between actual and predicted values. It penalizes large errors more heavily than MAE.
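The three metrics above can be sketched in plain Python as follows. This is only an illustration of the definitions, not the project's own code; the function names and the toy `actual` and `predicted` lists are made up for the example, and `mape` assumes no actual value is zero.

```python
import math


def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean of the absolute errors relative to the actual values.

    Assumes every actual value is non-zero; rows with zero sales
    would have to be handled separately.
    """
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean of the squared errors."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Toy values, invented for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

print(mae(actual, predicted), mape(actual, predicted), rmse(actual, predicted))
```

Because RMSE squares each error before averaging, the single 20-unit miss in the toy data weighs more in RMSE than in MAE.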
For this first iteration, the model chosen is the Random Forest Regressor, which had the lowest error on all three metrics.
Model Name | MAE | MAPE | RMSE |
---|---|---|---|
Random Forest Regressor | 837.68 +/- 218.12 | 0.12 +/- 0.02 | 1256.33 +/- 318.28 |
With the metrics and parameters used in this first iteration of the cycle, we obtain three scenarios:
Predictions: total sales of $268,804,261 are predicted for the next six weeks.
Best Scenario: based on the prediction, the best scenario totals $270,778,213.92, an increase of 0.73% over the prediction.
Worst Scenario: the worst scenario totals $266,830,308.07, a drop of 0.73% relative to the prediction.
Scenario | Values |
---|---|
Predictions | $268,804,261.00 |
Best Scenario | $270,778,213.92 |
Worst Scenario | $266,830,308.07 |
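The relative difference between each scenario and the prediction can be recomputed from the totals in the table. The sketch below only redoes that arithmetic; the helper `pct_change` is a hypothetical name, and how the best and worst totals themselves were derived is not covered here.

```python
def pct_change(value: float, baseline: float) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100


# Totals taken from the scenario table.
prediction = 268_804_261.00
best = 270_778_213.92
worst = 266_830_308.07

best_pct = pct_change(best, prediction)
worst_pct = pct_change(worst, prediction)

print(f"Best scenario:  {best_pct:+.2f}%")
print(f"Worst scenario: {worst_pct:+.2f}%")
```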
- Data Source: KAGGLE, Kaggle. Rossmann Store Sales: Forecast sales using store, promotion, and competitor data. [S. l.], 2016. Available at: https://www.kaggle.com/c/rossmann-store-sales. Accessed: 12 Nov. 2021.
- CRISP-DM: A metodologia ideal para Ciência de Dados. Internet, 28 Oct. 2020. Available at: https://dnc.group/blog/data-science/metodologia-crisp-dm/. Accessed: 12 Dec. 2021.