Instacart Market Basket Analysis

Chinuteja
8 min read · Jun 15, 2021


Instacart is an American company that provides grocery delivery and pick-up services in the U.S. and Canada. Unlike other e-commerce websites that ship products directly from seller to customer, Instacart lets users buy products from participating vendors, and the shopping is done by a personal shopper.

Task: The task here is to build a model that predicts whether a user will order a product again or not.

Table of contents

  1. Business Problem
  2. Data Overview
  3. Why ML approach is needed?
  4. How to approach the problem?
  5. What is the metric used?
  6. EDA
  7. Feature engineering
  8. Modeling
  9. Feature importance
  10. Deployment using Flask
  11. Conclusion
  12. References

1.Business problem

Instacart is an American company which operates a grocery delivery service in the USA and Canada. Like a regular grocery shop, the platform has repeat customers who purchase from it often. So the main objective of this problem is to predict which products a user will purchase again, based on their previous orders, when he/she visits the platform again.

2.Data Overview

There are 6 CSV files which can be downloaded from the Kaggle competition page. The diagram above gives a brief overview of the datasets we have; let us discuss each one (a small loading sketch follows these descriptions).

aisles.csv: Here aisle_id is the primary key and aisle is the corresponding aisle name.

departments.csv: The primary key is department_id and department is its corresponding department name.

products.csv: Here product_id is the primary key and product_name is its corresponding product name. The same file also has aisle_id and department_id, which say to which aisle and department the corresponding product belongs.

prior_orders.csv: This file consists of all the prior transaction details. order_id and product_id are both foreign keys. The add_to_cart_order column gives the position at which a particular product was added to the basket, and reordered says whether the product had been ordered by the user before.

orders.csv: This file consists of various columns; let us go through them one by one. order_id is the primary key and user_id is a foreign key. order_number represents the sequence number of an order for a user. order_dow represents the day of the week on which a particular order was placed and ranges from 0–6, i.e. Sunday to Saturday. order_hour_of_day represents the hour at which the order was placed and ranges from 0–23, i.e. the 24 hours in a day. days_since_prior_order says after how many days the user ordered again and is capped at 30, i.e. a month. eval_set says whether the given row belongs to the prior, train or test data.

train_order.csv: It is similar to the prior_orders.csv file, and this is the data we use for training the model.
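To make the relationships concrete, here is a minimal loading-and-joining sketch with pandas. The data/ folder and the exact file names are assumptions based on the descriptions above, so adjust them to the files you actually download.

```python
import pandas as pd

# Load the six CSV files (paths and file names assumed to match the descriptions above)
aisles = pd.read_csv("data/aisles.csv")
departments = pd.read_csv("data/departments.csv")
products = pd.read_csv("data/products.csv")
orders = pd.read_csv("data/orders.csv")
prior_orders = pd.read_csv("data/prior_orders.csv")   # prior (order, product) rows
train_orders = pd.read_csv("data/train_order.csv")    # train (order, product) rows

# Attach aisle and department names to each product
products_full = (products
                 .merge(aisles, on="aisle_id", how="left")
                 .merge(departments, on="department_id", how="left"))

# Join prior transactions with order metadata (user, day of week, hour, gap in days)
prior = prior_orders.merge(orders[orders.eval_set == "prior"], on="order_id", how="left")
print(prior.head())
```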

3.Why do we need ML approach?

For example, take a case where a person has ordered 26 items, say a, b, c, …, z. Based on these previous orders, the next time we need to recommend items to him. We could randomly suggest items from his previous orders, but consider this scenario: if he ordered item a 25 times, b 24 times, c 23 times, d 22 times, e 24 times, and x, y, z once each, then next time we should suggest a, b, c, d, e with top priority, not x, y, z. So we cannot randomly pick items, and that is why we need an ML approach for this case study.

For the user to get product recommendations based on his past N orders, we need to observe patterns and generate rules which will give recommendations with high probability. Since we have over 3 million data points, we need to automate this learning process, and with Machine Learning we can achieve this and give probabilistic predictions. Machine Learning works well on large datasets and generates rules from the patterns learned from features.

The other alternative would be a rule-based system, which works best when we know the rules. But it is very difficult to generate rules by manually going over all the data samples and making sense of the patterns, and this cannot guarantee high predictive power.

4.How to approach the problem?

So based on the order history and the user's preferences for products, we need to predict whether a product will be reordered or not. So it is essentially a binary classification problem.
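Concretely, one way to frame the labels (a hedged sketch reusing the dataframes from the loading example above; the label column name is mine): the candidates are all (user, product) pairs seen in a user's prior orders, and the label is 1 if that product appears again in the user's train order.

```python
# Candidate pairs: every product a user has bought in any prior order
candidates = prior[["user_id", "product_id"]].drop_duplicates()

# Products actually bought in each user's train (most recent) order
train = train_orders.merge(orders[orders.eval_set == "train"], on="order_id", how="left")
train_pairs = train[["user_id", "product_id"]].assign(label=1)

# Label is 1 if the candidate product shows up in the train order, else 0
data = candidates.merge(train_pairs, on=["user_id", "product_id"], how="left")
data["label"] = data["label"].fillna(0).astype(int)
print(data["label"].mean())   # fraction of candidate products that were actually reordered
```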

5.What is the metric used?

As it is a classification task, we go for the F1 score. The reason we go for the F1 score is that it balances precision and recall, so we make sure we have a very small number of false positives without ignoring false negatives.
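For reference, here is a tiny scikit-learn sketch of how the F1 score combines precision and recall; the label arrays are made-up toy values.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy ground truth and predictions: 1 = product reordered, 0 = not reordered
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # how many predicted reorders were correct
r = recall_score(y_true, y_pred)      # how many actual reorders we caught
f1 = f1_score(y_true, y_pred)         # harmonic mean: 2 * p * r / (p + r)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```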

6.EDA

Most of the EDA is done using bar graphs so that it is easy to understand, even for a layperson.

Here we can see that bananas are the most ordered product.

So we can see that bananas are the most frequently ordered product, followed by organic strawberries and baby spinach. A user is therefore most likely to reorder bananas, organic strawberries and baby spinach.

As you can see, about 42% of orders are placed in the morning, i.e. between 6 am and 12 pm, and the fewest orders are placed in the early morning, i.e. from 1 am to 5 am. Most people tend to buy products in the morning and afternoon.

More precisely, most of the orders are placed between 10 am and 5 pm. The 10th hour of the day is the peak time, while at the 3rd hour, i.e. 3 am early in the morning, the platform experiences the lowest number of orders.

Almost 20% of orders are placed on day 0, i.e. Sunday. So people are more likely to buy products on Sundays, which makes sense since it is a holiday for the majority of people.

About 50% of orders and reorders are made in the first week of the month. Since most people get their salaries in the first week of the month, this makes sense.

Order vs reorder

Almost 60% of the products ordered are reorders rather than first-time purchases.

For more EDA, you can follow the EDA.ipynb notebook in the git repo mentioned at the end of the blog. I don't want to bore people with more and more EDA plots.

7.Feature engineering

This is the trickiest part of the entire case study. The better the feature engineering we do, the better the results we get. So I am mentioning a few of the engineered features in this blog, with a small pandas sketch after the user-features list below.

a) user features:

user_reorder_rate: here we calculate the reorder rate of a user, i.e. total_reorders / total_orders for each user.

no_unique_products_user: total number of unique products ordered by a user.

avg_cart_size: mean cart size of a user, i.e. the average number of products per visit to the platform.

avg_days_between_orders: average number of days between two consecutive orders.

reorder_rate_f15: user reorder rate for the first 15 days of the month.

reorder_rate_l15: user reorder rate for the last 15 days of the month.
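Here is a minimal pandas sketch of how a couple of these user features could be computed, assuming the prior dataframe from the loading example (prior order-product rows joined with order metadata); the exact definitions used in the notebook may differ slightly.

```python
# prior: one row per (order, product) with user_id, reordered, days_since_prior_order
user_feats = prior.groupby("user_id").agg(
    user_reorder_rate=("reordered", "mean"),            # share of product rows flagged as reordered
    no_unique_products_user=("product_id", "nunique"),  # unique products per user
)

# Average cart size: number of products per order, averaged per user
cart_sizes = prior.groupby(["user_id", "order_id"]).size().rename("cart_size")
user_feats["avg_cart_size"] = cart_sizes.groupby("user_id").mean()

# Average gap between orders (one value per order, so deduplicate order rows first)
order_gaps = prior.drop_duplicates("order_id").groupby("user_id")["days_since_prior_order"]
user_feats["avg_days_between_orders"] = order_gaps.mean()
print(user_feats.head())
```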

b) user product features:

user_product_reorder_rate: number of times the user ordered the product divided by the number of times the user placed an order.

user_product_avg_position: average position of the product in the cart across the orders placed by the user.

c) product features

product_reorder_rate: number of times the product was reordered.

dept_reorder_rate: number of times products from a department were reordered.

aisle_reorder_rate: number of times products from an aisle were reordered.

d) misc.features:

order_hour_of_day_rate: the hour of the day at which most orders are placed.

dow_rate: the day of the week on which most orders are placed.

word2vec: we implemented word2vec-style features for the product name, department name and aisle name using spaCy and PCA. This particular feature is used to boost the F1 score (a rough sketch follows below).
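Below is a rough sketch of how such features can be built with spaCy word vectors and PCA; the en_core_web_md model and the choice of 10 PCA components are assumptions for illustration, not necessarily what the notebook uses.

```python
import numpy as np
import spacy
from sklearn.decomposition import PCA

nlp = spacy.load("en_core_web_md")   # medium English model that ships with word vectors

# Average word vector for each product name (department and aisle names work the same way)
names = products_full["product_name"].astype(str).tolist()
vectors = np.vstack([nlp(name).vector for name in names])

# Compress the 300-dimensional vectors into a handful of dense features with PCA
pca = PCA(n_components=10)
name_feats = pca.fit_transform(vectors)
for i in range(name_feats.shape[1]):
    products_full[f"product_name_vec_{i}"] = name_feats[:, i]
```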

In a similar manner, I generated around 70 features. Describing each feature in the blog would be boring and difficult; for more feature engineering techniques you can follow the featureization_2-Copy1.ipynb file.

8.Modelling

I trained 3 models, i.e. an LGBM Classifier, an SGD Classifier and a Decision Tree, of which the LGBM Classifier gave the best results. Here, better results means a better public and private F1 score once the result .csv file is submitted on Kaggle. You can see the training part in the modelling_part1-Copy1.ipynb file.
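A minimal LightGBM training sketch is shown below; X_train, y_train, X_val and y_val stand for the engineered feature matrices and reorder labels, and the hyperparameters and the 0.2 probability threshold are illustrative values, not the tuned ones from the notebook.

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import f1_score

# X_train, y_train, X_val, y_val: engineered features and reorder labels prepared earlier
model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=64,
    random_state=42,
)
model.fit(X_train, y_train)

# The classes are imbalanced, so tune a probability threshold for F1
# instead of using the default 0.5 cut-off.
probs = model.predict_proba(X_val)[:, 1]
preds = (probs > 0.2).astype(int)
print("validation F1:", f1_score(y_val, preds))
```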

9.Feature Importance

As I mentioned earlier, I generated around 70 features, so I might not need all of them to get the best classification for the final output. So I used the LGBM Classifier to get the important features.

Out of all the features, user_product_since_last_order has the highest importance, followed by user_id and user_product_reorder_rate.

The reason for doing feature importance is latency in production: when the model is deployed and a new query point comes in, calculating all 70 features for that query point is time consuming. By doing feature importance we can pick the top 10–15 features, and with those features we can predict the outcome almost as accurately as with all 70. So, to save time and keep the system low-latency, we go for feature importance.

You can see the feature importance in the modelling_part1-Copy1.ipynb file. As I deployed the model on localhost, I used all the features; if we go for cloud hosting, we can pick only the top 15 features, train a new model on them and use that model for deployment.
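A short sketch of extracting importances from the trained LGBM model and keeping the top 15 features, assuming X_train is a DataFrame whose columns are the ~70 engineered features.

```python
import pandas as pd

# Rank the engineered features by the importance LightGBM assigns them
importances = pd.Series(model.feature_importances_, index=X_train.columns)
top15 = importances.sort_values(ascending=False).head(15)
print(top15)

# A leaner model for deployment could then be retrained on just these columns
X_train_small = X_train[top15.index]
```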

10.Deployment using Flask

As I found the LGBM Classifier works best, I used it for the final production setup, i.e. for deployment on localhost using Flask. You can see the hosting part in the hosting.ipynb file.

home page for the deployment
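For orientation, here is a bare-bones Flask sketch of what a localhost prediction endpoint can look like; the route, the pickled model file name and the JSON contract are placeholders, not the actual code in hosting.ipynb.

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open("lgbm_model.pkl", "rb"))   # the trained LGBM classifier saved earlier

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body with the pre-computed feature values for one (user, product) pair
    features = request.get_json()["features"]
    prob = model.predict_proba([features])[0][1]
    return jsonify({"reorder_probability": float(prob),
                    "will_reorder": bool(prob > 0.2)})

if __name__ == "__main__":
    app.run(port=5000, debug=True)
```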

11.Conclusion

So finally, with this case study, I went through the complete data science pipeline: getting the data, doing EDA, manipulating the data, feature engineering, preparing the data for training, training the model and deploying the model.

12.References

Association Rule — Extracting Knowledge using Market Basket Analysis

To check out the entire work, you can visit my git repo.

You can connect with me on LinkedIn.

