Scikit-learn Projects

Customer Analytics & Data Mining for E-Commerce (Data Mining & Machine Learning):

🛒 Online Retail Data-Mining & Analytics Platform

Role: Data Scientist / ML Engineer • Tech: Python (Pandas · NumPy · scikit-learn · Prophet · LightGBM · Matplotlib/Seaborn), SQL, Jupyter, Git


🚀 1-Sentence Pitch

I designed an end-to-end analytics pipeline that converts 500K+ e-commerce transactions into segment-driven marketing actions, personalised product recommendations, and six-month sales forecasts, all from a single notebook.


🎯 Problem & Goal

The retailer held raw invoice data but lacked answers to three core questions:

  1. Who are our most valuable / risky customers?

  2. What should we recommend to each shopper right now?

  3. How much will we sell in the next season, and when?

My goal was to extract these answers with minimum engineering overhead and maximum business clarity.


🛠️ Solution Architecture

| Stage | Key Tasks | Algorithms & Tools |
|---|---|---|
| Data Prep | Excel → DataFrame, null & duplicate pruning, outlier detection | Tukey fences · Z-score · pd.to_datetime() |
| Exploratory Analysis (EDA) | KPI dashboards by Country, Invoice, Customer | Seaborn heatmaps, log-scaled histograms |
| RFM + Feature Eng. | Recency, Frequency, Monetary, AvgOrderValue, CLV, LoyaltyScore | Vectorised Pandas · discount factor |
| Customer Clustering | PCA → K-means (k = 5) + VIP outlier bucket | sklearn.decomposition.PCA · KMeans |
| Recommendation Engine | User-based CF (cosine); content-based (Euclidean on price, popularity, brand) | Top-N recommender API |
| Predictive Modeling | GradientBoostingRegressor with GridSearchCV → RMSE ↓ 12 % vs. baseline | LightGBM · XGBoost comparison |
| Time-Series Forecast | Daily revenue aggregation → Prophet: trend + weekly/annual seasonality | RMSE validation · holiday regressor |
| Visual Reporting | Cluster scatterplots, SHAP feature importances, forecast bands | Matplotlib · SHAP |
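
The Tukey-fence outlier check named in the Data Prep stage can be illustrated with a minimal sketch. The helper name `tukey_inlier_mask`, the fence multiplier of 1.5, and the sample prices are illustrative assumptions, not taken from the project notebook.

```python
import pandas as pd


def tukey_inlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)


# Hypothetical unit prices; 100.0 sits far outside the upper fence.
unit_price = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
inliers = tukey_inlier_mask(unit_price)
cleaned = unit_price[inliers]
```

A boolean mask keeps the step vectorised and lets the same rule be applied per column (for example to quantity and unit price) before any rows are dropped.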

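The RFM and clustering stages could be sketched roughly as follows. The column names (`CustomerID`, `InvoiceNo`, `InvoiceDate`, `Quantity`, `UnitPrice`), the log transform, the standardisation step, and the two PCA components are assumptions made for illustration; only PCA followed by K-means with k = 5 comes from the table above. The transactions are synthetic, and the VIP outlier bucket is not covered.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def build_rfm(tx: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Aggregate invoice lines into one RFM row per customer."""
    tx = tx.assign(Revenue=tx["Quantity"] * tx["UnitPrice"])
    rfm = tx.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("Revenue", "sum"),
    )
    rfm["AvgOrderValue"] = rfm["Monetary"] / rfm["Frequency"]
    return rfm


def cluster_customers(rfm: pd.DataFrame, k: int = 5, seed: int = 0) -> pd.Series:
    """Log-scale, standardise, project with PCA, then label with K-means."""
    X = np.log1p(rfm.clip(lower=0))          # tame heavy right tails (assumption)
    X = StandardScaler().fit_transform(X)    # put features on a common scale
    X = PCA(n_components=2, random_state=seed).fit_transform(X)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return pd.Series(labels, index=rfm.index, name="Segment")


# Synthetic invoice lines standing in for the real retail export.
rng = np.random.default_rng(0)
n = 2000
tx = pd.DataFrame({
    "CustomerID": rng.integers(1, 201, n),
    "InvoiceNo": rng.integers(1, 801, n),
    "InvoiceDate": pd.Timestamp("2011-01-01")
    + pd.to_timedelta(rng.integers(0, 330, n), unit="D"),
    "Quantity": rng.integers(1, 20, n),
    "UnitPrice": rng.gamma(2.0, 3.0, n).round(2),
})
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)
rfm = build_rfm(tx, snapshot)
segments = cluster_customers(rfm)
```

Joining `segments` back onto `rfm` gives a per-customer table that can be summarised by segment (mean recency, frequency, and spend) to name each cluster.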
 
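User-based collaborative filtering with cosine similarity, as listed for the Recommendation Engine, can be shown on a toy purchase matrix. This is a sketch under stated assumptions: the function names, the neighbourhood size `k`, and the 4 × 5 matrix are hypothetical and do not reflect the project's actual recommender API.

```python
import numpy as np


def cosine_similarity_matrix(R: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows (users) of R."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # users with no purchases get zero similarity
    unit = R / norms
    return unit @ unit.T


def recommend_top_n(R: np.ndarray, user: int, n: int = 3, k: int = 2) -> list[int]:
    """Rank unseen items by similarity-weighted purchases of the k nearest users."""
    sim = cosine_similarity_matrix(R)[user].copy()
    sim[user] = -np.inf                        # exclude the user themself
    neighbours = np.argsort(sim)[::-1][:k]     # k most similar users
    scores = sim[neighbours] @ R[neighbours]   # weighted sum over neighbours
    scores[R[user] > 0] = -np.inf              # skip items already bought
    ranked = np.argsort(scores)[::-1]
    return [int(i) for i in ranked[:n] if scores[i] > 0]


# Rows are users, columns are items; 1.0 means "has purchased".
R = np.array([
    [1, 1, 0, 0, 0],   # user 0
    [1, 1, 1, 0, 0],   # user 1
    [0, 0, 0, 1, 1],   # user 2
    [1, 0, 1, 0, 1],   # user 3
], dtype=float)
recs = recommend_top_n(R, user=0)  # -> [2, 4]
```

User 0's nearest neighbours are users 1 and 3, so item 2 (bought by both) ranks ahead of item 4 (bought only by user 3), and item 3 is dropped because no neighbour bought it.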

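The Predictive Modeling stage pairs GradientBoostingRegressor with GridSearchCV and compares RMSE against a baseline. A minimal sketch of that comparison is below. The synthetic features, the parameter grid, and the choice of a mean predictor as the baseline are all assumptions; the source does not say which baseline or grid was used, and the numbers produced here are unrelated to the 12 % figure.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data with a mildly non-linear signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))
y = (3.0 * X[:, 0] + np.sin(2.0 * X[:, 1]) + 0.5 * X[:, 2] ** 2
     + rng.normal(scale=0.3, size=400))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Hypothetical grid; a real search would be tuned to the data set.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)


def rmse(model) -> float:
    """Root-mean-squared error on the held-out test split."""
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))


baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
rmse_baseline = rmse(baseline)
rmse_model = rmse(search.best_estimator_)
improvement_pct = 100.0 * (rmse_baseline - rmse_model) / rmse_baseline
```

Scoring the grid search on cross-validated RMSE and reporting the final figure on a separate test split keeps the tuning and the headline metric independent.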

📈 Impact & Insights

  • VIP cohort (2 % of users) drives 31 % of revenue → triggered targeted loyalty e-mails.

  • Return-prone cluster identified (avg. 18 % return rate) → flagged for UX & description fixes.

  • Seasonality: December peak + April dip predicted; procurement aligned 6 weeks earlier.

  • Recommendation lift (offline test): +9 % expected AOV for top-decile customers.
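
A revenue-concentration figure like the VIP cohort share above can be computed with a few lines of Pandas. The sketch below uses a hypothetical helper `top_share` and made-up revenue values, so its output is illustrative only and is not the project's 31 % result.

```python
import numpy as np
import pandas as pd


def top_share(revenue_by_customer: pd.Series, top_frac: float = 0.02) -> float:
    """Fraction of total revenue contributed by the top `top_frac` of customers."""
    n_top = max(1, int(np.ceil(len(revenue_by_customer) * top_frac)))
    top = revenue_by_customer.nlargest(n_top)
    return float(top.sum() / revenue_by_customer.sum())


# Hypothetical data: 2 big spenders and 98 small ones.
revenue = pd.Series([500.0] * 2 + [10.0] * 98)
share = top_share(revenue)  # 1000 / 1980, roughly 0.505
```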


🌟 Why It Matters

The project shows how a small data team can bootstrap a full analytics stack, from cleaning to forecasting, without heavyweight infrastructure. Every notebook cell is reproducible, and business users get clear graphs and CSV-ready outputs for CRM uploads.

Technologies: