Amazon Beauty Product Recommendation System

Table of Contents

  1. Introduction
  2. Dataset
  3. Methodology
  4. Project Structure
  5. Installation & Setup
  6. Usage
  7. Results & Analysis
  8. Challenges & Solutions
  9. Future Improvements
  10. Contributors
  11. License

1. Introduction

In the era of e-commerce, users are overwhelmed by choices. Recommender Systems solve this by filtering information to provide personalized suggestions.

  • The Problem: Predicting a user's rating for a product they haven't seen yet, based on sparse historical interaction data.
  • The Motivation: Building a recommendation engine from scratch allows for a deep understanding of the underlying mathematics of machine learning algorithms, moving beyond "black box" libraries.
  • The Objective: To implement a Matrix Factorization (SVD) model using only NumPy (no Scikit-learn or Surprise for the core logic) to predict ratings and recommend top beauty products on Amazon.

2. Dataset

  • Source: Amazon Beauty Review Dataset.
  • Original Size: ~2 Million raw ratings.
  • Processed Size: 158,801 ratings (after Data Cleaning and K-Core Filtering).
  • Entities (Post-filtering): 22,363 Users and 12,101 Items.
  • Sparsity: ~99.94% (The matrix is extremely sparse).
  • Characteristics: High sparsity, Long-tail distribution of items.
  • Key Features:
    • UserId: Unique identifier for the customer.
    • ProductId: Unique identifier for the product (ASIN).
    • Rating: Numeric score (1.0 to 5.0).
    • Timestamp: Time of the review (Unix format).
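
The sparsity figure follows directly from the counts above: only 158,801 of the 22,363 × 12,101 possible user–item cells are filled:

$$1 - \frac{158{,}801}{22{,}363 \times 12{,}101} \approx 1 - 5.87 \times 10^{-4} \approx 99.94\%$$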

3. Methodology

This project strictly adheres to a "Pure NumPy" constraint for data processing and modeling optimization.

3.1 Data Preprocessing (Pure NumPy)

  • Cleaning: Handled by src/data_processing.py. Removed duplicates and validated data types.
  • Feature Engineering: Extracted the year from the Unix timestamp to analyze rating trends and user behavior over time.
  • Data Reduction (K-Core Filtering): Applied an iterative 5-core filter, keeping only users and products with at least 5 interactions. This is crucial to reduce sparsity and mitigate the "cold start" problem.
  • Transformation: Implemented integer encoding to map string IDs to contiguous integer indices (0 to N-1) using np.unique(return_inverse=True). (Both the filter and the encoding are sketched below.)
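
A minimal sketch of the iterative k-core filter and the integer encoding, on toy arrays. The data and k=2 threshold are illustrative so the example survives the filter; the project's actual implementation (with k=5) lives in src/data_processing.py:

```python
import numpy as np

# Toy data standing in for the real UserId / ProductId columns.
user_ids = np.array(["u1", "u2", "u1", "u3", "u2", "u1"])
item_ids = np.array(["iA", "iA", "iB", "iB", "iC", "iC"])

def k_core_filter(users, items, k):
    """Iteratively drop rows whose user or item has fewer than k interactions."""
    mask = np.ones(len(users), dtype=bool)
    while True:
        _, u_inv, u_cnt = np.unique(users[mask], return_inverse=True, return_counts=True)
        _, i_inv, i_cnt = np.unique(items[mask], return_inverse=True, return_counts=True)
        keep = (u_cnt[u_inv] >= k) & (i_cnt[i_inv] >= k)
        if keep.all():
            return mask                      # stable: every survivor has >= k rows
        kept_rows = np.flatnonzero(mask)
        mask[kept_rows[~keep]] = False       # dropping rows can invalidate others, so loop

mask = k_core_filter(user_ids, item_ids, k=2)

# Integer encoding: map string IDs to contiguous indices 0..N-1.
unique_users, user_idx = np.unique(user_ids[mask], return_inverse=True)
unique_items, item_idx = np.unique(item_ids[mask], return_inverse=True)
```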

3.2 Modeling Strategy: Matrix Factorization

I predict the rating $\hat{r}_{ui}$ for user $u$ and item $i$ using the Singular Value Decomposition (SVD) approach:

$$\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u^\top \mathbf{q}_i$$

Where:

  • $\mu$: Global average rating.
  • $b_u, b_i$: User and Item bias terms.
  • $\mathbf{p}_u, \mathbf{q}_i$: Latent feature vectors for user and item.
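
For a whole batch of (user, item) pairs this prediction is a single vectorized expression. A minimal sketch with illustrative shapes and untrained parameters (the trained values live inside the MatrixFactorization class):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 1000, 500, 50

mu = 4.2                                       # global average rating
b_u = np.zeros(n_users)                        # user biases
b_i = np.zeros(n_items)                        # item biases
P = rng.normal(0, 0.1, (n_users, n_factors))   # user latent factors
Q = rng.normal(0, 0.1, (n_items, n_factors))   # item latent factors

def predict(users, items):
    """r_hat = mu + b_u + b_i + p_u . q_i, computed for the whole batch at once."""
    return mu + b_u[users] + b_i[items] + np.sum(P[users] * Q[items], axis=1)

r_hat = predict(np.array([0, 1, 2]), np.array([10, 20, 30]))
```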

3.3 Optimization & Algorithm Details

Instead of slow Python loops, the custom MatrixFactorization class in src/models.py uses Vectorized Mini-Batch Gradient Descent:

  • Loss Function: MSE with $L_2$ Regularization.
  • Vectorization: Used broadcasting to compute errors for the whole batch at once.
  • Gradient Aggregation: Used np.add.at (an unbuffered in-place operation) to accumulate gradients for users/items appearing multiple times in a single batch, without loops (see the sketch below).
  • Optimization Features:
    • Learning rate decay.
    • Early stopping to prevent overfitting.
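
A minimal sketch of one such vectorized mini-batch update, reusing the parameter layout from Section 3.2. The function name and defaults are illustrative, not the actual API of src/models.py:

```python
import numpy as np

def sgd_step(users, items, ratings, mu, b_u, b_i, P, Q, lr=0.005, reg=0.02):
    """One mini-batch gradient step for MSE + L2 loss, with no Python loops."""
    # Batch errors via broadcasting: e = r - (mu + b_u + b_i + p_u . q_i).
    err = ratings - (mu + b_u[users] + b_i[items] + np.sum(P[users] * Q[items], axis=1))

    # A user/item can repeat within the batch, so plain fancy-indexed
    # assignment would overwrite; np.add.at sums the duplicates instead.
    grad_bu, grad_bi = np.zeros_like(b_u), np.zeros_like(b_i)
    grad_P, grad_Q = np.zeros_like(P), np.zeros_like(Q)
    np.add.at(grad_bu, users, err)
    np.add.at(grad_bi, items, err)
    np.add.at(grad_P, users, err[:, None] * Q[items])
    np.add.at(grad_Q, items, err[:, None] * P[users])

    # Descend: follow the error gradient, shrink parameters via L2.
    b_u += lr * (grad_bu - reg * b_u)
    b_i += lr * (grad_bi - reg * b_i)
    P += lr * (grad_P - reg * P)
    Q += lr * (grad_Q - reg * Q)
    return np.mean(err ** 2)   # batch MSE, useful for monitoring/early stopping
```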

4. Project Structure

└── 📁Lab02
    ├── 📁data
    │   ├── 📁processed             # Encoded .npy files (X_train, y_train, maps)
    │   └── raw.csv                 # Original CSV data
    ├── 📁notebooks
    │   ├── 01_data_exploration.ipynb   # EDA & insights
    │   ├── 02_preprocessing.ipynb      # Cleaning, K-Core filtering, splitting
    │   └── 03_modeling.ipynb           # Model training, evaluation & recommendation demo
    ├── 📁src
    │   ├── 📁__pycache__           # Cached bytecode files
    │   ├── __init__.py             # Marks the folder as a Python module
    │   ├── data_processing.py      # Core NumPy functions for cleaning/splitting
    │   ├── models.py               # Custom MatrixFactorization class
    │   └── visualization.py        # Plotting helper functions
    ├── .gitattributes              # Git LFS and file-format configuration
    ├── learning_curve_plot.png     # Learning curve plot
    ├── README.md                   # Project documentation
    └── requirements.txt            # Library dependencies

5. Installation & Setup

Prerequisites

  • Python 3.11+
  • Jupyter Notebook

Steps

  1. Clone the repository:
     git clone https://github.com/TrNguyenMQuan/preprocessing_analysis_modeling_amazon_beauty_dataset.git
  2. Navigate to the project directory:
     cd LAB02
  3. Install dependencies:
     pip install -r requirements.txt

6. Usage

Run the notebooks in the following order to replicate the results:

  1. Exploration (01_data_exploration.ipynb):
     • Analyzes the dataset's size, rating distribution, and basic statistics; checks for outliers.
     • Poses and visually answers exploratory questions: "Does popularity imply quality?" and "Are power users stricter?"

  2. Preprocessing (02_preprocessing.ipynb):
     • Cleans the data: handles missing values, outliers, and malformed entries.
     • Performs K-Core filtering and integer encoding.
     • Splits the data into train/test sets (80/20; a split sketch follows this list).
     • Saves processed artifacts to data/processed/.

  3. Modeling (03_modeling.ipynb):
     • Loads the processed data.
     • Trains the custom Matrix Factorization model from scratch.
     • Evaluates performance and generates personalized recommendations for users.
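
For reference, a minimal sketch of a shuffled 80/20 split. The toy arrays are illustrative, and the notebook's actual split strategy (e.g. per-user or time-based) may differ:

```python
import numpy as np

# Illustrative arrays: X holds (user_idx, item_idx) pairs, y the ratings.
X = np.array([[0, 10], [1, 20], [2, 30], [0, 20], [1, 30]])
y = np.array([5.0, 4.0, 3.0, 4.0, 5.0])

rng = np.random.default_rng(42)
perm = rng.permutation(len(y))          # shuffle row indices
cut = int(0.8 * len(y))                 # 80% boundary
train_idx, test_idx = perm[:cut], perm[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```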

7. Results & Analysis

The model was trained for 50 epochs with n_factors=50, learning_rate=0.005, and regularization=0.02.

Quantitative Metrics (Test Set)

  • RMSE (Root Mean Square Error): 1.0913

Analysis: On average, predictions deviate from the true rating by roughly one star. This is a solid baseline for such sparse data.

  • MAE (Mean Absolute Error): 0.8405

(Figure: training learning curve; see learning_curve_plot.png.)
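
Both error metrics are one-liners over the test predictions; a minimal sketch assuming y_true and y_pred arrays:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large misses quadratically."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean Absolute Error: average deviation in stars."""
    return np.mean(np.abs(y_true - y_pred))
```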

Ranking Quality (Top-10 Recommendations)

  • NDCG@10: 0.9886

Insight: This near-perfect score indicates the model is extremely effective at ranking. Relevant items are consistently placed at the top of the recommendation list.

  • Recall@10: 0.8500

Insight: The system successfully retrieves 85% of the items users actually liked.
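
For a single user, the two ranking metrics can be computed as below. This is a sketch assuming binary relevance (an item is either liked or not); the notebook's exact evaluation protocol may differ:

```python
import numpy as np

def ndcg_at_k(ranked_items, relevant, k=10):
    """Binary-relevance NDCG@k: discounted gain of hits vs. the ideal ordering."""
    top = ranked_items[:k]
    gains = np.array([1.0 if item in relevant else 0.0 for item in top])
    discounts = 1.0 / np.log2(np.arange(2, len(top) + 2))   # positions 1..k
    dcg = np.sum(gains * discounts)
    idcg = np.sum(discounts[:min(len(relevant), k)])        # all hits ranked first
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_items, relevant, k=10):
    """Fraction of the user's held-out liked items found in the top k."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

# Example: 10 recommended item indices vs. a held-out "liked" set.
print(ndcg_at_k([5, 9, 2, 7, 1, 4, 8, 3, 6, 0], {5, 2, 11}))
print(recall_at_k([5, 9, 2, 7, 1, 4, 8, 3, 6, 0], {5, 2, 11}))
```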

8. Challenges & Solutions

Challenge 1: Slow Training with Python Loops

  • Issue: Standard Stochastic Gradient Descent (SGD) loops through ratings one by one ($O(N)$), which is extremely slow in Python for millions of ratings.
  • Solution: Implemented Mini-Batch SGD combined with NumPy Vectorization. I calculated dot products for thousands of interactions simultaneously using matrix operations.

Challenge 2: Gradient Updates for Shared Indices

  • Issue: In a batch, the same user or item might appear multiple times. Plain fancy-indexed assignment (grad[users] = ...) overwrites values instead of summing them.
  • Solution: Used the unbuffered NumPy universal-function method np.add.at, which accumulates gradients for duplicate indices in place without a for loop (demonstrated below).
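
The difference is easy to see on a toy batch in which user index 0 appears twice:

```python
import numpy as np

users = np.array([0, 1, 0])          # user 0 occurs twice in the batch
err = np.array([0.5, 1.0, 0.25])

grad = np.zeros(3)
grad[users] += err                   # buffered: the last write to index 0 wins
print(grad)                          # [0.25 1.   0.  ]

grad = np.zeros(3)
np.add.at(grad, users, err)          # unbuffered: duplicate indices are summed
print(grad)                          # [0.75 1.   0.  ]
```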

9. Future Improvements

  • Hybrid Model: Incorporate "Year" and Review Text (Content-Based) to solve the Cold Start problem for new items.

  • Hyperparameter Tuning: Implement Grid Search to find the optimal n_factors and regularization.

  • Bias Modeling: Add time-decay bias to weigh recent reviews more heavily (handling concept drift).

10. Contributors

11. License

This project is licensed under the MIT License.
