Amazon Beauty Product Recommendation System

Table of Contents

  1. Introduction
  2. Dataset
  3. Methodology
  4. Project Structure
  5. Installation & Setup
  6. Usage
  7. Results & Analysis
  8. Challenges & Solutions
  9. Future Improvements
  10. Contributors
  11. License

1. Introduction

In the era of e-commerce, users are overwhelmed by choices. Recommender Systems solve this by filtering information to provide personalized suggestions.

  • The Problem: Predicting a user's rating for a product they haven't seen yet, based on sparse historical interaction data.
  • The Motivation: Building a recommendation engine from scratch allows for a deep understanding of the underlying mathematics of machine learning algorithms, moving beyond "black box" libraries.
  • The Objective: To implement a Matrix Factorization (SVD) model using only NumPy (no Scikit-learn or Surprise for the core logic) to predict ratings and recommend top beauty products on Amazon.

2. Dataset

  • Source: Amazon Beauty Review Dataset.
  • Original Size: ~2 Million raw ratings.
  • Processed Size: 158,801 ratings (after Data Cleaning and K-Core Filtering).
  • Entities (Post-filtering): 22,363 Users and 12,101 Items.
  • Sparsity: ~99.94% (The matrix is extremely sparse).
  • Characteristics: High sparsity, Long-tail distribution of items.
  • Key Features:
    • UserId: Unique identifier for the customer.
    • ProductId: Unique identifier for the product (ASIN).
    • Rating: Numeric score (1.0 to 5.0).
    • Timestamp: Time of the review (Unix format).
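
The sparsity figure follows directly from the counts above: only 158,801 of the 22,363 × 12,101 possible user–item cells are filled:

$$1 - \frac{158{,}801}{22{,}363 \times 12{,}101} \approx 1 - 5.87 \times 10^{-4} \approx 99.94\%$$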

3. Methodology

This project strictly adheres to a "Pure NumPy" constraint for data processing and modeling optimization.

3.1 Data Preprocessing (Pure NumPy)

  • Cleaning: Handled by src/data_processing.py. Removed duplicates and validated data types.
  • Feature Engineering: Extracted the year from the Unix timestamp to analyze rating trends and user behavior over time.
  • Data Reduction (K-Core Filtering): Applied an iterative 5-core filter, keeping only users and products with at least 5 interactions. This is crucial to reduce sparsity and mitigate the "cold start" problem.
  • Transformation: Implemented integer encoding to map string IDs to contiguous integer indices (0 to N-1) using np.unique(return_inverse=True). (Both the filter and the encoding are sketched below.)
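
A minimal sketch of the iterative k-core filter and the integer encoding, on toy arrays. The data and k=2 threshold are illustrative so the example survives the filter; the project's actual implementation (with k=5) lives in src/data_processing.py:

```python
import numpy as np

# Toy data standing in for the real UserId / ProductId columns.
user_ids = np.array(["u1", "u2", "u1", "u3", "u2", "u1"])
item_ids = np.array(["iA", "iA", "iB", "iB", "iC", "iC"])

def k_core_filter(users, items, k):
    """Iteratively drop rows whose user or item has fewer than k interactions."""
    mask = np.ones(len(users), dtype=bool)
    while True:
        _, u_inv, u_cnt = np.unique(users[mask], return_inverse=True, return_counts=True)
        _, i_inv, i_cnt = np.unique(items[mask], return_inverse=True, return_counts=True)
        keep = (u_cnt[u_inv] >= k) & (i_cnt[i_inv] >= k)
        if keep.all():
            return mask                      # stable: every survivor has >= k rows
        kept_rows = np.flatnonzero(mask)
        mask[kept_rows[~keep]] = False       # dropping rows can invalidate others, so loop

mask = k_core_filter(user_ids, item_ids, k=2)

# Integer encoding: map string IDs to contiguous indices 0..N-1.
unique_users, user_idx = np.unique(user_ids[mask], return_inverse=True)
unique_items, item_idx = np.unique(item_ids[mask], return_inverse=True)
```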

3.2 Modeling Strategy: Matrix Factorization

I predict the rating $\hat{r}_{ui}$ for user $u$ and item $i$ using the Singular Value Decomposition (SVD) approach:

$$\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u^\top \mathbf{q}_i$$

Where:

  • $\mu$: Global average rating.
  • $b_u, b_i$: User and Item bias terms.
  • $\mathbf{p}_u, \mathbf{q}_i$: Latent feature vectors for user and item.
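
For a whole batch of (user, item) pairs this prediction is a single vectorized expression. A minimal sketch with illustrative shapes and untrained parameters (the trained values live inside the MatrixFactorization class):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 1000, 500, 50

mu = 4.2                                       # global average rating
b_u = np.zeros(n_users)                        # user biases
b_i = np.zeros(n_items)                        # item biases
P = rng.normal(0, 0.1, (n_users, n_factors))   # user latent factors
Q = rng.normal(0, 0.1, (n_items, n_factors))   # item latent factors

def predict(users, items):
    """r_hat = mu + b_u + b_i + p_u . q_i, computed for the whole batch at once."""
    return mu + b_u[users] + b_i[items] + np.sum(P[users] * Q[items], axis=1)

r_hat = predict(np.array([0, 1, 2]), np.array([10, 20, 30]))
```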

3.3 Optimization & Algorithm Details

Instead of slow Python loops, the custom MatrixFactorization class in src/models.py uses Vectorized Mini-Batch Gradient Descent:

  • Loss Function: MSE with $L_2$ Regularization.
  • Vectorization: Used broadcasting to compute errors for the whole batch at once.
  • Gradient Aggregation: Used np.add.at (an unbuffered in-place operation) to accumulate gradients for users/items appearing multiple times in a single batch, without loops (see the sketch below).
  • Optimization Features:
    • Learning rate decay.
    • Early stopping to prevent overfitting.
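
A minimal sketch of one such vectorized mini-batch update, reusing the parameter layout from Section 3.2. The function name and defaults are illustrative, not the actual API of src/models.py:

```python
import numpy as np

def sgd_step(users, items, ratings, mu, b_u, b_i, P, Q, lr=0.005, reg=0.02):
    """One mini-batch gradient step for MSE + L2 loss, with no Python loops."""
    # Batch errors via broadcasting: e = r - (mu + b_u + b_i + p_u . q_i).
    err = ratings - (mu + b_u[users] + b_i[items] + np.sum(P[users] * Q[items], axis=1))

    # A user/item can repeat within the batch, so plain fancy-indexed
    # assignment would overwrite; np.add.at sums the duplicates instead.
    grad_bu, grad_bi = np.zeros_like(b_u), np.zeros_like(b_i)
    grad_P, grad_Q = np.zeros_like(P), np.zeros_like(Q)
    np.add.at(grad_bu, users, err)
    np.add.at(grad_bi, items, err)
    np.add.at(grad_P, users, err[:, None] * Q[items])
    np.add.at(grad_Q, items, err[:, None] * P[users])

    # Descend: follow the error gradient, shrink parameters via L2.
    b_u += lr * (grad_bu - reg * b_u)
    b_i += lr * (grad_bi - reg * b_i)
    P += lr * (grad_P - reg * P)
    Q += lr * (grad_Q - reg * Q)
    return np.mean(err ** 2)   # batch MSE, useful for monitoring/early stopping
```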

4. Project Structure

└── 📁Lab02
    ├── 📁data
    │   ├── 📁processed             # Encoded .npy files (X_train, y_train, maps)
    │   └── raw.csv                 # Original CSV data
    ├── 📁notebooks
    │   ├── 01_data_exploration.ipynb   # EDA & insights
    │   ├── 02_preprocessing.ipynb      # Cleaning, K-Core filtering, splitting
    │   └── 03_modeling.ipynb           # Model training, evaluation & recommendation demo
    ├── 📁src
    │   ├── 📁__pycache__           # Cached bytecode files
    │   ├── __init__.py             # Marks the folder as a Python module
    │   ├── data_processing.py      # Core NumPy functions for cleaning/splitting
    │   ├── models.py               # Custom MatrixFactorization class
    │   └── visualization.py        # Plotting helper functions
    ├── .gitattributes              # Git LFS and file-format configuration
    ├── learning_curve_plot.png     # Learning curve plot
    ├── README.md                   # Project documentation
    └── requirements.txt            # Library dependencies

5. Installation & Setup

Prerequisites

  • Python 3.11+
  • Jupyter Notebook

Steps

  1. Clone the repository:
     git clone https://github.com/TrNguyenMQuan/preprocessing_analysis_modeling_amazon_beauty_dataset.git
  2. Navigate to the project directory:
     cd LAB02
  3. Install dependencies:
     pip install -r requirements.txt

6. Usage

Run the notebooks in the following order to replicate the results:

  1. Exploration (01_data_exploration.ipynb):
     • Analyzes the dataset's size, rating distribution, and basic statistics; checks for outliers.
     • Poses and visually answers exploratory questions: "Does popularity imply quality?" and "Are power users stricter?"

  2. Preprocessing (02_preprocessing.ipynb):
     • Cleans the data: handles missing values, outliers, and malformed entries.
     • Performs K-Core filtering and integer encoding.
     • Splits the data into train/test sets (80/20; a split sketch follows this list).
     • Saves processed artifacts to data/processed/.

  3. Modeling (03_modeling.ipynb):
     • Loads the processed data.
     • Trains the custom Matrix Factorization model from scratch.
     • Evaluates performance and generates personalized recommendations for users.
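
For reference, a minimal sketch of a shuffled 80/20 split. The toy arrays are illustrative, and the notebook's actual split strategy (e.g. per-user or time-based) may differ:

```python
import numpy as np

# Illustrative arrays: X holds (user_idx, item_idx) pairs, y the ratings.
X = np.array([[0, 10], [1, 20], [2, 30], [0, 20], [1, 30]])
y = np.array([5.0, 4.0, 3.0, 4.0, 5.0])

rng = np.random.default_rng(42)
perm = rng.permutation(len(y))          # shuffle row indices
cut = int(0.8 * len(y))                 # 80% boundary
train_idx, test_idx = perm[:cut], perm[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```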

7. Results & Analysis

The model was trained for 50 epochs with n_factors=50, learning_rate=0.005, and regularization=0.02.

Quantitative Metrics (Test Set)

  • RMSE (Root Mean Square Error): 1.0913

Analysis: On average, predictions deviate from the true rating by roughly one star. This is a solid baseline for such sparse data.

  • MAE (Mean Absolute Error): 0.8405

(Figure: training learning curve; see learning_curve_plot.png.)
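
Both error metrics are one-liners over the test predictions; a minimal sketch assuming y_true and y_pred arrays:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large misses quadratically."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean Absolute Error: average deviation in stars."""
    return np.mean(np.abs(y_true - y_pred))
```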

Ranking Quality (Top-10 Recommendations)

  • NDCG@10: 0.9886

Insight: This near-perfect score indicates the model is extremely effective at ranking. Relevant items are consistently placed at the top of the recommendation list.

  • Recall@10: 0.8500

Insight: The system successfully retrieves 85% of the items users actually liked.
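
For a single user, the two ranking metrics can be computed as below. This is a sketch assuming binary relevance (an item is either liked or not); the notebook's exact evaluation protocol may differ:

```python
import numpy as np

def ndcg_at_k(ranked_items, relevant, k=10):
    """Binary-relevance NDCG@k: discounted gain of hits vs. the ideal ordering."""
    top = ranked_items[:k]
    gains = np.array([1.0 if item in relevant else 0.0 for item in top])
    discounts = 1.0 / np.log2(np.arange(2, len(top) + 2))   # positions 1..k
    dcg = np.sum(gains * discounts)
    idcg = np.sum(discounts[:min(len(relevant), k)])        # all hits ranked first
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_items, relevant, k=10):
    """Fraction of the user's held-out liked items found in the top k."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

# Example: 10 recommended item indices vs. a held-out "liked" set.
print(ndcg_at_k([5, 9, 2, 7, 1, 4, 8, 3, 6, 0], {5, 2, 11}))
print(recall_at_k([5, 9, 2, 7, 1, 4, 8, 3, 6, 0], {5, 2, 11}))
```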

8. Challenges & Solutions

Challenge 1: Slow Training with Python Loops

  • Issue: Standard Stochastic Gradient Descent (SGD) loops through ratings one by one ($O(N)$), which is extremely slow in Python for millions of ratings.
  • Solution: Implemented Mini-Batch SGD combined with NumPy Vectorization. I calculated dot products for thousands of interactions simultaneously using matrix operations.

Challenge 2: Gradient Updates for Shared Indices

  • Issue: In a batch, the same user or item might appear multiple times. Plain fancy-indexed assignment (grad[users] = ...) overwrites values instead of summing them.
  • Solution: Used the unbuffered NumPy universal-function method np.add.at, which accumulates gradients for duplicate indices in place without a for loop (demonstrated below).
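
The difference is easy to see on a toy batch in which user index 0 appears twice:

```python
import numpy as np

users = np.array([0, 1, 0])          # user 0 occurs twice in the batch
err = np.array([0.5, 1.0, 0.25])

grad = np.zeros(3)
grad[users] += err                   # buffered: the last write to index 0 wins
print(grad)                          # [0.25 1.   0.  ]

grad = np.zeros(3)
np.add.at(grad, users, err)          # unbuffered: duplicate indices are summed
print(grad)                          # [0.75 1.   0.  ]
```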

9. Future Improvements

  • Hybrid Model: Incorporate "Year" and Review Text (Content-Based) to solve the Cold Start problem for new items.

  • Hyperparameter Tuning: Implement Grid Search to find the optimal n_factors and regularization.

  • Bias Modeling: Add time-decay bias to weigh recent reviews more heavily (handling concept drift).

10. Contributors

11. License

This project is licensed under the MIT License.
