9 changes: 1 addition & 8 deletions Dockerfile
@@ -15,16 +15,9 @@ RUN pip install --no-cache-dir --upgrade pip wheel poetry==1.5.1 poetry-dynamic-versioning \
 COPY . RePlay-Accelerated/
 RUN cd RePlay-Accelerated && ./poetry_wrapper.sh install --all-extras
 
-RUN pip install --upgrade torch
+RUN pip install --upgrade torch==2.5.1
 RUN pip install rs_datasets
 RUN pip install Ninja==1.11.1.1
 RUN pip install -U tensorboard
 
-RUN pip3 install triton
-RUN pip3 install bitsandbytes
-RUN sed -i 's/tl\.libdevice\.llrint/tl\.extra\.cuda\.libdevice\.llrint/g' \
-    /opt/conda/lib/python3.11/site-packages/bitsandbytes/triton/quantize_global.py \
-    /opt/conda/lib/python3.11/site-packages/bitsandbytes/triton/quantize_rowwise.py \
-    /opt/conda/lib/python3.11/site-packages/bitsandbytes/triton/quantize_columnwise_and_transpose.py
-
 CMD ["bash"]
231 changes: 40 additions & 191 deletions README.md
@@ -1,214 +1,63 @@
<img src="docs/images/replay_logo_color.svg" height="50"/>
<br>

[![GitHub License](https://img.shields.io/github/license/sb-ai-lab/RePlay)](https://github.com/sb-ai-lab/RePlay/blob/main/LICENSE)
[![PyPI - Version](https://img.shields.io/pypi/v/replay-rec)](https://pypi.org/project/replay-rec)
[![Documentation](https://img.shields.io/badge/docs-latest-brightgreen.svg)](https://sb-ai-lab.github.io/RePlay/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/replay-rec)](https://pypistats.org/packages/replay-rec)
<br>
[![GitHub Workflow Status (with event)](https://img.shields.io/github/actions/workflow/status/sb-ai-lab/replay/main.yml)](https://github.com/sb-ai-lab/RePlay/actions/workflows/main.yml?query=branch%3Amain)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Python Versions](https://img.shields.io/pypi/pyversions/replay-rec.svg?logo=python&logoColor=white)](https://pypi.org/project/replay-rec)
[![Join the community on GitHub Discussions](https://badgen.net/badge/join%20the%20discussion/on%20github/black?icon=github)](https://github.com/sb-ai-lab/RePlay/discussions)


RePlay is an advanced framework designed to facilitate the development and evaluation of recommendation systems. It provides a robust set of tools covering the entire lifecycle of a recommendation system pipeline:

## 🚀 Features:
* **Data Preprocessing and Splitting:** Streamlines the data preparation process for recommendation systems, ensuring optimal data structure and format for efficient processing.
* **Wide Range of Recommendation Models:** Enables building recommendation models, from state-of-the-art approaches to commonly used baselines, and evaluating their performance and quality.
* **Hyperparameter Optimization:** Offers tools for fine-tuning model parameters to achieve the best possible performance, reducing the complexity of the optimization process.
* **Comprehensive Evaluation Metrics:** Incorporates a wide range of evaluation metrics to assess the accuracy and effectiveness of recommendation models.
* **Model Ensemble and Hybridization:** Supports combining predictions from multiple models and creating two-level (ensemble) models to enhance the quality of recommendations.
* **Seamless Mode Transition:** Facilitates easy transition from offline experimentation to online production environments, ensuring scalability and flexibility.

## 💻 Hardware and Environment Compatibility:
1. **Diverse Hardware Support:** Compatible with various hardware configurations, including CPU, GPU, and multi-GPU setups.
2. **Cluster Computing Integration:** Integrates with PySpark for distributed computing, enabling scalability for large-scale recommendation systems.

<a name="toc"></a>
# Table of Contents

* [Quickstart](#quickstart)
* [Installation](#installation)
* [Resources](#examples)
* [Contributing to RePlay](#contributing)
This repository is a fork of the RePlay library containing an implementation of Cut Cross Entropy (CCE) and Cut Cross Entropy with Negative Sampling (CCE-) for recommender systems. The Triton kernels are available in
`kernels/cut_cross_entropy`. An implementation of SASRec with CCE and CCE- can be found in `replay/models/nn/sequential/sasrec/lightning.py`. The experiment pipeline is located in `replay_benchmarks`.
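As a rough sketch of what these kernels compute, the snippet below follows the calling convention of the upstream [ml-cross-entropy](https://github.com/apple/ml-cross-entropy) project that this fork vendors; the import path and exact signature are assumptions, not guarantees about this repository:

```python
# A minimal sketch, assuming the fork keeps the upstream entry point
# linear_cross_entropy(embeddings, classifier, targets); the import path
# below is an assumption based on the kernels/ layout.
import torch

from kernels.cut_cross_entropy import linear_cross_entropy

batch_size, hidden_dim, n_items = 32, 64, 100_000
embeddings = torch.randn(batch_size, hidden_dim, device="cuda", dtype=torch.bfloat16, requires_grad=True)
item_classifier = torch.randn(n_items, hidden_dim, device="cuda", dtype=torch.bfloat16, requires_grad=True)
targets = torch.randint(0, n_items, (batch_size,), device="cuda")

# CCE computes the exact softmax cross-entropy blockwise, without ever
# materializing the full (batch_size, n_items) logit matrix, which is the
# point of the method for large item catalogs.
loss = linear_cross_entropy(embeddings, item_classifier, targets)
loss.backward()
```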

<a name="installation"></a>
## Installation

<a name="quickstart"></a>
## 📈 Quickstart
Installation via the `poetry` package manager is recommended by default:

```bash
pip install replay-rec[all]
```

A PySpark-based model with [fast](https://github.com/sb-ai-lab/RePlay/blob/main/examples/11_sasrec_dataframes_comparison.ipynb) Polars-based data preprocessing:
```python
from polars import from_pandas
from rs_datasets import MovieLens

from replay.data import Dataset, FeatureHint, FeatureInfo, FeatureSchema, FeatureType
from replay.data.dataset_utils import DatasetLabelEncoder
from replay.metrics import HitRate, NDCG, Experiment
from replay.models import ItemKNN
from replay.utils.spark_utils import convert2spark
from replay.utils.session_handler import State
from replay.splitters import RatioSplitter

spark = State().session

ml_1m = MovieLens("1m")
K = 10

# convert data to polars
interactions = from_pandas(ml_1m.ratings)

# data splitting
splitter = RatioSplitter(
    test_size=0.3,
    divide_column="user_id",
    query_column="user_id",
    item_column="item_id",
    timestamp_column="timestamp",
    drop_cold_items=True,
    drop_cold_users=True,
)
train, test = splitter.split(interactions)

# datasets creation
feature_schema = FeatureSchema(
    [
        FeatureInfo(
            column="user_id",
            feature_type=FeatureType.CATEGORICAL,
            feature_hint=FeatureHint.QUERY_ID,
        ),
        FeatureInfo(
            column="item_id",
            feature_type=FeatureType.CATEGORICAL,
            feature_hint=FeatureHint.ITEM_ID,
        ),
        FeatureInfo(
            column="rating",
            feature_type=FeatureType.NUMERICAL,
            feature_hint=FeatureHint.RATING,
        ),
        FeatureInfo(
            column="timestamp",
            feature_type=FeatureType.NUMERICAL,
            feature_hint=FeatureHint.TIMESTAMP,
        ),
    ]
)

train_dataset = Dataset(feature_schema=feature_schema, interactions=train)
test_dataset = Dataset(feature_schema=feature_schema, interactions=test)

# data encoding
encoder = DatasetLabelEncoder()
train_dataset = encoder.fit_transform(train_dataset)
test_dataset = encoder.transform(test_dataset)

# convert datasets to spark
train_dataset.to_spark()
test_dataset.to_spark()

# model training
model = ItemKNN()
model.fit(train_dataset)
pip install --no-cache-dir --upgrade pip wheel poetry==1.5.1 poetry-dynamic-versioning \
&& python -m poetry config virtualenvs.create false
./poetry_wrapper.sh install --all-extras

# model inference
encoded_recs = model.predict(
    dataset=train_dataset,
    k=K,
    queries=test_dataset.query_ids,
    filter_seen_items=True,
)

recs = encoder.query_and_item_id_encoder.inverse_transform(encoded_recs)

# model evaluation
metrics = Experiment(
    [NDCG(K), HitRate(K)],
    test,
    query_column="user_id",
    item_column="item_id",
    rating_column="rating",
)
metrics.add_result("ItemKNN", recs)
print(metrics.results)
```

<a name="installation"></a>
## 🔧 Installation

Installation via `pip` package manager is recommended by default:

After installing RePlay, you need to upgrade `torch` and install additional packages:
```bash
pip install replay-rec
pip install --upgrade torch==2.5.1
pip install --upgrade pytorch-lightning==2.5.1
pip install numpy==1.24.4
pip install rs_datasets
pip install -U tensorboard
```

In this case, the `core` package will be installed without the `PySpark` and `PyTorch` dependencies.
The `experimental` submodule will not be installed either.

To install the `experimental` submodule, specify a version with the `rc0` suffix.
For example:
<a name="examples"></a>
## Usage

To run the SASRec training experiments, use the following command from the `RePlay-Accelerated` directory:
```bash
pip install replay-rec==XX.YY.ZZrc0
python main.py
```

### Extras
Experiment parameters are defined in `.yaml` files located in the `configs` directory.
The dataset name is specified in the `config.yaml` file as follows:

In addition to the core package, several extras are also provided, including:
- `[spark]`: Install PySpark functionality
- `[torch]`: Install PyTorch and Lightning functionality
- `[all]`: `[spark]` `[torch]`

Example:
```bash
# Install core package with PySpark dependency
pip install replay-rec[spark]

# Install package with experimental submodule and PySpark dependency
pip install replay-rec[spark]==XX.YY.ZZrc0
```

Parameters for the experiments are defined in `.yaml` files in the `configs` directory.
The name of the dataset is specified in `config.yaml`:
```
defaults:
  - dataset: <dataset_name>
  - model: sasrec_<dataset_name>
```
The following datasets are available: `movielens_20m`, `beauty`, `30music`, `zvuk`, `megamarket`.

To build RePlay from source, please follow the [instructions](CONTRIBUTING.md#installing-from-the-source).


<a name="examples"></a>
## 📑 Resources

### Usage examples
1. [01_replay_basics.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/01_replay_basics.ipynb) - get started with RePlay.
2. [02_models_comparison.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/02_models_comparison.ipynb) - reproducible models comparison on [MovieLens-1M dataset](https://grouplens.org/datasets/movielens/1m/).
3. [03_features_preprocessing_and_lightFM.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/03_features_preprocessing_and_lightFM.ipynb) - LightFM example with pyspark for feature preprocessing.
4. [04_splitters.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/04_splitters.ipynb) - An example of using RePlay data splitters.
5. [05_feature_generators.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/05_feature_generators.ipynb) - Feature generation with RePlay.
6. [06_item2item_recommendations.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/06_item2item_recommendations.ipynb) - Item to Item recommendations example.
7. [07_filters.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/07_filters.ipynb) - An example of using filters.
8. [08_recommending_for_categories.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/08_recommending_for_categories.ipynb) - An example of recommendation for product categories.
9. [09_sasrec_example.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/09_sasrec_example.ipynb) - An example of using transformer-based SASRec model to generate recommendations.
10. [10_bert4rec_example.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/10_bert4rec_example.ipynb) - An example of using transformer-based BERT4Rec model to generate recommendations.
11. [11_sasrec_dataframes_comparison.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/11_sasrec_dataframes_comparison.ipynb) - Speed comparison of different frameworks (Pandas, Polars, PySpark) for data processing during SASRec training.
12. [12_neural_ts_exp.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/12_neural_ts_exp.ipynb) - An example of using Neural Thompson Sampling bandit model (based on Wide&Deep architecture).
13. [13_personalized_bandit_comparison.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/13_personalized_bandit_comparison.ipynb) - A comparison of context-free and contextual bandit models.
14. [14_hierarchical_recommender.ipynb](https://github.com/sb-ai-lab/RePlay/blob/main/examples/14_hierarchical_recommender.ipynb) - An example of using HierarchicalRecommender with user-disjoint LinUCB.

### Videos and papers
* **Video guides**:
- [Replay for offline recommendations, AI Journey 2021](https://www.youtube.com/watch?v=ejQZKGAG0xs)
Parameters for SASRec are defined in the `sasrec_<dataset_name>.yaml` files.
To use CCE-, specify the following configuration:
```
loss_type: CCE
loss_sample_count: <number_of_negative_samples>
```
If `loss_sample_count: null`, the training will use the standard CCE method.
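For intuition, the switch between the two modes might look like the following sketch. This is illustrative only, reusing `linear_cross_entropy` from the sketch above; it is not the actual code in `replay/models/nn/sequential/sasrec/lightning.py`, and the sampling scheme shown (uniform negatives) is an assumption:

```python
import torch
import torch.nn.functional as F

from kernels.cut_cross_entropy import linear_cross_entropy  # assumed import path


def sasrec_loss(embeddings, item_weights, targets, loss_sample_count=None):
    """Illustrative dispatch between CCE (full catalog) and CCE- (sampled)."""
    if loss_sample_count is None:
        # loss_sample_count: null -> standard CCE over the full item catalog.
        return linear_cross_entropy(embeddings, item_weights, targets)
    # loss_sample_count: k -> CCE-: score only the positive item plus
    # k sampled negatives instead of the whole catalog.
    k = loss_sample_count
    negatives = torch.randint(0, item_weights.size(0), (targets.size(0), k), device=targets.device)
    candidates = torch.cat([targets.unsqueeze(1), negatives], dim=1)  # (batch, 1 + k)
    logits = torch.einsum("bd,bkd->bk", embeddings, item_weights[candidates])
    labels = torch.zeros(targets.size(0), dtype=torch.long, device=targets.device)  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```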

* **Research papers**:
- [RePlay: a Recommendation Framework for Experimentation and Production Use](https://arxiv.org/abs/2409.07272) Alexey Vasilev, Anna Volodkevich, Denis Kulandin, Tatiana Bysheva, Anton Klenitskiy. In The 18th ACM Conference on Recommender Systems (RecSys '24)
- [Turning Dross Into Gold Loss: is BERT4Rec really better than SASRec?](https://doi.org/10.1145/3604915.3610644) Anton Klenitskiy, Alexey Vasilev. In The 17th ACM Conference on Recommender Systems (RecSys '23)
- [The Long Tail of Context: Does it Exist and Matter?](https://arxiv.org/abs/2210.01023). Konstantin Bauman, Alexey Vasilev, Alexander Tuzhilin. In Workshop on Context-Aware Recommender Systems (CARS) (RecSys '22)
- [Multiobjective Evaluation of Reinforcement Learning Based Recommender Systems](https://doi.org/10.1145/3523227.3551485). Alexey Grishanov, Anastasia Ianina, Konstantin Vorontsov. In The 16th ACM Conference on Recommender Systems (RecSys '22)
- [Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently?](https://doi.org/10.1145/3460231.3478848) Yan-Martin Tamm, Rinchin Damdinov, Alexey Vasilev. In The 15th ACM Conference on Recommender Systems (RecSys '21)
To reproduce the CE- grid-search results, we provide a dedicated trainer, available in `replay_benchmarks/grid_params_search_runner.py`. To define the search grid, modify the `replay_benchmarks/configs/mode/hyperparameter_experiment.yaml` file. You also need to change the run mode in the main config (`replay_benchmarks/configs/config.yaml`) from `mode: train` to `mode: hyperparameter_experiment`.

<a name="contributing"></a>
## 💡 Contributing to RePlay
The `hyperparameter_experiment.yaml` configuration is used solely to iterate over `batch_size`, `max_seq_len`, and `loss_sample_count`; other parameters must be changed in their respective configuration files.
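In other words, the runner enumerates the cross product of those three fields. A hypothetical illustration (the grid values here are made up; the real schema lives in `hyperparameter_experiment.yaml`):

```python
from itertools import product

# Hypothetical grid; the actual values belong in
# replay_benchmarks/configs/mode/hyperparameter_experiment.yaml.
grid = {
    "batch_size": [128, 256],
    "max_seq_len": [50, 100],
    "loss_sample_count": [None, 256, 1024],  # None corresponds to standard CCE
}

for combo in product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    print(params)  # each combination corresponds to one training run
```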

We welcome community contributions. For details please check our [contributing guidelines](CONTRIBUTING.md).
## Acknowledgements
This repository is built upon the [RePlay repository](https://github.com/sb-ai-lab/RePlay/tree/main). The Triton kernels are based on the code of [ml-cross-entropy](https://github.com/apple/ml-cross-entropy/tree/main).

9 changes: 9 additions & 0 deletions clean.sh
@@ -0,0 +1,9 @@
#!/bin/bash

rm -rf output_log.log

rm -rf lightning_logs

rm -rf replay_benchmarks/artifacts

rm -rf replay_benchmarks/__pycache__
1 change: 1 addition & 0 deletions kernels/cut_cross_entropy/cce.py
@@ -1,4 +1,5 @@
# Copyright (C) 2024 Apple Inc. All Rights Reserved.
# This software includes modifications
from dataclasses import dataclass
from typing import cast

1 change: 1 addition & 0 deletions kernels/cut_cross_entropy/cce_backward.py
@@ -1,4 +1,5 @@
# Copyright (C) 2024 Apple Inc. All Rights Reserved.
# This software includes modifications
import torch
import triton
import triton.language as tl
1 change: 1 addition & 0 deletions kernels/cut_cross_entropy/cce_lse_forward.py
@@ -1,4 +1,5 @@
# Copyright (C) 2024 Apple Inc. All Rights Reserved.
# This software includes modifications
from typing import Literal, overload

import torch
1 change: 1 addition & 0 deletions kernels/cut_cross_entropy/linear_cross_entropy.py
@@ -1,4 +1,5 @@
# Copyright (C) 2024 Apple Inc. All Rights Reserved.
# This software includes modifications
import enum
import platform
from enum import auto
1 change: 1 addition & 0 deletions kernels/cut_cross_entropy/tl_autotune.py
@@ -1,4 +1,5 @@
# Copyright (C) 2024 Apple Inc. All Rights Reserved.
# This software includes modifications
import functools
import heapq
import os
1 change: 1 addition & 0 deletions kernels/fused_linear_cross_entropy/fused_linear_ce_loss.py
@@ -1,3 +1,4 @@
#Liger-Kernel/src/liger_kernel/ops/fused_linear_cross_entropy.py
import torch
import triton
import triton.language as tl
Empty file.
24 changes: 0 additions & 24 deletions kernels/multinomial_sampling/cuda_multinomial.cpp

This file was deleted.

49 changes: 0 additions & 49 deletions kernels/multinomial_sampling/cuda_multinomial_kernel.cu

This file was deleted.
