A comprehensive framework for learning and building Reinforcement Learning systems. This repository covers theory, algorithms, and modular implementations ranging from tabular methods to modern Deep RL architectures.
- 1. Introduction
- 2. Types of Reinforcement Learning
- 3. Installation
- 4. Usage Examples
- 5. Notebook Index
- 6. Projects
- 7. Core RL Equations
- 8. Contributing
- 9. License
- 10. Connect
## 1. Introduction

Reinforcement Learning (RL) is a computational approach in which an agent interacts with an environment to maximize cumulative reward. Unlike supervised learning, RL relies on trial-and-error interaction and feedback rather than labeled datasets.

RL addresses sequential decision-making problems: the goal is to find an optimal strategy (a policy) that maps states to actions.
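To make the interaction loop concrete, here is a minimal sketch using Gymnasium; the `CartPole-v1` environment and the random policy are illustrative stand-ins for a trained agent:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=42)
total_reward = 0.0

for _ in range(200):
    action = env.action_space.sample()  # random policy; RL learns to replace this
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward              # the cumulative reward the agent maximizes
    if terminated or truncated:
        state, _ = env.reset()

env.close()
print(f"Cumulative reward: {total_reward}")
```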
### Real-World Applications

| Domain | Technology | Examples |
|---|---|---|
| Robotics | Control Theory, Sim2Real | Boston Dynamics Atlas, Robot Arms |
| Gaming | Game Theory, Tree Search | AlphaGo, OpenAI Five, Dota 2 |
| Finance | Time Series, Optimization | Algorithmic Trading, Portfolio Mgmt |
| Autonomous Systems | Path Planning, SLAM | Self-Driving Cars, Drone Navigation |
| LLMs | RLHF | ChatGPT, Claude, Gemini |
## 2. Types of Reinforcement Learning

### Value-Based Methods

- Q-Learning: Off-policy Temporal Difference control (contrasted with SARSA in the sketch after this list).
- SARSA: On-policy Temporal Difference control.
- DQN (Deep Q-Network): Combines Q-learning with deep neural networks.
- Dueling DQN: Separates state value and advantage.
- Double DQN: Reduces overestimation bias.
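The two tabular updates differ only in the bootstrap target, as this minimal NumPy sketch shows (table sizes and hyperparameters are illustrative):

```python
import numpy as np

n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    # Off-policy: bootstrap from the greedy action in s_next
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action actually taken in s_next
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```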
### Policy-Based & Actor-Critic Methods

- REINFORCE: Monte Carlo Policy Gradient (loss sketch after this list).
- A2C/A3C (Actor-Critic): Synchronous/Asynchronous Advantage Actor-Critic.
- PPO (Proximal Policy Optimization): Stable and efficient policy updates.
- DDPG: Deep Deterministic Policy Gradient for continuous control.
- SAC (Soft Actor-Critic): Entropy-regularized RL.
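A minimal sketch of the REINFORCE loss, assuming PyTorch is available (the toy linear policy and episode data are illustrative):

```python
import torch
from torch.distributions import Categorical

policy = torch.nn.Linear(4, 2)                # toy policy network: state -> action logits
states = torch.randn(5, 4)                    # a 5-step episode (illustrative)
actions = torch.tensor([0, 1, 1, 0, 1])
returns = torch.tensor([3.0, 2.5, 2.0, 1.5, 1.0])  # discounted returns G_t

dist = Categorical(logits=policy(states))
loss = -(dist.log_prob(actions) * returns).sum()   # REINFORCE: maximize E[log pi * G]
loss.backward()                               # gradients ascend the policy objective
```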
### Model-Based Methods

- Dyna-Q: Integrated planning, acting, and learning (planning-loop sketch after this list).
- Monte Carlo Tree Search (MCTS): Planning method used in AlphaGo.
- World Models: Learning environment dynamics.
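A minimal sketch of Dyna-Q's integrated acting, model learning, and planning loop (the deterministic tabular model and constants are illustrative):

```python
import random
import numpy as np

Q = np.zeros((16, 4))
model = {}                      # (s, a) -> (r, s'): learned deterministic model
alpha, gamma, n_planning = 0.1, 0.99, 10

def dyna_q_step(s, a, r, s_next):
    # 1) Direct RL: ordinary Q-learning update from real experience
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    # 2) Model learning: remember the observed transition
    model[(s, a)] = (r, s_next)
    # 3) Planning: replay simulated transitions drawn from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])
```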
## 3. Installation

**Option 1: using `uv`**

```bash
# Clone the repository
git clone https://github.com/mohd-faizy/Reinforcement_learning.git
cd Reinforcement_learning

# Create virtual environment
uv venv

# Activate environment
# Linux/Mac:
source .venv/bin/activate
# Windows:
.venv\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt
```
**Option 2: using `pip`**

```bash
# Clone repository
git clone https://github.com/mohd-faizy/Reinforcement_learning.git
cd Reinforcement_learning

# Create virtual environment
python -m venv rl_env

# Activate environment
# Linux/Mac:
source rl_env/bin/activate
# Windows:
rl_env\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
## 4. Usage Examples

### Running a Q-Learning Agent

```python
import gymnasium as gym
from algorithms.q_learning import QLearningAgent

# Initialize environment
env = gym.make('FrozenLake-v1', render_mode='human')

# Initialize and train agent
agent = QLearningAgent(env)
agent.train(episodes=1000)

# Test agent
state, _ = env.reset()
done = False
while not done:
    action = agent.predict(state)
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated  # an episode ends on termination or truncation
    env.render()
```
### Launching Notebooks

```bash
jupyter lab
```

Navigate to `00_RL_intro.ipynb` to start.
## 5. Notebook Index

| No. | Notebook | Topic |
|---|---|---|
| 00 | RL Intro | Fundamentals of Reinforcement Learning |
| 01 | Markov Decisions | MDPs, States, Actions, and Rewards |
| 02 | State & Action | Bellman Equations and Value Functions |
| 03 | Policy & Value | Dynamic Programming methods |
| 04 | Summary | Recap of DP and Value Functions |
| 05 | Monte Carlo | MC Prediction and Control |
| 06 | TD Learning | Temporal Difference Learning |
| 07 | MC & TD | Comparison of MC and TD methods |
| 08 | Model-Free RL | n-step bootstrapping and more |
| 09 | Project: Taxi | Taxi Route Optimization |
| 10 | Deep RL Intro | Neural Networks in RL |
| 11 | DQN Improved | Replay Buffer and Target Networks |
| 12 | Policy Grad | REINFORCE and Actor-Critic |
| 13 | PPO | Proximal Policy Optimization |
| 14 | Project: Stocks | Stock Trading Bot |
| 15 | RLHF Intro | RL from Human Feedback |
| 16 | Feedback | Compiling preference datasets |
| 17 | Reward Model | Training a Reward Model |
| 18 | Metrics | Evaluating RLHF performance |
| 19 | Project: RLHF | RLHF Pipeline |
## 6. Projects

| Project | Description | Tech Stack |
|---|---|---|
| Taxi Route Optimization | Optimizing taxi dispatch using Q-Learning. | NumPy, Gymnasium |
| Stock Trading Bot | Automated trading agent using DRL. | Stable-Baselines3, Pandas |
| RLHF Pipeline | Reward Modeling and fine-tuning. | Transformers, TRL |
## 7. Core RL Equations

The key equations from the notebooks in this repository, organized by topic with sequential numbering.
### From `00_RL_intro`

(1) Discounted Return (expanded): $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$

(2) Discounted Return (compact): $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
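A quick numeric check that the expanded form (1) and the compact form (2) agree (the reward sequence and discount factor are illustrative):

```python
import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 3.0])   # R_{t+1}, R_{t+2}, ...
gamma = 0.9

# (1) expanded: G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
g_expanded = rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + gamma**3 * rewards[3]

# (2) compact: G_t = sum_k gamma^k * R_{t+k+1}
g_compact = sum(gamma**k * r for k, r in enumerate(rewards))

assert np.isclose(g_expanded, g_compact)
```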
### From `01_Markov_Decision_Processes`

(3) Markov Property: $P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, \ldots, S_t)$

(4) State-Value Function: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

(5) State-Value (expanded): $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$

(6) Bellman Equation for $v_\pi$: $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$

(7) Bellman Equation for $v_\pi$ (sum form): $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$

(8) Action-Value Function: $q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

(9) Action-Value (expanded): $q_\pi(s,a) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$

(10) State-Value from Action-Values: $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s,a)$

(11) Bellman Equation for $q_\pi$: $q_\pi(s,a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$

(12) Action-Value from State-Values: $q_\pi(s,a) = \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$

(13) Bellman Optimality Equation for $v_*$: $v_*(s) = \max_a \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma v_*(s')]$

(14) Bellman Optimality Equation for $q_*$: $q_*(s,a) = \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma \max_{a'} q_*(s', a')]$

(15) Optimal Policy from $q_*$: $\pi_*(s) = \arg\max_a q_*(s, a)$
### From `03_Policy_&_Value_Iteration`

(16) Policy Evaluation (Bellman Expectation): $v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma v_k(s')]$

(17) Policy Improvement: $\pi'(s) = \arg\max_a q_\pi(s, a)$

(18) Value Iteration Update: $v_{k+1}(s) = \max_a \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma v_k(s')]$

(19) Bellman Optimality Operator: $(T^* v)(s) = \max_a \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma v(s')]$

(20) Greedy Policy Extraction: $\pi(s) = \arg\max_a \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma v(s')]$
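Equations (18) and (20) as code: value iteration and greedy policy extraction on a toy two-state MDP (the transition probabilities and rewards are illustrative):

```python
import numpy as np

# P[s, a, s'] and R[s, a] for a toy 2-state, 2-action MDP (illustrative numbers)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, V = 0.9, np.zeros(2)

for _ in range(500):
    # (18): V(s) <- max_a sum_s' P(s'|s,a) [R(s,a) + gamma * V(s')]
    V = np.max(R + gamma * (P @ V), axis=1)

# (20): extract the greedy policy from the converged values
policy = np.argmax(R + gamma * (P @ V), axis=1)
print(V, policy)
```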
### From `05_Monte_Carlo`

(21) Return from time $t$: $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$

(22) Action-Value as Expected Return: $q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

(23) MC Incremental Mean Update: $Q(s,a) \leftarrow Q(s,a) + \frac{1}{N(s,a)}\,\big[G_t - Q(s,a)\big]$

(24) Episode Trajectory: $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{T-1}, A_{T-1}, R_T$
### From `06_Temporal_Difference_Learning`

(25) TD Error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$

(26) SARSA Update (On-Policy): $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big]$

(27) Q-Learning Update (Off-Policy): $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\big]$

(28) Optimal Policy (Greedy): $\pi(s) = \arg\max_a Q(s, a)$

(29) Boltzmann (Softmax) Exploration: $\pi(a \mid s) = \dfrac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}$
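Equation (29) as a numerically stable sketch (the Q-values and temperature are illustrative):

```python
import numpy as np

def boltzmann_probs(q_values, tau=1.0):
    # (29): pi(a|s) = exp(Q(s,a)/tau) / sum_a' exp(Q(s,a')/tau)
    z = q_values / tau
    z = z - z.max()               # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 0.5])
action = np.random.choice(len(q), p=boltzmann_probs(q, tau=0.5))
```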
### From `07_Monte_Carlo_&_TD`

(30) SARSA (alternative form): $Q(s,a) \leftarrow Q(s,a) + \alpha\,\big[r + \gamma Q(s', a') - Q(s,a)\big]$

(31) Q-Learning (alternative form): $Q(s,a) \leftarrow Q(s,a) + \alpha\,\big[r + \gamma \max_{a'} Q(s', a') - Q(s,a)\big]$
### From `08_Adv_Model_Free_RL`

(32) Expected SARSA: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\big]$

(33) Double Q-Learning Update (Network 1): $Q_1(s,a) \leftarrow Q_1(s,a) + \alpha\,\big[r + \gamma\, Q_2\big(s', \arg\max_{a'} Q_1(s', a')\big) - Q_1(s,a)\big]$

(34) Double Q-Learning Update (Network 2): $Q_2(s,a) \leftarrow Q_2(s,a) + \alpha\,\big[r + \gamma\, Q_1\big(s', \arg\max_{a'} Q_2(s', a')\big) - Q_2(s,a)\big]$
### From `10_Intro_Deep_RL`

(35) Deterministic Policy: $a = \pi(s)$

(36) Stochastic Policy: $a \sim \pi(\cdot \mid s)$

(37) Trajectory: $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$

(38) Trajectory Return: $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$

(39) Greedy Action Selection: $a = \arg\max_{a'} Q(s, a')$

(40) Bellman Equation (Q-form): $Q(s,a) = \mathbb{E}\big[r + \gamma \max_{a'} Q(s', a')\big]$

(41) DQN Loss: $L(\theta) = \mathbb{E}\big[\big(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\big)^2\big]$
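Equation (41) as a PyTorch sketch, with a frozen copy of the network playing the role of $\theta^-$ (the network shape and batch data are illustrative):

```python
import torch
import torch.nn as nn

q_net = nn.Linear(4, 2)                         # toy Q-network: state -> Q(s, .)
target_net = nn.Linear(4, 2)
target_net.load_state_dict(q_net.state_dict())  # theta^- <- theta

s = torch.randn(32, 4); a = torch.randint(0, 2, (32,))
r = torch.randn(32); s_next = torch.randn(32, 4)
done = torch.zeros(32); gamma = 0.99

with torch.no_grad():                           # target uses frozen parameters theta^-
    target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_sa, target)     # (41): E[(y - Q_theta(s,a))^2]
loss.backward()
```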
### From `11_DQN_with_Exp_Replay_&_improvements`

(42) Epsilon Decay Schedule: $\epsilon \leftarrow \max(\epsilon_{\min},\ \epsilon \cdot \lambda_{\text{decay}})$

(43) Target Network Update: $\theta^- \leftarrow \theta$ (every $C$ steps)

(44) Prioritized Replay — Priority: $p_i = |\delta_i| + \varepsilon$

(45) Prioritized Replay — Sampling Probability: $P(i) = \dfrac{p_i^\alpha}{\sum_k p_k^\alpha}$

(46) Importance-Sampling Weight: $w_i = \left(\dfrac{1}{N} \cdot \dfrac{1}{P(i)}\right)^\beta$
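Equations (44) through (46) in NumPy (the TD errors and the $\alpha$, $\beta$ constants are illustrative):

```python
import numpy as np

td_errors = np.array([0.5, -1.2, 0.1, 2.0])
alpha_pr, beta, eps = 0.6, 0.4, 1e-5
N = len(td_errors)

p = np.abs(td_errors) + eps              # (44): p_i = |delta_i| + eps
P = p**alpha_pr / np.sum(p**alpha_pr)    # (45): P(i) = p_i^alpha / sum_k p_k^alpha
w = (1.0 / (N * P))**beta                # (46): importance-sampling weights
w /= w.max()                             # normalize by max weight (common practice)
```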
### From `12_Intro_Policy_Grad_&_Actor_Critic`

(47) Policy Objective: $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$

(48) Policy Gradient Theorem: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, q_{\pi_\theta}(s, a)\big]$

(49) REINFORCE Loss: $L(\theta) = -\sum_t \log \pi_\theta(a_t \mid s_t)\, G_t$

(50) Critic Loss (TD Error): $L(w) = \big(r + \gamma V_w(s') - V_w(s)\big)^2$

(51) Actor Loss (Advantage): $L(\theta) = -\log \pi_\theta(a \mid s)\, \hat{A}(s, a)$
### From `13_PPO`

(52) Probability Ratio: $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$

(53) Clipped Ratio: $\text{clip}\big(r_t(\theta),\ 1 - \epsilon,\ 1 + \epsilon\big)$

(54) PPO Clipped Objective: $L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\,\hat{A}_t\big)\Big]$

(55) Entropy Bonus: $H\big(\pi_\theta(\cdot \mid s)\big) = -\sum_a \pi_\theta(a \mid s) \log \pi_\theta(a \mid s)$
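Equations (52) through (54) computed from stored log-probabilities, assuming PyTorch (the inputs are illustrative):

```python
import torch

log_probs_new = torch.tensor([-0.9, -1.1, -0.4])
log_probs_old = torch.tensor([-1.0, -1.0, -1.0])
advantages = torch.tensor([1.0, -0.5, 2.0])
clip_eps = 0.2

ratio = torch.exp(log_probs_new - log_probs_old)           # (52)
clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)   # (53)
# (54): take the pessimistic (min) of the unclipped and clipped surrogates
objective = torch.min(ratio * advantages, clipped * advantages).mean()
loss = -objective                                          # ascend via gradient descent
```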
### From `15_Intro_RLHF`

(56) RLHF Core Idea: $\max_{\pi_\theta}\ \mathbb{E}_{x \sim D,\, y \sim \pi_\theta}\big[r_\phi(x, y)\big] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$
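A per-sample view of (56) as commonly implemented in PPO-based RLHF: the reward-model score minus a KL penalty toward the reference policy (all numbers are illustrative):

```python
rm_score = 1.8        # r_phi(x, y) from the reward model
logp_policy = -42.0   # log pi_theta(y | x)
logp_ref = -40.5      # log pi_ref(y | x)
beta = 0.1            # KL penalty coefficient

# (56): reward = r_phi(x, y) - beta * [log pi_theta(y|x) - log pi_ref(y|x)]
shaped_reward = rm_score - beta * (logp_policy - logp_ref)
print(shaped_reward)
```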
### From `Q_vs_DQN`

(57) Tabular Q-Learning: $Q(s,a) \leftarrow Q(s,a) + \alpha\,\big[r + \gamma \max_{a'} Q(s', a') - Q(s,a)\big]$

(58) DQN Function Approximation: $Q(s, a) \approx Q_\theta(s, a)$

(59) DQN Loss with Target Network: $L(\theta) = \mathbb{E}\big[\big(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\big)^2\big]$
### Symbol Reference

| Symbol | Description |
|---|---|
| $s, s'$ | Current state, next state |
| $a, a'$ | Current action, next action |
| $r$ | Reward |
| $\gamma$ | Discount factor |
| $\alpha$ | Learning rate |
| $\pi(a \mid s)$ | Policy — probability of action $a$ in state $s$ |
| $\pi_\theta$ | Parameterized policy |
| $v_\pi(s)$ | State-value function under policy $\pi$ |
| $q_\pi(s, a)$ | Action-value function under policy $\pi$ |
| $v_*(s)$ | Optimal state-value function |
| $q_*(s, a)$ | Optimal action-value function |
| $G_t$ | Return: discounted sum of future rewards |
| $\delta_t$ | TD error |
| $\hat{A}_t$ | Advantage estimate |
| $\tau$ | Trajectory or temperature |
| $\epsilon$ | Exploration rate (ε-greedy) or clipping parameter |
| $\theta, \theta^-$ | Network parameters, target network parameters |
| $\beta$ | KL penalty coefficient |
| $\sigma(x)$ | Sigmoid: $\sigma(x) = 1 / (1 + e^{-x})$ |
## 8. Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## 9. License

This project is licensed under the MIT License - see the LICENSE file for details.

