
Reinforcement Learning


A comprehensive framework for learning and building Reinforcement Learning systems. This repository covers theory, algorithms, and modular implementations ranging from tabular methods to modern Deep RL architectures.

RL Overview

Table of Contents

  1. Introduction
  2. Types of Reinforcement Learning
  3. Installation
  4. Usage Examples
  5. Notebook Index
  6. Projects
  7. Core RL Equations
  8. Contributing
  9. License
  10. Connect

1. Introduction

Reinforcement Learning (RL) is a computational approach where an agent interacts with an environment to maximize cumulative reward. Unlike supervised learning, RL relies on trial-and-error interaction and feedback rather than labeled datasets.

Core Principles

RL addresses sequential decision-making problems: the goal is to find an optimal strategy (policy) that maps states to actions. Representative application domains:

| Domain | Technology | Examples |
|---|---|---|
| Robotics | Control Theory, Sim2Real | Boston Dynamics Atlas, Robot Arms |
| Gaming | Game Theory, Tree Search | AlphaGo, OpenAI Five, Dota 2 |
| Finance | Time Series, Optimization | Algorithmic Trading, Portfolio Mgmt |
| Autonomous Systems | Path Planning, SLAM | Self-Driving Cars, Drone Navigation |
| LLMs | RLHF | ChatGPT, Claude, Gemini |

2. Types of Reinforcement Learning

Model-Free RL (Tabular & Deep)

  • Q-Learning: Off-policy Temporal Difference control.
  • SARSA: On-policy Temporal Difference control.
  • DQN (Deep Q-Network): Combines Q-learning with deep neural networks.
  • Dueling DQN: Separates state value and advantage.
  • Double DQN: Reduces overestimation bias.
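
The Q-Learning and SARSA bullets above differ only in how they bootstrap: Q-Learning backs up from the greedy action in the next state, SARSA from the action the policy actually took. A minimal tabular sketch of both update rules; the state/action counts, alpha, and gamma are illustrative assumptions, not code from this repository:

import numpy as np

n_states, n_actions = 16, 4          # e.g. a FrozenLake-sized problem (assumed)
Q = np.zeros((n_states, n_actions))  # tabular action-value estimates
alpha, gamma = 0.1, 0.99             # learning rate, discount factor

def q_learning_update(s, a, r, s_next):
    # Off-policy: bootstrap from the greedy action in the next state
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action actually taken in the next state
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])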

Policy Gradient Methods

  • REINFORCE: Monte Carlo Policy Gradient.
  • A2C/A3C (Actor-Critic): Synchronous/Asynchronous Advantage Actor-Critic.
  • PPO (Proximal Policy Optimization): Stable and efficient policy updates.
  • DDPG: Deep Deterministic Policy Gradient for continuous control.
  • SAC (Soft Actor-Critic): Entropy-regularized RL.

Model-Based RL

  • Dyna-Q: Integrated planning, acting, and learning.
  • Monte Carlo Tree Search (MCTS): Planning method used in AlphaGo.
  • World Models: Learning environment dynamics.

3. Installation

Using UV (Recommended)

# Clone the repository
git clone https://github.com/mohd-faizy/Reinforcement_learning.git
cd Reinforcement_learning

# Create virtual environment
uv venv

# Activate environment
# Linux/Mac:
source .venv/bin/activate
# Windows:
.venv\Scripts\activate

# Install dependencies
uv pip install -r requirements.txt

Alternative Installation (pip)

# Clone repository
git clone https://github.com/mohd-faizy/Reinforcement_learning.git
cd Reinforcement_learning

# Create virtual environment
python -m venv rl_env

# Activate environment
# Linux/Mac:
source rl_env/bin/activate
# Windows:
rl_env\Scripts\activate

# Install dependencies
pip install -r requirements.txt

4. Usage Examples

Running a Q-Learning Agent

import gymnasium as gym
from algorithms.q_learning import QLearningAgent

# Initialize Environment
env = gym.make('FrozenLake-v1', render_mode='human')

# Initialize and Train Agent
agent = QLearningAgent(env)
agent.train(episodes=1000)

# Test Agent
state, _ = env.reset()
done = False
while not done:
    action = agent.predict(state)
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated  # Gymnasium splits episode end into two flags
    env.render()
env.close()

Launching Notebooks

jupyter lab

Navigate to 00_RL_intro.ipynb to start.


5. Notebook Index

| No. | Notebook | Topic |
|---|---|---|
| 00 | RL Intro | Fundamentals of Reinforcement Learning |
| 01 | Markov Decisions | MDPs, States, Actions, and Rewards |
| 02 | State & Action | Bellman Equations and Value Functions |
| 03 | Policy & Value | Dynamic Programming methods |
| 04 | Summary | Recap of DP and Value Functions |
| 05 | Monte Carlo | MC Prediction and Control |
| 06 | TD Learning | Temporal Difference Learning |
| 07 | MC & TD | Comparison of MC and TD methods |
| 08 | Model-Free RL | n-step bootstrapping and more |
| 09 | Project: Taxi | Taxi Route Optimization |
| 10 | Deep RL Intro | Neural Networks in RL |
| 11 | DQN | Improved Replay Buffer and Target Networks |
| 12 | Policy Grad | REINFORCE and Actor-Critic |
| 13 | PPO | Proximal Policy Optimization |
| 14 | Project: Stocks | Stock Trading Bot |
| 15 | RLHF Intro | RL from Human Feedback |
| 16 | Feedback | Compiling preference datasets |
| 17 | Reward Model | Training a Reward Model |
| 18 | Metrics | Evaluating RLHF performance |
| 19 | Project: RLHF | RLHF Pipeline |

6. Projects

| Project | Description | Tech Stack |
|---|---|---|
| Taxi Route Optimization | Optimizing taxi dispatch using Q-Learning | NumPy, Gymnasium |
| Stock Trading Bot | Automated trading agent using DRL | Stable-Baselines3, Pandas |
| RLHF Pipeline | Reward Modeling and fine-tuning | Transformers, TRL |

7. Core RL Equations

All equations below are taken directly from the notebooks in this repository, organized by topic with sequential numbering.


7.1 Foundations & Returns

From 00_RL_intro

(1) Discounted Return (expanded)

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$$

(2) Discounted Return (compact)

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
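
A quick numeric check of equations (1) and (2), with a made-up three-step reward sequence:

gamma = 0.9
rewards = [1.0, 0.0, 2.0]   # R_{t+1}, R_{t+2}, R_{t+3} (illustrative values)

# G_t = sum_k gamma^k * R_{t+k+1}
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)   # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62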


7.2 Markov Decision Processes

From 01_Markov_Decision_Processes

(3) Markov Property

$$P(S_{t+1} = s' | S_t = s, A_t = a, S_{t-1}, A_{t-1}, ..., S_0, A_0) = P(S_{t+1} = s' | S_t = s, A_t = a)$$


7.3 State & Action Value Functions

From 02_State_&_Action_value

(4) State-Value Function

$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]$$

(5) State-Value (expanded)

$$V^{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]$$

(6) Bellman Equation for $V^{\pi}$ (stochastic policy)

$$V^{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^{\pi}(s') \right]$$

(7) Bellman Equation for $V^{\pi}$ (deterministic policy)

$$V^{\pi}(s) = \sum_{s'} P(s'|s,\pi(s)) \left[ R(s,\pi(s),s') + \gamma V^{\pi}(s') \right]$$

(8) Action-Value Function

$$Q^\pi(s,a) = \mathbb{E}_\pi\big[G_t \mid S_t=s, A_t=a\big]$$

(9) $Q$ in terms of $V$ (single successor)

$$Q^\pi(s,a) = r_a + \gamma V^\pi(s')$$

(10) $Q$ in terms of $V$ (stochastic transitions)

$$Q^\pi(s,a) = \sum_{s'} P(s'\mid s,a)\,\big[ R(s,a,s') + \gamma V^\pi(s')\big]$$

(11) Bellman Equation for $Q^{\pi}$

$$Q^\pi(s,a) = \sum_{s'} P(s'\mid s,a)\Big[ R(s,a,s') + \gamma \sum_{a'} \pi(a'\mid s')\,Q^\pi(s',a')\Big]$$

(12) $V$ in terms of $Q$

$$V^\pi(s) = \sum_a \pi(a\mid s)\,Q^\pi(s,a)$$

(13) Bellman Optimality Equation for $V^\ast$

$$V^\ast(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\ast(s')]$$

(14) Bellman Optimality Equation for $Q^\ast$

$$Q^\ast(s,a) = \sum_{s'} P(s'\mid s,a)\Big[ R(s,a,s') + \gamma \max_{a'} Q^\ast(s',a')\Big]$$

(15) Optimal Policy from $Q^\ast$

$$\pi^\ast(s) = \arg\max_a Q^\ast(s,a)$$


7.4 Dynamic Programming

From 03_Policy_&_Value_Iteration

(16) Policy Evaluation (Bellman Expectation)

$$V^{\pi}(s) = \sum_a \pi(a|s) \sum_{s',r} P(s'|s,a)[r + \gamma V^{\pi}(s')]$$

(17) Policy Improvement

$$\pi'(s) = \arg\max_a Q^{\pi}(s,a)$$

(18) Value Iteration Update

$$V_{k+1}(s) = \max_a \sum_{s',r} P(s'|s,a)[r + \gamma V_k(s')]$$

(19) Bellman Optimality Operator

$$(T^\ast V)(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V(s')]$$

(20) Greedy Policy Extraction

$$\pi(s) = \arg\max_a \sum_{s',r} P(s'|s,a)[r + \gamma V(s')]$$
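
Putting (18) and (20) together, a minimal NumPy sketch of value iteration; the dense P[s, a, s'] / R[s, a, s'] array layout is an illustrative assumption, not an interface from these notebooks:

import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    # P[s, a, s'] = transition probability, R[s, a, s'] = reward
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = np.einsum('sak,sak->sa', P, R + gamma * V)
        V_new = Q.max(axis=1)               # value-iteration update, eq (18)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    policy = Q.argmax(axis=1)               # greedy policy extraction, eq (20)
    return V, policy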


7.5 Monte Carlo Methods

From 05_Monte_Carlo_Methods

(21) Return from time $t$

$$G_t = \sum_{k=t+1}^{T} \gamma^{k-(t+1)} R_k$$

(22) Action-Value as Expected Return

$$Q(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$$

(23) MC Action-Value Update (mean of sampled returns)

$$Q(s, a) \gets \frac{1}{N(s, a)} \sum_{i=1}^{N(s, a)} G_i(s, a)$$

(24) Episode Trajectory

$$(S_0, A_0, R_1), (S_1, A_1, R_2), \ldots, (S_{T-1}, A_{T-1}, R_T)$$
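
An every-visit Monte Carlo sketch that walks one episode backward, accumulating the return (21) and folding it into a running mean of Q as in (22)-(23); the data structures are illustrative:

from collections import defaultdict

gamma = 0.99
Q = defaultdict(float)   # mean sampled return per (state, action)
N = defaultdict(int)     # number of returns averaged per (state, action)

def mc_update(episode):
    # episode = [(S_0, A_0, R_1), ..., (S_{T-1}, A_{T-1}, R_T)], as in (24)
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                           # return from time t, eq (21)
        N[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]    # running mean of returns, eq (23)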


7.6 Temporal Difference Learning

From 06_Temporal_Difference_Learning

(25) TD Error

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

(26) SARSA Update (On-Policy)

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$$

(27) Q-Learning Update (Off-Policy)

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)]$$

(28) Optimal Policy (Greedy)

$$\pi^\ast(s) = \arg\max_a Q^\ast(s,a)$$

(29) Boltzmann (Softmax) Exploration

$$\pi(a|s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}$$
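
A small helper implementing the Boltzmann distribution (29); q_values and tau are illustrative:

import numpy as np

def boltzmann_policy(q_values, tau=1.0):
    # pi(a|s) proportional to exp(Q(s,a)/tau); subtracting the max keeps exp() stable
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()
    probs = np.exp(prefs)
    return probs / probs.sum()

probs = boltzmann_policy([1.0, 2.0, 0.5], tau=0.5)
action = np.random.choice(len(probs), p=probs)   # sample an action from pi(.|s)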


7.7 MC vs TD Comparison

From 07_Monte_Carlo_&_TD

(30) SARSA (alternative form)

$$Q(s, a) = (1 - \alpha)Q(s, a) + \alpha [r + \gamma Q(s', a')]$$

(31) Q-Learning (alternative form)

$$Q(s, a) = (1 - \alpha)Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a')]$$


7.8 Advanced Model-Free RL

From 08_Adv_Model_Free_RL

(32) Expected SARSA

$$\mathbb{E}\big[Q(s', A)\big] = \sum_{a \in \mathcal{A}} \pi(a|s') \cdot Q(s', a)$$

(33) Double Q-Learning Update (Network 1)

$$Q_0(s, a) = (1 - \alpha)Q_0(s, a) + \alpha[r + \gamma Q_1(s', \text{argmax}_a Q_0(s', a))]$$

(34) Double Q-Learning Update (Network 2)

$$Q_1(s, a) = (1 - \alpha)Q_1(s, a) + \alpha[r + \gamma Q_0(s', \text{argmax}_a Q_1(s', a))]$$
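
A tabular sketch of the double Q-learning updates (33) and (34): a coin flip decides which table selects the greedy action and which one evaluates it; all sizes and constants are illustrative:

import numpy as np

n_states, n_actions = 16, 4
Q0 = np.zeros((n_states, n_actions))
Q1 = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def double_q_update(s, a, r, s_next):
    if np.random.rand() < 0.5:
        # Q0 selects the action, Q1 evaluates it, eq (33)
        a_star = Q0[s_next].argmax()
        Q0[s, a] += alpha * (r + gamma * Q1[s_next, a_star] - Q0[s, a])
    else:
        # Q1 selects the action, Q0 evaluates it, eq (34)
        a_star = Q1[s_next].argmax()
        Q1[s, a] += alpha * (r + gamma * Q0[s_next, a_star] - Q1[s, a])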


7.9 Deep RL Foundations

From 10_Intro_Deep_RL

(35) Deterministic Policy

$$a_t = \pi(s_t)$$

(36) Stochastic Policy

$$\pi(a|s) = P(A_t=a | S_t=s)$$

(37) Trajectory

$$\tau = ((s_0, a_0), (s_1, a_1), ..., (s_T, a_T))$$

(38) Trajectory Return

$$R_\tau = \sum_{t=0}^{T} \gamma^t r_t$$

(39) Greedy Action Selection

$$\pi(s_t) = \arg \max_a Q(s_t, a)$$

(40) Bellman Equation (Q-form)

$$Q(s_t, a_t) = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)$$

(41) DQN Loss

$$\text{Loss} = \left( Q(s_t, a_t) - (r_{t+1} + \gamma \max_a Q(s_{t+1}, a)) \right)^2$$


7.10 DQN with Experience Replay & Improvements

From 11_DQN_with_Exp_Replay_&_improvements

(42) Epsilon Decay Schedule

$$\epsilon = \epsilon_{\text{end}} + (\epsilon_{\text{start}} - \epsilon_{\text{end}}) \cdot e^{-\text{step}/\text{decay}}$$
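
A small helper computing the schedule in (42); the start/end/decay constants are illustrative defaults, not values taken from the notebook:

import math

def epsilon_at(step, start=1.0, end=0.05, decay=1000):
    # Anneals epsilon exponentially from `start` toward `end` as steps grow
    return end + (start - end) * math.exp(-step / decay)

print(epsilon_at(0))      # 1.0
print(epsilon_at(1000))   # 0.05 + 0.95/e ≈ 0.40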

(43) Target Network Update

$$\text{Target} = r + \gamma \max_{a} \hat{Q}(s_{t+1}, a)$$

(44) Prioritized Replay — Priority

$$p_i = |\delta_i| + \epsilon$$

(45) Prioritized Replay — Sampling Probability

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

(46) Importance-Sampling Weight

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^\beta$$
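
A NumPy sketch of (44)-(46): priorities from TD errors, sampling probabilities, and importance-sampling weights. Normalizing the weights by their maximum is a common stabilization choice added here, and every constant is illustrative:

import numpy as np

def prioritized_sample(td_errors, batch_size=4, alpha=0.6, beta=0.4, eps=1e-3):
    td_errors = np.asarray(td_errors, dtype=float)
    p = np.abs(td_errors) + eps                 # priority, eq (44)
    P = p**alpha / np.sum(p**alpha)             # sampling probability, eq (45)
    idx = np.random.choice(len(P), size=batch_size, p=P)
    w = (1.0 / (len(P) * P[idx]))**beta         # importance-sampling weight, eq (46)
    return idx, w / w.max()                     # scale weights to at most 1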


7.11 Policy Gradient & Actor-Critic

From 12_Intro_Policy_Grad_&_Actor_Critic

(47) Policy Objective

$$J(\pi_\theta) = E_{\tau \sim \pi_\theta} [ R_\tau ]$$

(48) Policy Gradient Theorem

$$\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta} \left( R_\tau \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \right)$$

(49) REINFORCE Loss

$$\mathcal{L}(\theta) = - R_\tau \sum_{t=0}^T \log \pi_\theta(a_t|s_t)$$

(50) Critic Loss (TD Error)

$$L_c(\theta_c) = ( (r_t + \gamma V_{\theta_c}(s_{t+1})) - V_{\theta_c}(s_t) )^2$$

(51) Actor Loss (Advantage)

$$L_a(\theta) = - \log \pi_\theta(a_t|s_t) \times \underbrace{( (r_t + \gamma V(s_{t+1})) - V(s_t) )}_{\text{TD Error / Advantage}}$$
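
A PyTorch-flavored sketch of the critic loss (50) and advantage-weighted actor loss (51) for a single transition; actor is assumed to return action probabilities and critic a state value, both placeholders rather than modules defined in this repository:

import torch

def actor_critic_losses(actor, critic, s, a, r, s_next, gamma=0.99):
    with torch.no_grad():
        td_target = r + gamma * critic(s_next)        # bootstrapped target
    value = critic(s)
    td_error = td_target - value                      # TD error doubles as the advantage

    critic_loss = td_error.pow(2).mean()                      # eq (50)
    log_prob = torch.log(actor(s)[a])                         # log pi_theta(a_t | s_t)
    actor_loss = -(log_prob * td_error.detach()).mean()       # eq (51)
    return actor_loss, critic_loss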


7.12 Proximal Policy Optimization (PPO)

From 13_PPO

(52) Probability Ratio

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

(53) Clipped Ratio

$$\text{clipped ratio} = \text{clip}\big(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\big)$$

(54) PPO Clipped Objective

$$L^{CLIP}(\theta) = \hat{E}_t \left[ \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t) \right]$$

(55) Entropy Bonus

$$H(X) = - \sum p(x) \log_2 p(x)$$
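
A PyTorch-flavored sketch combining the clipped surrogate (52)-(54) with the entropy bonus (55) as a regularizer; tensor shapes, sign conventions, and entropy_coef are illustrative assumptions:

import torch

def ppo_loss(log_probs, old_log_probs, advantages, entropy,
             clip_eps=0.2, entropy_coef=0.01):
    # r_t(theta) = pi_theta / pi_theta_old, computed in log space, eq (52)
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)            # eq (53)
    surrogate = torch.min(ratio * advantages, clipped * advantages)     # eq (54)
    # maximizing surrogate + entropy is the same as minimizing the negated sum
    return -(surrogate.mean() + entropy_coef * entropy.mean())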


7.13 RLHF — Reinforcement Learning from Human Feedback

From 15_Intro_RLHF

(56) RLHF Core Idea

$$\text{Model} + \text{Human Feedback} \rightarrow \text{Improved Output}$$


7.14 Q-Learning vs DQN Reference

From Q_vs_DQN

(57) Tabular Q-Learning

$$Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \Big]$$

(58) DQN Function Approximation

$$Q(s,a;\theta) \approx Q(s,a)$$

(59) DQN Loss with Target Network

$$L(\theta) = \Big[ r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta) \Big]^2$$
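
A PyTorch-flavored sketch of (59), where q_net and target_net are assumed to map a batch of states to per-action values; the batching and gather indexing conventions are illustrative:

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r + gamma * max_a' Q(s', a'; theta^-), with dones = 1.0 at terminal states
        next_max = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_max
    return F.mse_loss(q_sa, target)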


7.15 Notation Reference

| Symbol | Description |
|---|---|
| $s, s'$ | Current state, next state |
| $a, a'$ | Current action, next action |
| $r, R_t$ | Reward |
| $\gamma$ | Discount factor $\in [0, 1]$ |
| $\alpha$ | Learning rate |
| $\pi(a \mid s)$ | Policy: probability of action $a$ in state $s$ |
| $\pi_\theta$ | Parameterized policy |
| $V^{\pi}(s)$ | State-value function under policy $\pi$ |
| $Q^{\pi}(s,a)$ | Action-value function under policy $\pi$ |
| $V^\ast(s)$ | Optimal state-value function |
| $Q^\ast(s,a)$ | Optimal action-value function |
| $G_t$ | Return: $\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ |
| $\delta_t$ | TD error |
| $A_t$ | Action at time $t$ |
| $\hat{A}_t$ | Advantage estimate |
| $\tau$ | Trajectory or temperature |
| $\epsilon$ | Exploration rate (ε-greedy) or clipping parameter |
| $\theta, \theta^-$ | Network parameters, target network parameters |
| $\beta$ | Importance-sampling exponent (prioritized replay) or KL penalty coefficient (RLHF) |
| $\sigma(z)$ | Sigmoid: $1/(1 + e^{-z})$ |

8. Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

9. License

This project is licensed under the MIT License - see the LICENSE file for details.


10. Connect

Twitter LinkedIn Stack Exchange GitHub
