Skip to content

gitnoob101/nlp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 

Repository files navigation

Transformer-Based Language Model

Introduction

This project implements a character-level transformer-based language model from scratch using PyTorch. The model is designed to generate text by predicting the next character in a sequence based on previous context. Unlike traditional NLP models, this approach does not rely on word embeddings but instead learns representations directly from character sequences.

Features

  • Implements self-attention with multiple attention heads
  • Uses position embeddings to capture sequence order
  • Includes a feedforward network for enhanced representation learning
  • Utilizes layer normalization for stable training
  • Trained using cross-entropy loss with token-level predictions
  • Can generate new sequences based on a given context

Model Architecture

The model architecture consists of the following key components:

1. Embedding Layer

The model begins with two embedding layers:

  • Token Embeddings: Each character in the input text is mapped to a high-dimensional space using nn.Embedding.
  • Position Embeddings: To account for the order of characters in a sequence, a position embedding layer is added to encode the position of each token. This is crucial as transformers do not inherently understand sequence order.

2. Self-Attention Mechanism

The core of the model is the self-attention mechanism, which enables each character to interact with all other characters in the sequence, forming contextual relationships.

  • For each token, the mechanism generates query (Q), key (K), and value (V) vectors.

  • The attention score is calculated using the following formula:

    Attention = softmax(QK^T / sqrt(d_k)) V
    

    Where:

    • Q represents the query matrix.
    • K represents the key matrix.
    • V represents the value matrix.
    • d_k is the dimensionality of the key vectors.
  • To prevent the model from attending to future tokens during training, a lower-triangular mask is applied to the attention scores, ensuring information flow only from past tokens.

3. Multi-Head Attention

Instead of relying on a single attention mechanism, the model employs multiple attention heads operating in parallel. Each head learns different aspects of the relationships between tokens. The outputs from all the attention heads are then concatenated and linearly projected back into the original embedding space.

4. Feedforward Network

Each transformer block incorporates a feedforward network to further process the representations. This network consists of:

  • A linear layer that expands the dimensionality of the input.
  • A ReLU (Rectified Linear Unit) activation function, introducing non-linearity.
  • A second linear layer that projects the representation back to the original dimensionality.

5. Residual Connections & Layer Normalization

To facilitate training and improve gradient flow, residual connections are implemented around both the self-attention mechanism and the feedforward network. Additionally, layer normalization is applied after each of these components to stabilize the training process.

6. Output Layer

The final layer of the model maps the transformed embeddings to the size of the vocabulary (the total number of unique characters in the dataset). This layer produces logits, representing the unnormalized probabilities for each possible next character. The model is trained using cross-entropy loss, which measures the difference between the predicted probability distribution and the actual target character.

Training Process

The training process involves the following steps:

  1. Dataset Preparation: The dataset is loaded from a text file, and each unique character is mapped to a unique integer index.
  2. Mini-Batch Training: The model is trained using mini-batches of sequences. Each sequence is fed into the network, and the cross-entropy loss is calculated by comparing the model's predictions with the actual next characters in the sequence.
  3. Optimization: The AdamW optimizer is used to update the model's parameters in order to minimize the calculated loss.
  4. Iteration and Evaluation: The training process runs for a predefined number of iterations. Periodically, the model's performance is evaluated on a separate validation dataset to monitor its generalization ability.

Text Generation

Once the model has been trained, it can generate new text sequences:

  1. Initialization: The generation process starts with an initial "seed" token or a short sequence of tokens.
  2. Prediction: The model takes the current sequence as input and predicts the probability distribution for the next character.
  3. Sampling: A character is sampled from the predicted probability distribution. This can be done using various strategies, such as taking the character with the highest probability or sampling based on the probabilities to introduce more randomness.
  4. Appending: The sampled character is appended to the current sequence.
  5. Repetition: Steps 2-4 are repeated for a fixed number of steps or until a specific stopping condition is met.

Installation & Usage

To run this project, follow these steps:

  1. Clone the Repository:
    git clone <repository_url>
    cd <repository_directory>
  2. Install Dependencies:
    pip install torch numpy
  3. Prepare Input Text File: Place your desired training text data in a file (e.g., input.txt).
  4. Run Training Script: Execute the training script (e.g., train.py). You may need to adjust hyperparameters in the script.
    python train.py --input_file input.txt
  5. Generate Text: Once training is complete, use the generation script (e.g., generate.py) to create new text based on the trained model.
    python generate.py --model_path path/to/trained_model.pth --seed "some initial text" --num_tokens 100
    (Note: You will need to create train.py and generate.py scripts based on the provided information.)

Future Improvements

Potential areas for future development include:

  • Implementing larger and more complex transformer architectures, such as GPT.
  • Fine-tuning the model on specific domain datasets to generate more specialized text.
  • Experimenting with different activation functions and optimization algorithms to potentially improve performance.

Conclusion

This project successfully demonstrates the application of the transformer architecture for character-level text generation. By leveraging self-attention, multi-head mechanisms, and feedforward networks, the model learns intricate relationships between characters, enabling the generation of coherent and contextually relevant text sequences. 🚀

About

a decoder only transformer

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages