This project implements a character-level transformer-based language model from scratch using PyTorch. The model is designed to generate text by predicting the next character in a sequence based on previous context. Unlike traditional NLP models, this approach does not rely on word embeddings but instead learns representations directly from character sequences.
- Implements self-attention with multiple attention heads
- Uses position embeddings to capture sequence order
- Includes a feedforward network for enhanced representation learning
- Utilizes layer normalization for stable training
- Trained using cross-entropy loss with token-level predictions
- Can generate new sequences based on a given context
The model architecture consists of the following key components:
The model begins with two embedding layers:
- Token Embeddings: Each character in the input text is mapped to a high-dimensional space using nn.Embedding.
- Position Embeddings: To account for the order of characters in a sequence, a position embedding layer is added to encode the position of each token. This is crucial because transformers do not inherently understand sequence order.
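A minimal sketch of these two embedding layers in PyTorch might look as follows; the specific sizes (`vocab_size`, `block_size`, `n_embd`) are illustrative assumptions, not values taken from this project:

```python
import torch
import torch.nn as nn

vocab_size = 65   # number of unique characters (assumed)
block_size = 8    # maximum context length (assumed)
n_embd = 32       # embedding dimensionality (assumed)

token_emb = nn.Embedding(vocab_size, n_embd)  # one vector per character
pos_emb = nn.Embedding(block_size, n_embd)    # one vector per position

idx = torch.randint(0, vocab_size, (4, block_size))  # a batch of 4 sequences
tok = token_emb(idx)                      # (4, 8, 32)
pos = pos_emb(torch.arange(block_size))   # (8, 32), broadcast over the batch
x = tok + pos                             # combined token + position representation
```

Because the two embeddings are simply summed, each position in `x` carries both "which character" and "where in the sequence" information.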
The core of the model is the self-attention mechanism, which enables each character to interact with all other characters in the sequence, forming contextual relationships.
- For each token, the mechanism generates query (Q), key (K), and value (V) vectors.
- The attention scores are calculated using the following formula:

  Attention = softmax(QK^T / sqrt(d_k)) V

  where:
  - Q represents the query matrix.
  - K represents the key matrix.
  - V represents the value matrix.
  - d_k is the dimensionality of the key vectors.
- To prevent the model from attending to future tokens during training, a lower-triangular mask is applied to the attention scores, ensuring that information flows only from past tokens.
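A single head of this masked self-attention can be sketched as below; the `Head` class name and all sizes are illustrative assumptions rather than this project's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of masked (causal) self-attention; sizes are illustrative."""
    def __init__(self, n_embd=32, head_size=16, block_size=8):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask: position t may attend only to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        # scaled dot-product scores: softmax(QK^T / sqrt(d_k))
        wei = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                       # (B, T, head_size)
```

Setting masked positions to `-inf` before the softmax gives them exactly zero attention weight, which is how the lower-triangular mask blocks information from future tokens.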
Instead of relying on a single attention mechanism, the model employs multiple attention heads operating in parallel. Each head learns different aspects of the relationships between tokens. The outputs from all the attention heads are then concatenated and linearly projected back into the original embedding space.
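One compact way to sketch this multi-head variant is to compute all heads with a single batched matrix multiply, then concatenate and project; the class name and sizes here are assumptions, not the project's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Multi-head causal self-attention; all sizes are illustrative."""
    def __init__(self, n_embd=32, num_heads=4, block_size=8):
        super().__init__()
        assert n_embd % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)  # fused Q, K, V
        self.proj = nn.Linear(n_embd, n_embd)  # project back to embedding space
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        hs = C // self.num_heads
        q, k, v = self.qkv(x).split(C, dim=-1)
        # reshape so each head attends independently: (B, heads, T, head_size)
        q, k, v = (t.view(B, T, self.num_heads, hs).transpose(1, 2) for t in (q, k, v))
        wei = q @ k.transpose(-2, -1) / hs ** 0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        out = (wei @ v).transpose(1, 2).reshape(B, T, C)  # concatenate the heads
        return self.proj(out)
```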
Each transformer block incorporates a feedforward network to further process the representations. This network consists of:
- A linear layer that expands the dimensionality of the input.
- A ReLU (Rectified Linear Unit) activation function, introducing non-linearity.
- A second linear layer that projects the representation back to the original dimensionality.
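A sketch of such a feedforward network; the 4x expansion factor is a common transformer convention and an assumption here, as is the size of `n_embd`:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward network: expand, apply ReLU, project back."""
    def __init__(self, n_embd=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand the dimensionality
            nn.ReLU(),                      # non-linearity
            nn.Linear(4 * n_embd, n_embd),  # project back to n_embd
        )

    def forward(self, x):
        return self.net(x)
```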
To facilitate training and improve gradient flow, residual connections are implemented around both the self-attention mechanism and the feedforward network. Additionally, layer normalization is applied after each of these components to stabilize the training process.
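A full block with these residual connections and post-component layer norms could be sketched as below; it uses PyTorch's built-in `nn.MultiheadAttention` for brevity, and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Transformer block: causal self-attention and a feedforward network,
    each wrapped in a residual connection followed by layer normalization."""
    def __init__(self, n_embd=32, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(n_embd, num_heads, batch_first=True)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd)
        )
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        T = x.size(1)
        # boolean causal mask: True marks positions that must NOT be attended to
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)
        x = self.ln1(x + attn_out)       # residual, then layer norm
        x = self.ln2(x + self.ffwd(x))   # residual, then layer norm
        return x
```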
The final layer of the model maps the transformed embeddings to the size of the vocabulary (the total number of unique characters in the dataset). This layer produces logits, representing the unnormalized probabilities for each possible next character. The model is trained using cross-entropy loss, which measures the difference between the predicted probability distribution and the actual target character.
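The final projection and the loss computation can be sketched as follows (sizes are illustrative assumptions); note that `F.cross_entropy` expects `(N, C)` logits and `(N,)` targets, so the batch and time dimensions are flattened first:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, n_embd = 65, 32               # illustrative sizes
lm_head = nn.Linear(n_embd, vocab_size)   # maps embeddings to vocabulary logits

x = torch.randn(4, 8, n_embd)             # (batch, time, embedding) from the blocks
targets = torch.randint(0, vocab_size, (4, 8))  # the actual next characters
logits = lm_head(x)                        # (4, 8, 65) unnormalized scores
# flatten (batch, time) so each position is one prediction
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
```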
The training process involves the following steps:
- Dataset Preparation: The dataset is loaded from a text file, and each unique character is mapped to a unique integer index.
- Mini-Batch Training: The model is trained using mini-batches of sequences. Each sequence is fed into the network, and the cross-entropy loss is calculated by comparing the model's predictions with the actual next characters in the sequence.
- Optimization: The AdamW optimizer is used to update the model's parameters in order to minimize the calculated loss.
- Iteration and Evaluation: The training process runs for a predefined number of iterations. Periodically, the model's performance is evaluated on a separate validation dataset to monitor its generalization ability.
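The steps above can be sketched as a runnable loop. The tiny in-memory text and the bigram-style placeholder model below are stand-ins so the loop itself executes; in the real script the data would come from the input file and the model would be the transformer described above:

```python
import torch
import torch.nn.functional as F

# Stand-in dataset: map each unique character to an integer index
text = "hello world, hello transformer"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text])
block_size, batch_size = 8, 4

def get_batch():
    """Sample a mini-batch of input sequences and their next-character targets."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])  # shifted by one
    return x, y

model = torch.nn.Embedding(len(chars), len(chars))  # placeholder bigram model
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

for step in range(100):
    xb, yb = get_batch()
    logits = model(xb)                               # (B, T, vocab)
    loss = F.cross_entropy(logits.view(-1, len(chars)), yb.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Periodic validation would follow the same pattern with the model in `eval()` mode under `torch.no_grad()`, using batches drawn from a held-out split.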
Once the model has been trained, it can generate new text sequences:
- Initialization: The generation process starts with an initial "seed" token or a short sequence of tokens.
- Prediction: The model takes the current sequence as input and predicts the probability distribution for the next character.
- Sampling: A character is sampled from the predicted probability distribution. This can be done using various strategies, such as taking the character with the highest probability or sampling based on the probabilities to introduce more randomness.
- Appending: The sampled character is appended to the current sequence.
- Repetition: Steps 2-4 are repeated for a fixed number of steps or until a specific stopping condition is met.
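This predict-sample-append loop can be sketched as below; the placeholder model and sizes are illustrative stand-ins for the trained transformer:

```python
import torch
import torch.nn.functional as F

vocab_size, block_size = 65, 8  # illustrative sizes
model = torch.nn.Embedding(vocab_size, vocab_size)  # stand-in for the trained model

@torch.no_grad()
def generate(idx, max_new_tokens):
    """Autoregressive sampling: predict, sample, append, repeat."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]         # crop to the context window
        logits = model(idx_cond)[:, -1, :]      # logits for the last position only
        probs = F.softmax(logits, dim=-1)       # probability distribution
        nxt = torch.multinomial(probs, num_samples=1)  # sample the next character
        idx = torch.cat([idx, nxt], dim=1)      # append and continue
    return idx

seed = torch.zeros((1, 1), dtype=torch.long)    # a single seed token
out = generate(seed, max_new_tokens=20)         # (1, 21) token indices
```

Replacing `torch.multinomial` with `probs.argmax(dim=-1, keepdim=True)` gives the greedy (highest-probability) strategy mentioned above, at the cost of less varied output.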
To run this project, follow these steps:
- Clone the Repository:

  ```sh
  git clone <repository_url>
  cd <repository_directory>
  ```

- Install Dependencies:

  ```sh
  pip install torch numpy
  ```

- Prepare Input Text File: Place your desired training text data in a file (e.g., input.txt).
- Run Training Script: Execute the training script (e.g., train.py). You may need to adjust hyperparameters in the script.

  ```sh
  python train.py --input_file input.txt
  ```

- Generate Text: Once training is complete, use the generation script (e.g., generate.py) to create new text based on the trained model.

  ```sh
  python generate.py --model_path path/to/trained_model.pth --seed "some initial text" --num_tokens 100
  ```

  (Note: You will need to create the train.py and generate.py scripts based on the provided information.)
Potential areas for future development include:
- Implementing larger and more complex transformer architectures, such as GPT.
- Fine-tuning the model on specific domain datasets to generate more specialized text.
- Experimenting with different activation functions and optimization algorithms to potentially improve performance.
This project successfully demonstrates the application of the transformer architecture for character-level text generation. By leveraging self-attention, multi-head mechanisms, and feedforward networks, the model learns intricate relationships between characters, enabling the generation of coherent and contextually relevant text sequences. 🚀