T.REX is a local tool for analyzing, deduplicating, clustering, and querying large-scale text datasets. It combines pretrained embeddings with LLM-powered pipelines and produces metadata-aware plots and integrations.
❗ Requires a machine with a dedicated NVIDIA GPU (CUDA 11.8+).
❌ Will NOT run in Docker, WSL, or headless environments due to GUI popups.
- ✅ CSV popup loader with column detection
- ✅ Embedding generation (sentence-transformers)
- ✅ Interactive EDA on metadata + selected text column
- ✅ Clustering and optional LLM labeling
- ✅ Near-duplicate analysis
- ✅ Text-to-SQL pipeline
- ✅ Full CLI-based UX for pipeline chaining
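Near-duplicate analysis over embeddings typically compares pairwise cosine similarity against a cutoff. Below is a minimal sketch of that idea; the function names and the 0.95 threshold are illustrative assumptions, not T.REX's actual implementation in `app/dedup.py`:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def near_duplicate_pairs(embeddings, threshold=0.95):
    # threshold is an illustrative default, not T.REX's actual setting
    return [
        (i, j)
        for i in range(len(embeddings))
        for j in range(i + 1, len(embeddings))
        if cosine(embeddings[i], embeddings[j]) >= threshold
    ]

vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(near_duplicate_pairs(vecs))  # → [(0, 1)]
```

A real run would use the sentence-transformers vectors generated by the embedding pipeline; the brute-force O(n²) loop here is only viable for small datasets.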
- Python 3.10
- NVIDIA GPU with CUDA 11.8+ drivers installed
- Display environment (no WSL or remote/headless)
# Clone the repo
git clone https://github.com/gauthamnairvm/trex-app.git
cd trex-app
# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# Additional dependencies
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
Create a .env file at the root (you can copy from .env.template):
GROQ_API_KEY=your_groq_api_key_here
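The LLM-backed pipelines read this key from the environment. A minimal sketch of how such a lookup might be done (the helper name is an illustrative assumption; the app may load `.env` differently, e.g. via python-dotenv):

```python
import os

def get_groq_api_key():
    """Fetch the Groq key from the environment; fail fast if it is missing."""
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY not set -- add it to your .env file "
            "(see .env.template) before using LLM-backed pipelines."
        )
    return key

os.environ["GROQ_API_KEY"] = "demo-key"  # for illustration only
print(get_groq_api_key())  # → demo-key
```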
python main.py
You'll see the CSV loader and the T.REX CLI. Try:
T.REX > trex_eda(metadata=['col1', 'col2'])
T.REX > trex_cluster()
T.REX > trex_dedup(stopwords=False)
T.REX > trex_text2sql(pii_mask=True)
trex-app/
├── app/
│ ├── clustering.py
│ ├── dedup.py
│ ├── embedding.py
│ ├── file_loader.py
│ ├── pipeline.py
│ └── text2sql_pipeline.py
├── data/ # CSVs and embeddings
├── results/ # Output plots and clustering results
├── main.py # Entry point
├── .env.template # Environment variable example
├── requirements.txt
└── README.md
This project is licensed under the MIT License.
You’re free to use, modify, and distribute it. Please give credit if you build on it.
T.REX is open for issues, suggestions, and pull requests.
To contribute:
- Fork the repo
- Create a feature branch
- Submit a PR with a clear description
- Limited pipelines: only four pipelines are supported at present (EDA, Deduplication, Clustering, and Text2SQL).
- File format restriction: currently supports only .csv files; other formats (e.g., Excel, JSON) are not yet supported.
- Single text column design: each session supports one designated text column, with the remaining columns optionally treated as metadata. Selecting a different text or metadata column requires reloading the file.
- Startup instability: occasionally the GUI file loader popup fails on the first try; restarting the session usually resolves the issue.
- Fixed LLM configuration: uses a single Groq-hosted model (llama3-70b-8192) with hardcoded prompt settings (temperature, max_tokens, stop) and no dynamic tuning. An API key must be provided in the .env file to use pipelines with LLM integration.
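Because the LLM settings are fixed, every request to the model is built from the same template. A sketch of what that might look like (the specific temperature, max_tokens, and stop values here are assumptions for illustration, not T.REX's actual numbers):

```python
def build_chat_request(prompt: str) -> dict:
    """Assemble a chat request with fixed, hardcoded settings.

    Mirrors the "no dynamic tuning" design described above; the numeric
    values are placeholders, not the app's real configuration.
    """
    return {
        "model": "llama3-70b-8192",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,   # assumed fixed value
        "max_tokens": 1024,   # assumed fixed value
        "stop": None,         # assumed fixed value
    }

req = build_chat_request("Label this cluster of texts.")
print(req["model"])  # → llama3-70b-8192
```

Exposing these as CLI parameters instead of constants would be the natural path to the "dynamic tuning" the current version lacks.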
Built and maintained by Variath Madhupal Gautham Nair (MSCS, Rutgers University-New Brunswick)