MeidiLprog/HuggingFace-projects


🤖 AutoData Agent – Autonomous Hugging Face AI for Data Cleaning & Modeling

An autonomous data science agent that cleans, explores, and models your data intelligently 🧠

Built with Hugging Face SmolAgents, Ollama, and Scikit-Learn


🧠 Project Overview

AutoData Agent is a modular and autonomous data assistant designed to:

  • 🧹 Inspect raw datasets (missing values, data types, anomalies)
  • 🧼 Clean and preprocess data (imputation, encoding, scaling)
  • 📊 Visualize key statistical properties (distributions, correlations, boxplots)
  • 🤖 Train machine learning models automatically (classification or regression)
  • ⚙️ Optimize the model with GridSearchCV to find the best-performing hyperparameters

This project demonstrates how a Hugging Face agent can orchestrate an end-to-end Data Science pipeline, making smart decisions and reasoning about the dataset structure.


🧩 Architecture

datacleaner-agent/
├── app.py                 # Main entry point
├── agent_logic.py         # Builds the Hugging Face Agent (model + tools)
├── tools/
│   ├── inspect.py         # InspectTool: automatic EDA (plots + summary)
│   ├── cleaning.py        # CleaningTool: data cleaning & encoding
│   └── train.py           # TrainTool: automatic ML training + evaluation
├── test_tools.py          # Local tests for each tool
├── requirements.txt
└── README.md
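As a rough illustration of how agent_logic.py could combine a model with tools, here is a minimal sketch. It assumes the `tool`, `CodeAgent`, and `LiteLLMModel` names exported by recent SmolAgents releases; `describe_csv` and `build_agent` are hypothetical names used only for this example and are not the project's actual InspectTool, CleaningTool, or TrainTool.

```python
import csv
import io


def describe_csv(csv_text: str) -> str:
    """Report the row count and column names of CSV text.

    Args:
        csv_text: Raw CSV content, header row first.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    return f"{len(body)} rows, columns: {', '.join(header)}"


def build_agent(model_id: str = "ollama/qwen2:1.5b"):
    """Wrap the plain function as a tool and hand it to a CodeAgent (hypothetical helper)."""
    # Imported lazily so describe_csv stays usable without smolagents installed.
    from smolagents import CodeAgent, LiteLLMModel, tool

    return CodeAgent(tools=[tool(describe_csv)], model=LiteLLMModel(model_id=model_id))


# The tool function is ordinary Python and can be checked on its own.
print(describe_csv("name,age\nAda,36\nLinus,54\n"))  # 2 rows, columns: name, age
```

Keeping each tool as a plain, typed, documented function makes it testable locally (as test_tools.py suggests) before any LLM is involved.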

🚀 How It Works

🔹 Step 1 – InspectTool

Performs an Exploratory Data Analysis (EDA):

  • Displays dataset info (shape, dtypes, missing values)
  • Generates histograms, boxplots, and correlation heatmaps
  • Detects data imbalances and null distributions

Example output:

  • df.info(), df.describe()
  • Automatic visualizations for numeric variables
  • Summary of missing data
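The summary part of this step might look roughly like the sketch below. It is not the project's inspect.py: the `summarize` function and the small sample frame are hypothetical, and the plotting calls are left out to keep the example short.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic facts an EDA step would report (hypothetical helper)."""
    missing = df.isna().sum()
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Keep only the columns that actually contain missing values.
        "missing": {col: int(n) for col, n in missing.items() if n > 0},
        "numeric_columns": df.select_dtypes(include="number").columns.tolist(),
    }


# Tiny made-up frame standing in for a real dataset.
df = pd.DataFrame(
    {
        "age": [22.0, None, 35.0, 58.0],
        "fare": [7.25, 71.28, 8.05, 26.55],
        "sex": ["male", "female", "female", None],
    }
)
report = summarize(df)
print(report["shape"])
print(report["missing"])
```

The numeric column list is what a plotting step would iterate over to draw histograms and boxplots.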

🔹 Step 2 – CleaningTool

Cleans and prepares the dataset:

  • Removes duplicates
  • Handles missing values (median or mode)
  • Encodes categorical variables (LabelEncoder or OneHotEncoder)
  • Scales numerical columns (StandardScaler)
  • Detects and drops low-variance features

Goal: produce a dataset ready for model training.
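A minimal sketch of these cleaning steps, under stated assumptions: the `clean` function and the sample data are hypothetical, and `pd.get_dummies` is swapped in for the LabelEncoder / OneHotEncoder mentioned above to keep the example short. Only zero-variance (constant) numeric columns are dropped here; the project's own threshold is not documented.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def clean(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Deduplicate, impute, scale and one-hot encode every column except `target`."""
    df = df.drop_duplicates().reset_index(drop=True)
    features = df.drop(columns=[target])

    numeric = features.select_dtypes(include="number").columns
    categorical = features.columns.difference(numeric)

    # Median for numeric columns, mode (most frequent value) for the rest.
    for col in numeric:
        features[col] = features[col].fillna(features[col].median())
    for col in categorical:
        features[col] = features[col].fillna(features[col].mode().iloc[0])

    # Drop constant numeric columns before scaling.
    constant = [c for c in numeric if features[c].nunique() <= 1]
    features = features.drop(columns=constant)
    numeric = [c for c in numeric if c not in constant]

    if numeric:
        features[numeric] = StandardScaler().fit_transform(features[numeric])
    features = pd.get_dummies(features, columns=list(categorical), dtype=float)
    features[target] = df[target]
    return features


# Made-up data: one duplicate row, one constant column, two missing values.
raw = pd.DataFrame(
    {
        "age": [22.0, None, 35.0, 22.0],
        "const": [1, 1, 1, 1],
        "sex": ["male", "female", None, "male"],
        "survived": [0, 1, 1, 0],
    }
)
cleaned = clean(raw, target="survived")
print(cleaned.columns.tolist())
```

Note that fitting the scaler on the full dataset, as this sketch does, leaks test-set statistics into training; fitting it on the training split only is the stricter option.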


🔹 Step 3 – TrainTool

Automatically trains a model based on the target variable type:

  • Detects whether the task is classification or regression

  • Chooses the appropriate RandomForest model

  • Runs GridSearchCV to optimize hyperparameters

  • Splits data into train/test (80/20)

  • Displays key metrics:

    • Classification: Accuracy, Precision, Recall, F1
    • Regression: RMSE, R²
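The training flow above could be sketched as follows. This is an illustration, not the project's train.py: `detect_task` uses an assumed heuristic (non-numeric or boolean targets, or integer targets with few distinct values, count as classification), the hyperparameter grid is arbitrary, and synthetic data from `make_classification` stands in for a real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, r2_score, recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split


def detect_task(y: pd.Series, max_classes: int = 20) -> str:
    """Assumed heuristic for choosing between classification and regression."""
    if not pd.api.types.is_numeric_dtype(y) or pd.api.types.is_bool_dtype(y):
        return "classification"
    if pd.api.types.is_integer_dtype(y) and y.nunique() <= max_classes:
        return "classification"
    return "regression"


def train(X: pd.DataFrame, y: pd.Series) -> dict:
    task = detect_task(y)
    estimator = (RandomForestClassifier(random_state=0) if task == "classification"
                 else RandomForestRegressor(random_state=0))
    # 80/20 split; the grid search only ever sees the training part.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}  # arbitrary small grid
    search = GridSearchCV(estimator, grid, cv=3)
    search.fit(X_train, y_train)
    pred = search.predict(X_test)
    if task == "classification":
        metrics = {
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred, average="weighted"),
            "recall": recall_score(y_test, pred, average="weighted"),
            "f1": f1_score(y_test, pred, average="weighted"),
        }
    else:
        metrics = {
            "rmse": mean_squared_error(y_test, pred) ** 0.5,
            "r2": r2_score(y_test, pred),
        }
    return {"task": task, "best_params": search.best_params_, **metrics}


# Synthetic binary classification problem with six features.
X_arr, y_arr = make_classification(n_samples=200, n_features=6, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
result = train(X, pd.Series(y_arr))
print(result["task"], result["best_params"])
```

Splitting before the grid search keeps the 20% test set untouched during tuning, so the reported metrics reflect unseen data.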

⚙️ Installation

Prerequisites

  • Python ≥ 3.11
  • Virtual environment recommended

git clone https://github.com/MeidiLprog/datacleaner-agent.git
cd datacleaner-agent
pip install -r requirements.txt

🧭 Usage

▶️ Run the agent

python app.py

The agent will:

  1. Load the Titanic dataset (default)
  2. Inspect and clean it automatically
  3. Train a predictive model on Survived
  4. Output key metrics and model summary

Expected Output:

Dataset successfully loaded ! (891, 12)
Agent ready !
Inspecting dataset...
Cleaning done...
GridSearch training...
Accuracy: 0.84
F1 Score: 0.81

☁️ Supported Execution Modes

🟡 Hugging Face Cloud (Recommended)

Runs the reasoning model via the Hugging Face Inference API.

Set your token:

export HUGGINGFACEHUB_API_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxx"

Then in agent_logic.py:

model = LiteLLMModel(model_id="huggingface/mistralai/Mistral-7B-Instruct-v0.2")

🔵 Local (Offline) – Ollama

If you prefer running locally:

  1. Install Ollama

  2. Pull the model:

    ollama pull qwen2:1.5b

  3. Replace the model in agent_logic.py:

    model = LiteLLMModel(model_id="ollama/qwen2:1.5b")
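Instead of editing agent_logic.py by hand when switching modes, the model ID could be chosen at startup. The sketch below is one possible approach, not part of the project: `pick_model_id` is a hypothetical helper that picks the cloud model when the token variable shown above is set and falls back to the local Ollama model otherwise.

```python
import os

# Model IDs taken from the two execution modes described above.
CLOUD_MODEL_ID = "huggingface/mistralai/Mistral-7B-Instruct-v0.2"
LOCAL_MODEL_ID = "ollama/qwen2:1.5b"


def pick_model_id(env=None) -> str:
    """Return the cloud model ID if a Hugging Face token is set, else the local one."""
    env = os.environ if env is None else env
    return CLOUD_MODEL_ID if env.get("HUGGINGFACEHUB_API_TOKEN") else LOCAL_MODEL_ID


# With no token in the (empty) environment mapping, the local model is selected.
print(pick_model_id({}))  # ollama/qwen2:1.5b
```

The returned string can then be passed as `model_id` to `LiteLLMModel`.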

🧰 Technologies Used

| Stack | Purpose |
|---|---|
| 🤗 Hugging Face SmolAgents | Agent orchestration |
| 🔮 LiteLLM / HfApiModel | LLM reasoning |
| 🧹 Pandas / NumPy | Data wrangling |
| 📊 Matplotlib / Seaborn | Data visualization |
| ⚙️ Scikit-Learn | Model training & GridSearch |
| 💻 Ollama | Local LLM inference (offline mode) |

💡 Example Screenshots

| Visualization | Description |
|---|---|
| EDA | Automatic data histograms & boxplots |
| Heatmap | Correlation matrix |
| Training | Model training output |

🧑‍💻 Author

Lefki Meidi
🎓 Data Science & Machine Learning Engineer
💬 LinkedIn • GitHub • HuggingFace


🌟 Project Highlights

  • Built entirely from scratch in less than 24h
  • Modular architecture (plug & play tools)
  • Hugging Face AI agent integrated locally and via cloud API
  • Fully autonomous workflow: from raw data → cleaned dataset → trained model
  • Ideal for data preprocessing automation or teaching agent reasoning

❤️ Acknowledgements

Special thanks to Hugging Face for the SmolAgents framework, and the open-source community for making AI accessible.

“Why spend hours cleaning data when your agent can do it for you?”
