MLNA — Multi-Layer Network Analysis

Graph-based ML · Automatic Categorical Attributes Selection and Class-Based Personalized Multilayer Graph Feature Engineering for Supervised Machine Learning

Overview

MLNA is a research pipeline that transforms tabular classification datasets into multilayer graphs, extracts PageRank-based descriptors (global and personalized), trains 8 ML classifiers on both the original ("classic") features and the graph-enriched features, and generates LaTeX/PDF reports that compare the two statistically.

Key features:

Three graph modes: MLNA-1 (monolayer), MLNA-K (combinatorial pairs), MLNA-TOP-K (top selected variables)
PageRank descriptors: global (GLO) and personalized (PER), with/without class nodes
8 classifiers trained with and without SMOTE; cost-sensitive evaluation per alpha value
Parallel execution via detached screen sessions (one per alpha)
Automated LaTeX report generation with cross-dataset statistical analysis
Optional email notifications for long-running experiments

Project Structure

M2_thesis/
├── mlna_experiment/                    # Main experiment directory
│   ├── scripts/                        # Pipeline scripts (run as Python modules)
│   │   ├── 01_data_preprocessing.py    # Load raw CSV, EDA, preprocess, stratified sample
│   │   ├── 02_data_split.py            # Train/test split satisfying a performance threshold
│   │   ├── 03_graph_construction.py    # Build multilayer graphs; extract PageRank descriptors
│   │   ├── 04_model_training.py        # Train 8 classifiers; evaluate with/without SMOTE
│   │   ├── 05_report_generation.py     # Aggregate results; produce LaTeX/PDF reports
│   │   └── cpu_limitation_usage.py     # CPU resource management
│   ├── modules/                        # Core functionality modules
│   │   ├── graph.py                    # Multilayer graph construction (NetworkX), PageRank
│   │   ├── modeling.py                 # Classifier training, SMOTE, evaluation, SHAP
│   │   ├── preprocessing.py            # Data cleaning, encoding, combination generation
│   │   ├── statistical.py              # Cross-dataset analysis; result aggregation
│   │   ├── report.py                   # Report generation helpers
│   │   ├── eda.py                      # Exploratory data analysis
│   │   ├── file.py                     # File I/O utilities
│   │   ├── mailing.py                  # Email notifications (SMTP/Gmail)
│   │   └── env.py                      # .env loader
│   ├── configs/                        # Dataset-specific configurations
│   │   └── <DatasetName>/
│   │       └── config.ini
│   ├── data/                           # Input datasets
│   │   └── raw/<DatasetName>/          # Raw CSV per dataset
│   ├── results/                        # Experiment outputs (per dataset / alpha / fold)
│   ├── reports/                        # Generated LaTeX and PDF reports
│   ├── logs/                           # Execution logs
│   ├── .env_mlna/                      # Python virtual environment
│   ├── requirements.txt                # Python dependencies
│   ├── example.env                     # Template for .env
│   ├── env.sh                          # Bash environment variables (alphas, cwd)
│   ├── launch.sh                       # Main pipeline launcher (per dataset)
│   ├── env_setup.sh                    # One-time venv setup
│   ├── latex_install.sh                # LaTeX installation helper
│   ├── kill_screens.sh                 # Kill detached screen sessions
│   ├── stop.sh                         # Emergency stop
│   ├── Makefile                        # Convenience targets for all datasets
│   └── README.md                       # Internal quick-start reference
├── scripting/                          # Auxiliary scripts
├── BRAINSTORMING.md                    # Architecture analysis and optimization notes
├── CHANGELOG.md                        # Version history
├── CLAUDE.md                           # Claude Code project guide
├── .gitignore
├── LICENSE
└── README.md                           # This file

Prerequisites

Required:

Python 3.9 or higher
pip
GNU Screen (for background parallel execution)
GNU Make
Bash shell (Linux / macOS)
GNU Parallel (Linux / macOS)

Optional:

LaTeX distribution for PDF report generation
- TeX Live (Linux): sudo apt-get install texlive-full
- MacTeX (macOS): brew install --cask mactex
Gmail account with an App Password (for email notifications)

Recommended system resources:

8 GB RAM minimum (16 GB+ for large datasets)
Multi-core CPU — parallelism is controlled via MAX_CORE in .env
10 GB+ free disk space

Installation

1. Clone the repository

git clone git@github.com:VictorNico/multilayer-graph-feature-extraction.git
cd multilayer-graph-feature-extraction

2. Install system dependencies

Ubuntu / Debian:

sudo apt-get update
sudo apt-get install -y screen parallel make python3 python3-venv python3-pip

macOS:

brew install screen parallel make python3

3. Install LaTeX (optional — required for PDF reports)

cd mlna_experiment
chmod +x latex_install.sh && ./latex_install.sh

4. Set up the Python virtual environment

cd mlna_experiment
chmod +x env_setup.sh && ./env_setup.sh

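This creates .env_mlna/ and installs all dependencies from requirements.txt. The script activates the environment as its last step, but when it is run as ./env_setup.sh that activation applies only to the script's own shell process. Activate the environment in your terminal with source .env_mlna/bin/activate before running the pipeline.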

Manual alternative:

python3 -m venv .env_mlna
source .env_mlna/bin/activate
pip install --upgrade pip setuptools
pip install -r requirements.txt

Configuration

All commands below are run from mlna_experiment/.

1. Environment variables (`.env`)

Copy the template and fill in your values:

cp example.env .env

.env template:

# ==============================================
# EMAIL CONFIGURATION
# ==============================================

# Gmail credentials
GMAIL_USER=your.address@gmail.com
GMAIL_APP_PASSWORD=your_16_char_app_password

# SMTP settings (defaults to Gmail)
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587

# Cost-sensitive alpha values (used by launch.sh)
ALPHAS="0.20, 0.50, 0.80"

# ==============================================
# EMAIL RECIPIENTS
# ==============================================

# Primary recipients (comma-separated)
EMAIL_RECIPIENTS=Name1 <email1@example.com>,Name2 <email2@example.com>

# CC recipients (optional)
EMAIL_CC=Name3 <email3@example.com>

# ==============================================
# COMPUTE RESOURCES
# ==============================================

MAX_CORE=5          # Maximum CPU cores per process
SIZE_DIVIDER=2      # Memory management: graph size divisor

To enable email notifications, generate a Gmail App Password at myaccount.google.com/apppasswords (requires 2-step verification).

2. Bash environment (`env.sh`)

cwd="$(pwd)"
alphas=(0.20 0.50 0.80)
graphWithClass=("True" "False")

Edit alphas here to change which alpha values are tested in parallel.

3. Dataset configuration (`config.ini`)

Each dataset requires configs/<DatasetName>/config.ini. Template based on the actual configs:

[GENERAL]
verbose = true
processed_dir = data/processed/
split_dir = data/splits/
results_dir = results/
report_dir = reports/
target_columns_type = cat

[DATA]
raw_path = data/raw/<DatasetName>/all.csv
domain = <3-LETTER-CODE>          # e.g. ADU, NUR, BAN
target = class                    # target column name in the CSV
dataset_delimiter = ,
index_col = None
na_values = ""
size = <total_row_count>

[PREPROCESSING]
to_remove =                       # comma-separated columns to drop (leave blank if none)
portion = <fraction>              # stratified fraction to keep, e.g. 0.077 for ~1 000 rows
encoding = utf-8

[SPLIT]
dataset_delimiter = ,
test_size = 0.2
random_state = 42
max_perf = 0.95                   # baseline accuracy threshold for split acceptance
index_col = 0

[GRAPH]
layers = 1
ohe_columns = <n_categorical>     # number of one-hot-encoded (categorical) features

[TRAINING]
cost = False
financialOption = {'amount': '<amount_col>', 'rate': '<rate_col>', 'duration': '<duration_col>'}
duration_divider = 12
rate_divider = 100

[REPORT]
result_path = results/evaluation_results.csv
send_email = false
recipient = user@example.com
shap_top = 10
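
The template above annotates several keys with trailing `#` comments, while the Adult example has none. If config.ini is read with Python's standard `configparser` under default settings (an assumption; this README does not say how the file is parsed), inline comments are not stripped and end up inside the value. A minimal sketch of the difference:

```python
import configparser

# Fragment of the template, including one of its inline comments.
TEMPLATE = """
[DATA]
domain = ADU          # e.g. ADU, NUR, BAN
index_col = None

[SPLIT]
test_size = 0.2
"""

# Default parser: the inline comment is kept as part of the value.
default_parser = configparser.ConfigParser()
default_parser.read_string(TEMPLATE)
raw_domain = default_parser["DATA"]["domain"]

# Parser told to treat "#" (preceded by whitespace) as an inline comment.
tolerant_parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
tolerant_parser.read_string(TEMPLATE)
clean_domain = tolerant_parser["DATA"]["domain"]

print(repr(raw_domain))
print(repr(clean_domain))

# Every value is a string until converted: "None" is text, not Python's None.
index_col = tolerant_parser["DATA"]["index_col"]
test_size = tolerant_parser.getfloat("SPLIT", "test_size")
print(repr(index_col), test_size)
```

When in doubt, delete the comments after copying the template, as the Adult example does.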

Adult dataset example (configs/Adult/config.ini):

[GENERAL]
verbose = true
processed_dir = data/processed/
split_dir = data/splits/
results_dir = results/
report_dir = reports/
target_columns_type = cat

[DATA]
raw_path = data/raw/Adult/all.csv
domain = ADU
target = class
dataset_delimiter = ,
index_col = None
na_values = ""
size = 48842

[PREPROCESSING]
to_remove =
portion = 0.0204741921
encoding = utf-8

[SPLIT]
dataset_delimiter = ,
test_size = 0.2
random_state = 42
max_perf = 0.95
index_col = 0

[GRAPH]
layers = 1
ohe_columns = 7

[TRAINING]
cost = False
financialOption = {'amount': 'loan_amnt', 'rate': 'loan_int_rate', 'duration': 'cb_person_cred_hist_length'}
duration_divider = 12
rate_divider = 100

[REPORT]
result_path = results/evaluation_results.csv
send_email = false
recipient = user@example.com
shap_top = 10
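
The `portion` value in the Adult example corresponds to roughly 1 000 rows out of 48 842. The helper below is hypothetical (it is not part of the pipeline) and only shows the arithmetic for choosing `portion` from a target sample size:

```python
def portion_for(target_rows: int, total_rows: int) -> float:
    """Fraction of the dataset to keep so that about target_rows remain."""
    if not 0 < target_rows <= total_rows:
        raise ValueError("target_rows must be between 1 and total_rows")
    return target_rows / total_rows

# Adult: `size = 48842` in the [DATA] section.
adult_portion = portion_for(1000, 48842)
print(adult_portion)

# Going the other way: rows kept by the value used in configs/Adult/config.ini.
rows_kept = round(0.0204741921 * 48842)
print(rows_kept)
```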

Usage

All commands run from mlna_experiment/. Activate the venv first:

source .env_mlna/bin/activate

Quick start — all datasets

make run-all

Runs the full pipeline sequentially for all datasets (Adult, BankMarketing, CarEvaluation, CreditRiskDataset, Diabetes, GermanCredit, LoanDataforDummyBank, LoanDefaultDataset, Mushroom, Nursery, StudentPerformance).

To change the pipeline mode before running, edit STEP in the Makefile:

STEP=2 — MLNA framework mode (default)
STEP=3 — Random combinatorial search mode

Single dataset

Via Makefile:

Dataset	Target
Adult	`make mlna_on_adu`
BankMarketing	`make mlna_on_ban`
CarEvaluation	`make mlna_on_car`
CreditRiskDataset	`make mlna_on_crd`
Diabetes	`make mlna_on_dia`
GermanCredit	`make mlna_on_ger`
LoanDataforDummyBank	`make mlna_on_ld4`
LoanDefaultDataset	`make mlna_on_ldd`
Mushroom	`make mlna_on_mus`
Nursery	`make mlna_on_nur`
StudentPerformance	`make mlna_on_stu`

Via launch script:

./launch.sh <DatasetName> <STEP>
# STEP=2 : framework mode (default)
# STEP=3 : random combination search

# Examples
./launch.sh Adult 2
./launch.sh Nursery 3

Individual pipeline scripts

Scripts are always called as Python modules from mlna_experiment/:

# Step 1 — Data preprocessing
# Args: --cwd (required), --dataset_folder (required)
python3 -m scripts.01_data_preprocessing \
    --cwd=$(pwd) \
    --dataset_folder=Adult

# Step 2 — Train/test split
# Args: --cwd (required), --dataset_folder (required)
python3 -m scripts.02_data_split \
    --cwd=$(pwd) \
    --dataset_folder=Adult

# Step 3 — Graph construction
# Args: --cwd (required), --dataset_folder (required), --alpha (required), --turn (required)
#       --graph_with_class (optional), --metric (optional)
python3 -m scripts.03_graph_construction \
    --cwd=$(pwd) \
    --dataset_folder=Adult \
    --alpha=0.50 \
    --turn=1 \
    --graph_with_class

# Step 4 — Model training
# Args: --cwd (required), --dataset_folder (required), --alpha (required), --turn (required)
#       --graph_with_class (optional), --baseline (optional), --metric (optional)
python3 -m scripts.04_model_training \
    --cwd=$(pwd) \
    --dataset_folder=Adult \
    --alpha=0.50 \
    --turn=1 \
    --graph_with_class

# Step 5 — Report generation
# Args: --cwd (required), --dataset_folder (required), --metric (optional)
python3 -m scripts.05_report_generation \
    --cwd=$(pwd) \
    --dataset_folder=Adult \
    --metric=""

Flag reference:

Flag	Scripts	Type	Description
`--cwd`	01 02 03 04 05	required	Working directory (use `$(pwd)`)
`--dataset_folder`	01 02 03 04 05	required	Dataset folder name (e.g. `Adult`)
`--alpha`	03 04	required	PageRank damping factor: `0.20`, `0.50`, `0.80`
`--turn`	03 04	required	`1` = MLNA-1 · `2` = MLNA-TOP-K · `3` = MLNA-K
`--graph_with_class`	03 04	flag	Include target class nodes in the graph (CX descriptors)
`--baseline`	04	flag	Train on raw features only (no graph descriptors)
`--metric`	03 04 05	optional	Filter by metric: `"accuracy"`, `"f1-score"`, or `""` for all
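
The flags above combine with the `alphas` and `graphWithClass` arrays from env.sh. The sketch below only builds the step 3 argument lists for that grid and does not run them. How launch.sh actually iterates is not shown in this README, so the loop order and the helper name `step3_commands` are assumptions.

```python
from itertools import product

ALPHAS = ["0.20", "0.50", "0.80"]   # same values as env.sh
GRAPH_WITH_CLASS = [True, False]    # mirrors graphWithClass in env.sh

def step3_commands(cwd: str, dataset: str, turn: int) -> list[list[str]]:
    """Build one argv list per (alpha, graph_with_class) combination."""
    commands = []
    for alpha, with_class in product(ALPHAS, GRAPH_WITH_CLASS):
        argv = [
            "python3", "-m", "scripts.03_graph_construction",
            f"--cwd={cwd}",
            f"--dataset_folder={dataset}",
            f"--alpha={alpha}",
            f"--turn={turn}",
        ]
        if with_class:
            argv.append("--graph_with_class")  # boolean flag: present or absent
        commands.append(argv)
    return commands

commands = step3_commands("/path/to/mlna_experiment", "Adult", turn=1)
for argv in commands:
    print(" ".join(argv))
```

Each list can be handed to `subprocess.run(argv, check=True)` from mlna_experiment/ if you want to run the grid from Python instead of launch.sh.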

Monitoring and logs

# List recent logs
make show-logs

# Follow the most recent log in real time
make tail-latest

# Summary of today's runs
make daily-summary

# List active screen sessions
screen -ls

# Attach to a session
screen -r <session_name>
# Detach: Ctrl+A then D

Stopping experiments

# Stop a specific dataset
make stop-mlna_on_adu      # Adult
make stop-mlna_on_nur      # Nursery
# etc.

# Stop all datasets
make stop-all

# Clean logs older than 7 days
make clean-old-logs

Pipeline Architecture

01_data_preprocessing
        ↓
02_data_split
        ↓
03_graph_construction   ← builds multilayer graph + extracts PageRank descriptors
        ↓                  (run in parallel per alpha via detached screen sessions)
04_model_training       ← trains 8 classifiers on classic + graph-enriched features
        ↓
05_report_generation    ← aggregates all results; produces LaTeX/PDF reports

A file model_turn_2_completed.dtvni is written at the end of step 4. Step 5 checks for this flag before aggregating results.
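
A minimal sketch of such a check, assuming the flag sits directly under `results/<DatasetName>/<alpha>/` as drawn in the Results Layout section. The function name `alphas_ready` is hypothetical, and the real check in step 5 may differ.

```python
from pathlib import Path
import tempfile

FLAG_NAME = "model_turn_2_completed.dtvni"

def alphas_ready(results_dir, dataset, alphas):
    """Return the alpha folders that already contain the completion flag."""
    base = Path(results_dir) / dataset
    return [a for a in alphas if (base / a / FLAG_NAME).is_file()]

# Demo on a throwaway directory tree: 0.20 and 0.50 finished, 0.80 did not.
with tempfile.TemporaryDirectory() as tmp:
    for alpha in ("0.20", "0.50"):
        folder = Path(tmp) / "Adult" / alpha
        folder.mkdir(parents=True)
        (folder / FLAG_NAME).touch()
    (Path(tmp) / "Adult" / "0.80").mkdir()
    ready = alphas_ready(tmp, "Adult", ["0.20", "0.50", "0.80"])

print(ready)
```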

Script	Input	Output
`01_data_preprocessing.py`	`data/raw/<Dataset>/all.csv`	`data/processed/<Dataset>/`
`02_data_split.py`	Preprocessed data	`data/splits/<Dataset>/`
`03_graph_construction.py`	Splits + config	`results/<Dataset>/<alpha>/.../`
`04_model_training.py`	Splits + descriptors	`results/<Dataset>/<alpha>/.../`
`05_report_generation.py`	All results	`reports/`

Graph Construction Modes

Mode	Flag `--turn`	Description
MLNA-1	`1`	One monolayer per variable (each categorical column becomes a layer)
MLNA-TOP-K	`2`	Top-K variables selected from MLNA-1 results form combined layers
MLNA-K	`3`	Combinatorial — k=2 variables per layer (all pairs)
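
To give a sense of how the MLNA-K mode grows, the sketch below enumerates all pairs of categorical columns with `itertools.combinations`. The column names are illustrative only, and whether the pipeline enumerates pairs exactly this way is an assumption.

```python
from itertools import combinations
from math import comb

# Illustrative column names; use the categorical columns of your dataset.
categorical = ["workclass", "education", "occupation", "relationship"]

# One candidate layer per unordered pair of variables (k = 2).
pair_layers = list(combinations(categorical, 2))
for pair in pair_layers:
    print(pair)

# Number of pairs for 7 categorical columns (ohe_columns = 7 in the Adult config).
n_pairs_adult = comb(7, 2)
print(n_pairs_adult)
```

The pair count grows quadratically with the number of categorical columns, which is worth keeping in mind when setting `MAX_CORE` and `SIZE_DIVIDER`.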

The --graph_with_class flag adds the target class as an extra graph layer, producing CX (borrower + class) vs MX (borrower only) descriptor variants.

Descriptors

For each graph layer configuration, six PageRank descriptor sets are extracted:

Code	Type	Description
`MX_GLO`	Global	PageRank on graph without class nodes — global personalization
`MX_PER`	Personalized	PageRank on graph without class nodes — per-instance personalization
`CX_GLO`	Global	PageRank on graph with class nodes — global personalization
`CX_PER`	Personalized	PageRank on graph with class nodes — per-instance personalization
`CY`	Class-only	PageRank restricted to class-layer nodes
`CXY`	All	Combined (CX + CY)

Combined feature sets assembled from the above:

GAP = GLO + PER merged
BOT = MX + CX merged (both class configurations)
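
To make the GLO versus PER distinction concrete, here is a dependency-free power-iteration PageRank on a toy graph. It is a sketch under assumptions: the project itself relies on NetworkX (see modules/graph.py), the graph shape (rows linked to the category values they hold) is a simplification, and the damping value and node names are illustrative.

```python
def pagerank(adj, damping=0.85, personalization=None, max_iter=200, tol=1e-12):
    """Power-iteration PageRank on an adjacency dict {node: [neighbours]}."""
    nodes = list(adj)
    if personalization is None:
        teleport = {v: 1.0 / len(nodes) for v in nodes}
    else:
        total = sum(personalization.values())
        teleport = {v: personalization.get(v, 0.0) / total for v in nodes}
    rank = dict(teleport)
    for _ in range(max_iter):
        new_rank = {v: (1.0 - damping) * teleport[v] for v in nodes}
        for u in nodes:
            if adj[u]:
                share = damping * rank[u] / len(adj[u])
                for v in adj[u]:
                    new_rank[v] += share
            else:
                # Dangling node: hand its mass back through the teleport vector.
                for v in nodes:
                    new_rank[v] += damping * rank[u] * teleport[v]
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank

# Toy single-layer graph: rows r1..r3 linked to the category value they hold.
edges = [("r1", "job=clerk"), ("r2", "job=clerk"), ("r3", "job=nurse")]
adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

glo = pagerank(adj)                               # uniform teleport (global)
per = pagerank(adj, personalization={"r1": 1.0})  # teleport only to r1

print(round(glo["r1"], 3), round(per["r1"], 3))
```

With the teleport mass concentrated on `r1`, nodes that `r1` cannot reach (here `r3`) receive no rank at all, which is what makes the personalized descriptor specific to one instance.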

Datasets

Dataset	Domain	Raw size	Config code
Adult	Census income	48 842	`ADU`
Audiology	Healthcare	192	`AUD`
BankMarketing	Marketing campaign	4 520	`BAN`
CarEvaluation	Automotive	1 727	`CAR`
CreditRiskDataset	Credit risk	32 580	`CRD`
Diabetes	Healthcare	101 765	`DIA`
GermanCredit	Credit risk	1 000	`GER`
LoanDataforDummyBank	Banking	13 648	`LD4`
LoanDefaultDataset	Banking	12 242	`LDD`
Mushroom	Biology	8 123	`MUS`
Nursery	Social	12 957	`NUR`
StudentPerformance	Education	648	`STU`

Data sources: UCI Machine Learning Repository, Kaggle.

Adding a new dataset:

1. Place the CSV at data/raw/<DatasetName>/all.csv
2. Create configs/<DatasetName>/config.ini (use the template above)
3. Add a Makefile target following the existing pattern (optional)
4. Run: ./launch.sh <DatasetName> 2

Results Layout

results/
└── <DatasetName>/
    └── <alpha>/                          # e.g. 0.20, 0.50, 0.80
        ├── cat/                          # target_columns_type
        │   ├── withClass/
        │   │   └── mlna_1/
        │   │       └── <variable>/
        │   │           ├── global/
        │   │           │   └── withClass/
        │   │           │       └── evaluation/
        │   │           │           └── *_metric_*.pkl
        │   │           ├── personalized/
        │   │           └── mixed/
        │   └── withoutClass/
        │       └── ...
        └── model_turn_2_completed.dtvni  # completion flag for step 5

Metrics evaluated per model:

Accuracy, Precision, Recall, F1-Score
ROC-AUC, PR-AUC
Cost-sensitive score: FPR × alpha + FNR × (1 − alpha)
Confusion matrix
SHAP feature importances (top-N configurable via shap_top in config)
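
The cost-sensitive score can be worked through from a confusion matrix. The sketch below uses the standard definitions FPR = FP / (FP + TN) and FNR = FN / (FN + TP). The function name and the example counts are made up for illustration and are not taken from the pipeline.

```python
def cost_sensitive_score(tn: int, fp: int, fn: int, tp: int, alpha: float) -> float:
    """FPR * alpha + FNR * (1 - alpha); lower is better."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return alpha * fpr + (1.0 - alpha) * fnr

# Made-up confusion matrix: FPR = 10/100 = 0.1, FNR = 20/50 = 0.4.
scores = {
    alpha: cost_sensitive_score(tn=90, fp=10, fn=20, tp=30, alpha=alpha)
    for alpha in (0.20, 0.50, 0.80)
}
for alpha, score in scores.items():
    print(f"alpha={alpha:.2f} -> {score:.2f}")
```

A small alpha weights missed positives (FNR) more heavily, and a large alpha weights false alarms (FPR) more heavily, so the same model can rank differently across the three alpha runs.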

Troubleshooting

Symptom	Fix
`venv not found`	`rm -rf .env_mlna && ./env_setup.sh`
`Permission denied` on scripts	`chmod +x launch.sh env_setup.sh kill_screens.sh stop.sh`
Dead screen sessions blocking	`screen -wipe`
`Out of memory` errors	Reduce `MAX_CORE` or increase `SIZE_DIVIDER` in `.env`
LaTeX compilation fails	`./latex_install.sh`
`ImportError` after update	`pip install -r requirements.txt --force-reinstall`
`TypeError: 'NoneType' object is not subscriptable` (step 4)	Ensure `config_df2` is loaded — see BRAINSTORMING.md §4

Log locations:

Pipeline logs: logs/*.log
Per-dataset logs: named <Dataset>_<YYYYMMDD_HHMMSS>.log
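
The naming pattern above sorts chronologically, so the latest run of a dataset can be picked out without opening the files. A small sketch with made-up file names (`parse_log_name` is a hypothetical helper, not part of the project):

```python
from datetime import datetime

def parse_log_name(filename: str):
    """Split '<Dataset>_<YYYYMMDD_HHMMSS>.log' into (dataset, datetime)."""
    stem = filename[:-len(".log")] if filename.endswith(".log") else filename
    dataset, date_part, time_part = stem.rsplit("_", 2)
    started = datetime.strptime(f"{date_part}_{time_part}", "%Y%m%d_%H%M%S")
    return dataset, started

# Made-up names following the documented pattern.
names = [
    "Adult_20260201_081500.log",
    "Adult_20260203_220000.log",
    "Nursery_20260202_120000.log",
]
latest_adult = max(
    (n for n in names if parse_log_name(n)[0] == "Adult"),
    key=lambda n: parse_log_name(n)[1],
)
print(latest_adult)
```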

License

This project is licensed under the MIT License — see the LICENSE file for details.

Last updated: February 2026 — version tracked in CHANGELOG.md
