Graph Sampling Quality Prediction

This repository includes the code and data for the Graph Sampling Quality Prediction, a machine learning (ML) based method for predicting the quality and performance of different graph sampling algorithms of three categories (node-based, edge-based, and traversal).

This tutorial includes instructions for graph data preparation, processing, sampling, model training, and evaluation, providing model reproducibility. It consists of the following sections:

Installation guide
Model description
Data generation
Graph feature extraction
Graph sampling and evaluation
Mutual information analysis
Feature selection and preparation
Data analysis
Model training
Model testing
Result analysis
ML explainability
Publications

Installation guide

Required libraries:

Networkx
Networkit
SnapStanford

Model description

Sampling algorithms

Our model predicts and recommends graph sampling algorithms of three categories, based on quality and performance metrics. It considers twelve sampling algorithms listed below:

Node-based sampling
- Random node
- Random degree node
Edge-based sampling
- Random edge
- Rendom node-edge
- Induced random edge
Traversal sampling
- Random jump
- Metropolis-Hastings random walk
- SnowBall
- Forest Fire
- Frontier
- Rank degree
- Expansion

Sampling quality metrics

We consider three quality metrics (distribution divergences) for evaluating sampling algorithms: degree distribution (D3), clustering coefficient distribution (C2D2), hop-plots distribution (HPD2), and hop-plots distribution for the largest connected component (HPDC) divergence.

Sampling performance metric

We consider the execution time of graph sampling algorithms as the performance metric.

ML models

Our tool consists of three ML models:

Random forest (RF)
k nearest neighbor (kNN)
Multi-layer perceptron (MLP)

Data generation

Dataset details

Train synthetic graphs

We generated five types of synthetic graphs, with the following settings.

Albert-Barabasi These graphs consist of 64 graphs with 10,000 ~ 100,000 nodes, and densities in [0.00001, 0.001], generated with Networkx. We calculate the new edge numbers per node parameter (NewEdgesPerNode) as follows (N: node numbers, D: graph density, m: new edge numbers per node):

$$m = \lfloor(N * D / 2)\rfloor$$

m: 1~42

Watts-Strogatz Watts-Strogatz graphs include 119 graphs with 10,000 ~ 100,000 nodes and densities in [0.00001, 0.001], generated with Networkx. The following equation gives the number of neighbors of each node (|Nei|) in the ring topology, given N and D:

$$|Nei| = \lfloor(D * (N - 1))\rfloor$$

$$|Nei|$$: 2~98

Erdos-Renyi These graphs consist of 173 graphs with 10,000 ~ 100,000 nodes and edge probabilities (densities) in [0.00001, 0.001], generated with Networkx.

PowerLawCluster PowerLawCluster graphs of Networkx include 60 graphs with adjusted parameters (new edge numbers per node m and TriangleProb) to produce densities in [0.00001, 0.001] and average clustering coefficient (CC) in [0.1, 0.6] extracted from real graph properties.

#Nodes	m	TriangleProb
10,000	2	0.2, 0.3, 0.4, 0.5, 0.6
10,000	3	0.2, 0.4, 0.6, 0.8, 0.9, 1
10,000	5	0.4, 0.6, 0.8, 1
15,000	2	0.2, 0.5, 0.7
15,000	5	0.3, 0.7, 1
15,000	10	0.5, 1
20,000	2	0.2, 0.4, 0.6, 0.8
20,000	6	0.6, 0.8, 1
20,000	8	0.6, 1
20,000	10	0.5, 1
25,000	2	0.2, 0.4, 0.6, 0.8
25,000	5	0.4, 0.8, 1
25,000	15	0.7, 1
30,000	2	0.2, 0.4, 0.6, 0.8
30,000	6	0.5, 1
30,000	15	0.8, 1
35,000	2	0.2, 0.4, 0.8
35,000	10	0.5, 1
35,000	20	0.9, 1
40,000	2	0.2, 0.5, 0.8
40,000	5	0.3, 0.5, 0.7, 1
40,000	10	0.7, 1
40,000	20	0.9, 1
45,000	2	0.2, 0.6, 0.8
45,000	10	0.9, 1
45,000	20	1

Forest-Fire The evolution-based Forest-Fire graphs consist of 36 graphs with 10,000 ~ 100,000 nodes and adjusted forward/backward probabilities to produce densities in [0.00004, 0.001]. We used Snap-stanford library for their generation.

#Nodes	forward/backward probability
10,000	0.1, 0.2, 0.3
15,000	0.1, 0.2, 0.3
20,000	0.1, 0.2, 0.3
25,000	0.01, 0.1, 0.2, 0.3
30,000	0.01, 0.1, 0.2, 0.3
35,000	0.01, 0.1, 0.2, 0.3
40,000	0.01, 0.1, 0.2, 0.3
45,000	0.01, 0.1, 0.2, 0.3
50,000	0.01, 0.2, 0.3

Stochastic Block Model The clustering based graphs of the stochastic block model consist of 180 graphs with the following characteristics, generated with Networkx ($K$: number of clusters, $\alpha$: inter-intra cluster probability ratio):

$$N$$: 5,000, 15,000, 25,000, 35,000, 45,000

$$D$$: 0.0001, 0.0005, 0.0009

$$K$$: 10, 15, 20, 25

$$\alpha$$: 0.01, 0.001, 0.1

We calculated parameters of these graphs (cluster size $$N_c$$, inter cluster density $$\rho'$$ and intra cluster density $$\rho$$) according to the following equations :

$$\rho = (D * (N-1) * K) / (N - K + \alpha * N / (K - 2))$$

$$\rho' = \alpha * \rho$$

$$N_c = \lfloor N/K\rfloor$$

Test synthetic graphs

We generated 35 testing graphs of three types Albert-Barabasi, Erdos-Renyi, and Watts-Strogatz with 150,000-450,000 nodes and the following parameters:

Albert-Barabasi with m of 1-22;
Watts-Strogatz with $$|Nei|$$ of 2-44;
Erdos-Renyi with edge probabilities in [0.000008,0.0001].

Real graphs

We also tested our models on 16 publicly available real-world graphs of co-authorship, citation, technology, collaboration, and social graphs.

Data generation instructions

To generate synthetic graphs inside folder "data preparation/graph generation" run the following command:

bash run_generate_synthetic_graphs -dataset_num data_folder_number

It generates synthetic graphs with the characteristics definded in dataset_info file of the specified folder number.

Graph feature extraction

This script extracts some time-consuming features i.e. node/edge betweenness, eigenvector, pagerank, clustering coefficient, components sizes, maximum spannig tree degrees, shortest path lengths and assortativity with their statistics and distributions.

Execution instruction

The following command to extracts the major features of all graphs in the specified folder:

bash run_parallel_major_feature_extraction_multi_graphs

Graph sampling and evaluation

It includes sampling from several graphs and evaluating samples under three quality metrics (degree, clustering coefficent and hop-plots distribution divergences) and execution time.

Execution instruction

The following command collects samples (in parallel) from the graphs in the specified folder with the desired sampling settings such as sampling algorithms and number of iterations:

bash sample_from_several_large_graphs

Then running this command evaluates samples under quality metrics.

bash run_parallel_processing_samples_multi_graph

Mutual information analysis and feature selection

Mutual information (MI) analysis in notebook samplings_analysis_MI.ipynb uses sklearn for selecting the most relevant features for each metric (quality and performance), having MI scores higher than 0.99 with at least one sampling algorithm. Concatenation of these features with sampling features (12 dimensional 1-hot vector for sampling algorithm, 3 dimensions for 1-hot vector of algorithm type and 1 dimention for rate) constitutes the input feature vector for ML models.

Feature normalization

We normalize all graphs features using maximum and exponential-logarithmic (EL) applied for the following statistics:

Maximum normalization: minimum, average and medium values
EL normalization: maximum and variance values, calculation times, and raw features.

The notebook create_datasets.ipynb includes the steps for normalization.

Data analysis

The notebook data_analysis.ipynb includes data analysis steps including duplicate train/test data elimination and visualization of train/test data features using t-SNE.

Model training

Model training consists of constructing feature vectors for each sampling metric and training three ML models, RF, MLP and kNN with tuning their hyper-parameters (see notebook train_models.ipynb).

Hyper-parameter tuning

We performed five-fold cross validation using GreadSearchCV with following search space for different hyper-parameters of ML models, which results in the following hyper-parameters for each ML model.

Model	Hyperparameter	Search Space	Tuned value (D3/C2D2/HPD2/HPDC)
kNN	algorithm	'auto', 'ball_tree', 'kd_tree', 'brute'	'auto', 'ball_tree', 'ball_tree', 'ball_tree'
kNN	n_neighbors	4, 5, 10, 15	4, 4, 15, 10
kNN	weights	'uniform', 'distance'	'distance', 'distance', 'distance', 'distance'
RF	bootstrap	True	True
RF	max_depth	None, 90, 100, 110	None, None, None, None
RF	max_features	2, 3	3, 3, 3, 3
RF	min_samples_leaf	2, 3, 4, 5	2, 2, 2, 2
RF	min_samples_split	8, 10, 12	8, 8, 8, 8
RF	n_estimators	100, 200, 300, 400	100, 400, 400, 400
MLP	activation	'tanh', 'relu'	'tanh', 'tanh', 'relu', 'relu'
MLP	hidden_layer_sizes	(30), (50), (100), (30,30), (50,50), (100,100)	(100,100), (30,30), (100,100), (50,50)
MLP	solver	'sgd', 'adam'	'adam'
MLP	alpha	0.0001, 0.05	0.0001
MLP	learning_rate	'constant', 'adaptive'	'constant'
MLP	shuffle	True	True
MLP	early_stopping	True	True

To run the model training and hyper-parameter tuning use train_models/train_models.ipynb.

Model testing

The notebook test_models/test_models.ipynb tests the trained models on the test set.

Result analysis

The notebook result_analysis/result_analysis.ipynb calculates RMSE of predictions for all metrics and compares top-k ranking accuracy in terms of Hits@k of ML models with two baseline methods random selection and k-best selection.

ML explainability

The notebook result_analysis/ML-explainability ranks different input graph and sampling features according to their calculated importance using LIME library. It provides average rankings for each ML model and quality metric.

Publications

This repository is the implementation of the following paper:

S. H. S. Dizaji, R. Farahani, J. M. Rožanec, D. Kimovski, A. Soylu and R. Prodan, "Graph Sampling Quality Prediction for Algorithm Recommendation," 2024 IEEE 31st International Conference on High Performance Computing, Data, and Analytics (HiPC), Bangalore, India, 2024, pp. 243-254, doi: 10.1109/HiPC62374.2024.00032.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
assets/images		assets/images
data preparation		data preparation
datasets		datasets
result_analysis		result_analysis
test_models		test_models
train_models		train_models
.~lock.README.md#		.~lock.README.md#
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Graph Sampling Quality Prediction

Installation guide

Model description

Sampling algorithms

Sampling quality metrics

Sampling performance metric

ML models

Data generation

Dataset details

Train synthetic graphs

Test synthetic graphs

Real graphs

Data generation instructions

Graph feature extraction

Execution instruction

Graph sampling and evaluation

Execution instruction

Mutual information analysis and feature selection

Feature normalization

Data analysis

Model training

Hyper-parameter tuning

Model testing

Result analysis

ML explainability

Publications

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Graph Sampling Quality Prediction

Installation guide

Model description

Sampling algorithms

Sampling quality metrics

Sampling performance metric

ML models

Data generation

Dataset details

Train synthetic graphs

Test synthetic graphs

Real graphs

Data generation instructions

Graph feature extraction

Execution instruction

Graph sampling and evaluation

Execution instruction

Mutual information analysis and feature selection

Feature normalization

Data analysis

Model training

Hyper-parameter tuning

Model testing

Result analysis

ML explainability

Publications

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages