samresume/AI-Code-Detection

🤖 Detecting AI-Generated vs. Human-Written Code

🔍 Overview

This project investigates the detection of AI-generated code versus human-written code, addressing academic integrity concerns in educational settings where generative models (e.g., ChatGPT, Codex) are increasingly used.

Traditional plagiarism tools such as MOSS and TF-IDF clustering fail to distinguish AI-generated code from human-written code. This notebook builds and compares several classification pipelines that use both lexical (TF-IDF) and structural (AST) features, combining a contrastive learning network with a neural classifier for stronger discrimination.


📘 Notebook Objective

This notebook:

  • Loads a labeled dataset of AI-generated vs. human-written code
  • Extracts TF-IDF and AST-based vector features
  • Trains machine learning classifiers (e.g., Random Forests)
  • Implements a contrastive learning pipeline with triplet loss
  • Evaluates models using standard classification metrics (Accuracy, F1, AUC)
  • Visualizes learned embeddings using t-SNE and PCA
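
The feature-extraction step above can be sketched roughly as follows; `extract_ast_counts` is a hypothetical helper (not necessarily the notebook's exact implementation) that uses Python's built-in ast module for the structural features:

```python
import ast
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer

def extract_ast_counts(code):
    """Count AST node types in a Python snippet (empty dict if unparsable)."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return {}
    return Counter(type(node).__name__ for node in ast.walk(tree))

samples = ["def add(a, b):\n    return a + b", "print('hello')"]

# Lexical features: TF-IDF over whitespace-separated code tokens
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
tfidf_matrix = vectorizer.fit_transform(samples)

# Structural features: AST node-type histograms
ast_features = [extract_ast_counts(code) for code in samples]
```

The AST histograms capture structure (how code is organized) that TF-IDF alone misses, which is what motivates combining the two feature families.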

Simply run the notebook top to bottom. All experiments and visualizations are included.
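
The contrastive step trains embeddings with a triplet loss, which pulls an anchor sample toward a same-class (positive) example and pushes it away from an opposite-class (negative) example. A minimal NumPy sketch of the loss itself (the notebook's actual embedding network is not reproduced here, and the `margin` value is an illustrative default):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss on embedding vectors:
    max(0, ||a - p||^2 - ||a - n||^2 + margin)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # anchor-negative distance
    return np.maximum(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so training focuses on triplets that are still confusable.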


📂 Files Used

  • project_Cs6890_PLP.ipynb — the main notebook containing all experiments
  • data.jsonl — the labeled dataset of AI-generated and human-written code samples

📥 Data Preparation

After downloading data.jsonl, place it in the notebook directory and run:

import pandas as pd

# Each line of data.jsonl holds one JSON record (code sample + label)
new_data = pd.read_json('data.jsonl', lines=True)
display(new_data)

This will load the dataset into a DataFrame for further processing.


🛠️ Libraries Used

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_curve, auc, classification_report, confusion_matrix
)
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import json

⚙️ How to Run

  1. Open project_Cs6890_PLP.ipynb and run all cells.
  2. The notebook will:
    • Train and validate models using stratified 2-fold cross-validation
    • Display classification metrics (Accuracy, Precision, Recall, F1-score, AUC)
    • Visualize embeddings using t-SNE and PCA
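
The cross-validation loop can be sketched as follows, using a synthetic feature matrix in place of the real TF-IDF/AST features (the classifier settings here are illustrative, not necessarily the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, auc, f1_score, roc_curve
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real TF-IDF/AST feature matrix and labels
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    preds = clf.predict(X[test_idx])
    probs = clf.predict_proba(X[test_idx])[:, 1]  # P(class=1) for the ROC curve
    fpr, tpr, _ = roc_curve(y[test_idx], probs)
    fold_scores.append({
        'accuracy': accuracy_score(y[test_idx], preds),
        'f1': f1_score(y[test_idx], preds),
        'auc': auc(fpr, tpr),
    })
```

StratifiedKFold keeps the AI/human class ratio the same in each fold, which matters for stable metrics when the dataset is small or imbalanced.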

📊 Evaluation

Models are evaluated using:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • AUC (Area Under the ROC Curve)
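
As a concrete illustration, all five metrics can be computed from a set of toy predictions (the labels and scores below are made up for the example):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_curve, auc)

y_true = [0, 0, 1, 1, 1]            # ground-truth labels (1 = AI-generated)
y_pred = [0, 1, 1, 1, 0]            # hard predictions
y_score = [0.2, 0.6, 0.9, 0.8, 0.4] # predicted probability of class 1

fpr, tpr, _ = roc_curve(y_true, y_score)
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # 3 of 5 correct = 0.6
    "precision": precision_score(y_true, y_pred),  # 2 TP / 3 predicted positive
    "recall": recall_score(y_true, y_pred),        # 2 TP / 3 actual positive
    "f1": f1_score(y_true, y_pred),
    "auc": auc(fpr, tpr),
}
```

AUC is threshold-independent: it scores how well `y_score` ranks AI-generated samples above human-written ones, so it complements the thresholded metrics above.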

🧠 Author & Notes

Developed for CS6890 - Programming Language Principles (PLP) at Utah State University
