187 changes: 187 additions & 0 deletions 01_load_sentiment_csv_to_mysql.ipynb
@@ -0,0 +1,187 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0b0b299a",
"metadata": {},
"source": [
"# Load Sentiment CSV into GHTorrent MySQL (Notebook 1)\n",
"\n",
"This notebook shows how to load the GitHub Gold Standard sentiment CSV into a MySQL database that already has the GHTorrent 2004 dump. It also includes quick checks to make sure the data loaded correctly."
]
},
{
"cell_type": "markdown",
"id": "bd368c02",
"metadata": {},
"source": [
"### Planned Output\n",
"By the end of this notebook, you should have:\n",
"1. A `comment_sentiment` table in MySQL\n",
"2. All rows from `comment_sentiment.csv` loaded\n",
"3. Query results that confirm row counts and valid joins to GHTorrent project data"
]
},
{
"cell_type": "markdown",
"id": "0acf37b4",
"metadata": {},
"source": [
"### Step 1: Get the data ready\n",
"\n",
"1. Download the [GitHub Gold Standard dataset](https://figshare.com/articles/dataset/A_gold_standard_for_polarity_of_emotions_of_software_developers_in_GitHub/11604597?file=21001260).\n",
"2. Rename the file to `comment_sentiment.csv`.\n",
"3. Download the [GHTorrent 2004 MySQL Database Dump](https://web.archive.org/web/20150206005357/http://ghtorrent.org/msr14.html) and make sure it is already loaded in your MySQL database (example: `github`).\n",
"4. Make sure MySQL can read your CSV file path (e.g., `~/Desktop/github/sentiment_github_dataset/comment_sentiment.csv`)\n",
"\n",
"Optional reference: [GHTorrent schema diagram](https://web.archive.org/web/20150206005412/http://ghtorrent.org/relational.html)."
]
},
{
"cell_type": "markdown",
"id": "038f5498",
"metadata": {},
"source": [
"### Step 2: Create the table, load the CSV, and run the original join queries\n",
"\n",
"Use these copy-ready blocks one at a time.\n",
"\n",
"Start MySQL with local file loading turned on. Run on bash:\n",
"\n",
"```bash\n",
"mysql --local-infile=1 -u root -p\n",
"```\n",
"---\n",
"\n",
"Select your database (e.g., `github`):\n",
"\n",
"```sql\n",
"USE github;\n",
"```\n",
"---\n",
"\n",
"Drop old table if it exists (safe to re-run):\n",
"\n",
"```sql\n",
"DROP TABLE IF EXISTS comment_sentiment;\n",
"```\n",
"\n",
"---\n",
"\n",
"Create table:\n",
"\n",
"```sql\n",
"CREATE TABLE comment_sentiment (\n",
" ID INT NULL,\n",
" Polarity VARCHAR(256) NULL,\n",
" Text TEXT NULL\n",
");\n",
"```\n",
"\n",
"---\n",
"\n",
"Load CSV (replace with your absolute path if needed):\n",
"\n",
"```sql\n",
"LOAD DATA LOCAL INFILE 'comment_sentiment.csv'\n",
"INTO TABLE comment_sentiment\n",
"FIELDS TERMINATED BY ';'\n",
"ENCLOSED BY '\"'\n",
"LINES TERMINATED BY '\\n'\n",
"IGNORE 1 LINES\n",
"(ID, Polarity, Text);\n",
"```\n",
"\n",
"---\n",
"\n",
"Query 1 — show joined sentiment + commit + project rows (sample view):\n",
"\n",
"```sql\n",
"-- Returns joined rows from sentiment comments to commit/project data\n",
"SELECT * FROM comment_sentiment s\n",
"INNER JOIN commit_comments cc ON s.ID = cc.comment_id\n",
"INNER JOIN commits c ON c.id = cc.commit_id\n",
"INNER JOIN projects p ON c.project_id = p.id;\n",
"```\n",
"\n",
"---\n",
"\n",
"Query 2 — count sentiment-linked comments by project name:\n",
"\n",
"```sql\n",
"-- Aggregates joined rows by project name and sorts by largest counts\n",
"SELECT name, count(name) as count FROM comment_sentiment s\n",
"INNER JOIN commit_comments cc ON s.ID = cc.comment_id\n",
"INNER JOIN commits c ON c.id = cc.commit_id\n",
"INNER JOIN projects p ON c.project_id = p.id\n",
"GROUP BY name\n",
"ORDER BY count desc;\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "f20c3d16",
"metadata": {},
"source": [
"### Step 3: Run validation checks\n",
"\n",
"Use these checks to confirm the load worked correctly.\n",
"\n",
"Check 1 — total rows (expected: 7122):\n",
"\n",
"```sql\n",
"SELECT COUNT(*) AS total_rows FROM comment_sentiment;\n",
"```\n",
"\n",
"---\n",
"\n",
"Check 2 — distinct comment IDs (expected: 7122):\n",
"\n",
"```sql\n",
"SELECT COUNT(DISTINCT ID) AS distinct_comment_ids FROM comment_sentiment;\n",
"```"
]
},
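{
"cell_type": "markdown",
"id": "7c1e4a9d",
"metadata": {},
"source": [
"Two further checks can help before moving on. They are a sketch rather than part of the original checks: the aliases `matched_rows` and `n` are illustrative, and no expected values are given because they depend on your GHTorrent load.\n",
"\n",
"```sql\n",
"-- Counts sentiment rows whose ID matches a commit comment in GHTorrent\n",
"SELECT COUNT(*) AS matched_rows\n",
"FROM comment_sentiment s\n",
"INNER JOIN commit_comments cc ON s.ID = cc.comment_id;\n",
"\n",
"-- Shows how many rows carry each polarity label\n",
"SELECT Polarity, COUNT(*) AS n\n",
"FROM comment_sentiment\n",
"GROUP BY Polarity\n",
"ORDER BY n DESC;\n",
"```\n",
"\n",
"A `matched_rows` value of zero suggests that the IDs did not load as integers or that the `commit_comments` table is empty."
]
},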
{
"cell_type": "markdown",
"id": "65b9dc8c",
"metadata": {},
"source": [
"### Optional troubleshooting\n",
"\n",
"If `LOAD DATA LOCAL INFILE` fails or the row count is too low:\n",
"\n",
"1. Check the row count. If it is below 7,122 comments, try the fixes below:\n",
"\n",
"```sql\n",
"SELECT COUNT(*) AS total_rows FROM comment_sentiment;\n",
"```\n",
"\n",
"2. Try these fixes:\n",
"- Use an absolute file path in `LOAD DATA LOCAL INFILE`\n",
"- Make sure `--local-infile=1` is enabled\n",
"- Make sure the file format matches your settings (`;` delimiter and quoted text)\n",
"\n",
"3. If needed, use the following Python CSV loader script (([import_csv_to_mysql.py](https://github.com/user-attachments/files/25094159/import_csv_to_mysql.py))), then run the same checks again. This option uses Python's CSV parser and requires the installation of `mysql-connector-python`."
]
},
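{
"cell_type": "markdown",
"id": "3f8d2b61",
"metadata": {},
"source": [
"The `--local-infile=1` client flag is only half of the setting: the server must also allow local loads. A minimal sketch for checking and enabling it, assuming your account has the privilege to change global variables:\n",
"\n",
"```sql\n",
"-- Shows whether the server accepts LOCAL INFILE requests (ON or OFF)\n",
"SHOW GLOBAL VARIABLES LIKE 'local_infile';\n",
"\n",
"-- Enables local loads on the running server (not persisted across restarts)\n",
"SET GLOBAL local_infile = 1;\n",
"```"
]
},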
{
"cell_type": "markdown",
"id": "55c20287",
"metadata": {},
"source": [
"### When to move to Notebook 2\n",
"\n",
"Move to Notebook 2 only after `total_rows = 7122` and join results are greater than zero."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}