sailuh · splimon · Apr 4, 2026
diff --git a/01_load_sentiment_csv_to_mysql.ipynb b/01_load_sentiment_csv_to_mysql.ipynb
@@ -0,0 +1,187 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "0b0b299a",
+   "metadata": {},
+   "source": [
+    "# Load Sentiment CSV into GHTorrent MySQL (Notebook 1)\n",
+    "\n",
+    "This notebook shows how to load the GitHub Gold Standard sentiment CSV into a MySQL database that already contains the GHTorrent 2014 (MSR14) dump. It also includes quick checks to confirm that the data loaded correctly."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bd368c02",
+   "metadata": {},
+   "source": [
+    "### Planned Output\n",
+    "By the end of this notebook, you should have:\n",
+    "1. A `comment_sentiment` table in MySQL\n",
+    "2. All rows from `comment_sentiment.csv` loaded\n",
+    "3. Query results that confirm row counts and valid joins to GHTorrent project data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0acf37b4",
+   "metadata": {},
+   "source": [
+    "### Step 1: Get the data ready\n",
+    "\n",
+    "1. Download the [GitHub Gold Standard dataset](https://figshare.com/articles/dataset/A_gold_standard_for_polarity_of_emotions_of_software_developers_in_GitHub/11604597?file=21001260).\n",
+    "2. Rename the file to `comment_sentiment.csv`.\n",
+    "3. Download the [GHTorrent 2014 (MSR14) MySQL database dump](https://web.archive.org/web/20150206005357/http://ghtorrent.org/msr14.html) and make sure it is already loaded into your MySQL database (example: `github`).\n",
+    "4. Make sure MySQL can read your CSV file path (e.g., `~/Desktop/github/sentiment_github_dataset/comment_sentiment.csv`).\n",
+    "\n",
+    "Optional reference: [GHTorrent schema diagram](https://web.archive.org/web/20150206005412/http://ghtorrent.org/relational.html)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "038f5498",
+   "metadata": {},
+   "source": [
+    "### Step 2: Create the table, load the CSV, and run the original join queries\n",
+    "\n",
+    "Use these copy-ready blocks one at a time.\n",
+    "\n",
+    "Start MySQL with local file loading turned on. Run this in bash:\n",
+    "\n",
+    "```bash\n",
+    "mysql --local-infile=1 -u root -p\n",
+    "```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "Select your database (e.g., `github`):\n",
+    "\n",
+    "```sql\n",
+    "USE github;\n",
+    "```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "Drop old table if it exists (safe to re-run):\n",
+    "\n",
+    "```sql\n",
+    "DROP TABLE IF EXISTS comment_sentiment;\n",
+    "```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "Create table:\n",
+    "\n",
+    "```sql\n",
+    "CREATE TABLE comment_sentiment (\n",
+    "  ID INT NULL,\n",
+    "  Polarity VARCHAR(256) NULL,\n",
+    "  Text TEXT NULL\n",
+    ");\n",
+    "```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "Load CSV (replace with your absolute path if needed):\n",
+    "\n",
+    "```sql\n",
+    "LOAD DATA LOCAL INFILE 'comment_sentiment.csv'\n",
+    "INTO TABLE comment_sentiment\n",
+    "FIELDS TERMINATED BY ';'\n",
+    "ENCLOSED BY '\"'\n",
+    "LINES TERMINATED BY '\\n'\n",
+    "IGNORE 1 LINES\n",
+    "(ID, Polarity, Text);\n",
+    "```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "Query 1 — show joined sentiment + commit + project rows (sample view):\n",
+    "\n",
+    "```sql\n",
+    "-- Returns joined rows from sentiment comments to commit/project data\n",
+    "SELECT * FROM comment_sentiment s\n",
+    "INNER JOIN commit_comments cc ON s.ID = cc.comment_id\n",
+    "INNER JOIN commits c ON c.id = cc.commit_id\n",
+    "INNER JOIN projects p ON c.project_id = p.id;\n",
+    "```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "Query 2 — count sentiment-linked comments by project name:\n",
+    "\n",
+    "```sql\n",
+    "-- Aggregates joined rows by project name and sorts by largest counts\n",
+    "SELECT name, count(name) as count FROM comment_sentiment s\n",
+    "INNER JOIN commit_comments cc ON s.ID = cc.comment_id\n",
+    "INNER JOIN commits c ON c.id = cc.commit_id\n",
+    "INNER JOIN projects p ON c.project_id = p.id\n",
+    "GROUP BY name\n",
+    "ORDER BY count desc;\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f20c3d16",
+   "metadata": {},
+   "source": [
+    "### Step 3: Run validation checks\n",
+    "\n",
+    "Use these checks to confirm the load worked correctly.\n",
+    "\n",
+    "Check 1 — total rows (expected: 7122):\n",
+    "\n",
+    "```sql\n",
+    "SELECT COUNT(*) AS total_rows FROM comment_sentiment;\n",
+    "```\n",
+    "\n",
+    "---\n",
+    "\n",
+    "Check 2 — distinct comment IDs (expected: 7122):\n",
+    "\n",
+    "```sql\n",
+    "SELECT COUNT(DISTINCT ID) AS distinct_comment_ids FROM comment_sentiment;\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "65b9dc8c",
+   "metadata": {},
+   "source": [
+    "### Optional troubleshooting\n",
+    "\n",
+    "If `LOAD DATA LOCAL INFILE` fails or the row count is too low:\n",
+    "\n",
+    "1. Check the row count. If it is below 7,122 rows, try the fixes in item 2:\n",
+    "\n",
+    "```sql\n",
+    "SELECT COUNT(*) AS total_rows FROM comment_sentiment;\n",
+    "```\n",
+    "\n",
+    "2. Try these fixes:\n",
+    "- Use an absolute file path in `LOAD DATA LOCAL INFILE`\n",
+    "- Make sure `--local-infile=1` is enabled\n",
+    "- Make sure the file format matches your settings (`;` delimiter and quoted text)\n",
+    "\n",
+    "3. If needed, use the Python CSV loader script ([import_csv_to_mysql.py](https://github.com/user-attachments/files/25094159/import_csv_to_mysql.py)), then run the same checks again. This option uses Python's CSV parser and requires `mysql-connector-python` to be installed."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "55c20287",
+   "metadata": {},
+   "source": [
+    "### When to move to Notebook 2\n",
+    "\n",
+    "Move to Notebook 2 only after Check 1 returns `total_rows = 7122` and the join queries in Step 2 return at least one row."
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
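
The Python fallback mentioned in troubleshooting item 3 can be sketched roughly as follows. This is not the linked `import_csv_to_mysql.py`; it is a minimal illustration of parsing the `;`-delimited, double-quoted CSV with Python's `csv` module and inserting the rows through a DB-API connection. The function names (`read_sentiment_rows`, `insert_rows`, `count_checks`) and the sample rows are hypothetical, and the demo uses an in-memory SQLite database as a stand-in so it runs without a MySQL server.

```python
import csv
import io
import sqlite3  # stand-in database for the demo only; the notebook targets MySQL


def read_sentiment_rows(handle):
    """Yield (ID, Polarity, Text) tuples from a ';'-delimited, double-quoted CSV."""
    reader = csv.reader(handle, delimiter=";", quotechar='"')
    next(reader, None)  # skip the header row, mirroring IGNORE 1 LINES
    for record in reader:
        if not record:
            continue  # ignore blank lines
        if len(record) != 3:
            raise ValueError(f"unexpected field count near line {reader.line_num}")
        comment_id, polarity, text = record
        yield int(comment_id), polarity, text


def insert_rows(conn, rows, placeholder="%s"):
    """Insert rows through a DB-API connection; return the number of rows sent."""
    marks = ", ".join([placeholder] * 3)
    sql = f"INSERT INTO comment_sentiment (ID, Polarity, Text) VALUES ({marks})"
    rows = list(rows)
    cursor = conn.cursor()
    cursor.executemany(sql, rows)
    conn.commit()
    return len(rows)


def count_checks(conn):
    """Return (total_rows, distinct_comment_ids), as in Check 1 and Check 2."""
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*), COUNT(DISTINCT ID) FROM comment_sentiment")
    total, distinct = cursor.fetchone()
    return total, distinct


if __name__ == "__main__":
    # Invented sample rows; the second Text value spans two lines inside quotes.
    sample = (
        'ID;Polarity;Text\n'
        '1;"positive";"Nice fix; thanks"\n'
        '2;"negative";"line one\nline two"\n'
    )
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comment_sentiment (ID INT, Polarity VARCHAR(256), Text TEXT)"
    )
    sent = insert_rows(conn, read_sentiment_rows(io.StringIO(sample)), placeholder="?")
    print(sent, count_checks(conn))
```

Against MySQL, the same functions would presumably take a connection from `mysql.connector.connect(...)` and keep the default `%s` placeholder. Opening the real file with `open(path, newline="", encoding="utf-8")` is an assumption about its encoding, so adjust it if the CSV uses a different one.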