diff --git a/01_load_sentiment_csv_to_mysql.ipynb b/01_load_sentiment_csv_to_mysql.ipynb
new file mode 100644
index 0000000..78548ab
--- /dev/null
+++ b/01_load_sentiment_csv_to_mysql.ipynb
@@ -0,0 +1,187 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "0b0b299a",
+ "metadata": {},
+ "source": [
+ "# Load Sentiment CSV into GHTorrent MySQL (Notebook 1)\n",
+ "\n",
+ "This notebook shows how to load the GitHub Gold Standard sentiment CSV into a MySQL database that already has the GHTorrent 2004 dump. It also includes quick checks to make sure the data loaded correctly."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bd368c02",
+ "metadata": {},
+ "source": [
+ "### Planned Output\n",
+ "By the end of this notebook, you should have:\n",
+ "1. A `comment_sentiment` table in MySQL\n",
+ "2. All rows from `comment_sentiment.csv` loaded\n",
+ "3. Query results that confirm row counts and valid joins to GHTorrent project data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0acf37b4",
+ "metadata": {},
+ "source": [
+ "### Step 1: Get the data ready\n",
+ "\n",
+ "1. Download the [GitHub Gold Standard dataset](https://figshare.com/articles/dataset/A_gold_standard_for_polarity_of_emotions_of_software_developers_in_GitHub/11604597?file=21001260).\n",
+ "2. Rename the file to `comment_sentiment.csv`.\n",
+ "3. Download the [GHTorrent 2004 MySQL Database Dump](https://web.archive.org/web/20150206005357/http://ghtorrent.org/msr14.html) and make sure it is already loaded in your MySQL database (example: `github`).\n",
+ "4. Make sure MySQL can read your CSV file path (e.g., `~/Desktop/github/sentiment_github_dataset/comment_sentiment.csv`)\n",
+ "\n",
+ "Optional reference: [GHTorrent schema diagram](https://web.archive.org/web/20150206005412/http://ghtorrent.org/relational.html)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "038f5498",
+ "metadata": {},
+ "source": [
+ "### Step 2: Create the table, load the CSV, and run the original join queries\n",
+ "\n",
+ "Use these copy-ready blocks one at a time.\n",
+ "\n",
+ "Start MySQL with local file loading turned on. Run on bash:\n",
+ "\n",
+ "```bash\n",
+ "mysql --local-infile=1 -u root -p\n",
+ "```\n",
+ "---\n",
+ "\n",
+ "Select your database (e.g., `github`):\n",
+ "\n",
+ "```sql\n",
+ "USE github;\n",
+ "```\n",
+ "---\n",
+ "\n",
+ "Drop old table if it exists (safe to re-run):\n",
+ "\n",
+ "```sql\n",
+ "DROP TABLE IF EXISTS comment_sentiment;\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Create table:\n",
+ "\n",
+ "```sql\n",
+ "CREATE TABLE comment_sentiment (\n",
+ " ID INT NULL,\n",
+ " Polarity VARCHAR(256) NULL,\n",
+ " Text TEXT NULL\n",
+ ");\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Load CSV (replace with your absolute path if needed):\n",
+ "\n",
+ "```sql\n",
+ "LOAD DATA LOCAL INFILE 'comment_sentiment.csv'\n",
+ "INTO TABLE comment_sentiment\n",
+ "FIELDS TERMINATED BY ';'\n",
+ "ENCLOSED BY '\"'\n",
+ "LINES TERMINATED BY '\\n'\n",
+ "IGNORE 1 LINES\n",
+ "(ID, Polarity, Text);\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Query 1 — show joined sentiment + commit + project rows (sample view):\n",
+ "\n",
+ "```sql\n",
+ "-- Returns joined rows from sentiment comments to commit/project data\n",
+ "SELECT * FROM comment_sentiment s\n",
+ "INNER JOIN commit_comments cc ON s.ID = cc.comment_id\n",
+ "INNER JOIN commits c ON c.id = cc.commit_id\n",
+ "INNER JOIN projects p ON c.project_id = p.id;\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Query 2 — count sentiment-linked comments by project name:\n",
+ "\n",
+ "```sql\n",
+ "-- Aggregates joined rows by project name and sorts by largest counts\n",
+ "SELECT name, count(name) as count FROM comment_sentiment s\n",
+ "INNER JOIN commit_comments cc ON s.ID = cc.comment_id\n",
+ "INNER JOIN commits c ON c.id = cc.commit_id\n",
+ "INNER JOIN projects p ON c.project_id = p.id\n",
+ "GROUP BY name\n",
+ "ORDER BY count desc;\n",
+ "```"
+ ]
+ },
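+ {
+ "cell_type": "markdown",
+ "id": "c5d1e2f3",
+ "metadata": {},
+ "source": [
+ "### Optional: preview the CSV format in Python\n",
+ "\n",
+ "A minimal sketch of how the file format above is parsed (`;` delimiter, `\"` quoting, one header line). The two-row `sample` string is hypothetical, not taken from the dataset; it only illustrates why quoted fields can safely contain semicolons:\n",
+ "\n",
+ "```python\n",
+ "import csv\n",
+ "import io\n",
+ "\n",
+ "# Hypothetical sample in the comment_sentiment.csv layout\n",
+ "sample = 'ID;Polarity;Text\\n1;positive;\"Nice fix; thanks!\"\\n2;negative;\"This breaks the build\"\\n'\n",
+ "\n",
+ "reader = csv.reader(io.StringIO(sample), delimiter=';', quotechar='\"')\n",
+ "header = next(reader)  # skip the header row, as IGNORE 1 LINES does\n",
+ "rows = list(reader)\n",
+ "print(rows[0])  # ['1', 'positive', 'Nice fix; thanks!']\n",
+ "```\n",
+ "\n",
+ "If `LOAD DATA` loads fewer rows than expected, the delimiter or quoting settings are the usual suspects."
+ ]
+ },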
+ {
+ "cell_type": "markdown",
+ "id": "f20c3d16",
+ "metadata": {},
+ "source": [
+ "### Step 3: Run validation checks\n",
+ "\n",
+ "Use these checks to confirm the load worked correctly.\n",
+ "\n",
+ "Check 1 — total rows (expected: 7122):\n",
+ "\n",
+ "```sql\n",
+ "SELECT COUNT(*) AS total_rows FROM comment_sentiment;\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Check 2 — distinct comment IDs (expected: 7122):\n",
+ "\n",
+ "```sql\n",
+ "SELECT COUNT(DISTINCT ID) AS distinct_comment_ids FROM comment_sentiment;\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "65b9dc8c",
+ "metadata": {},
+ "source": [
+ "### Optional troubleshooting\n",
+ "\n",
+ "If `LOAD DATA LOCAL INFILE` fails or the row count is too low:\n",
+ "\n",
+ "1. Check the row count. If it is below 7,122 comments, try the fixes below:\n",
+ "\n",
+ "```sql\n",
+ "SELECT COUNT(*) AS total_rows FROM comment_sentiment;\n",
+ "```\n",
+ "\n",
+ "2. Try these fixes:\n",
+ "- Use an absolute file path in `LOAD DATA LOCAL INFILE`\n",
+ "- Make sure `--local-infile=1` is enabled\n",
+ "- Make sure the file format matches your settings (`;` delimiter and quoted text)\n",
+ "\n",
+ "3. If needed, use the following Python CSV loader script (([import_csv_to_mysql.py](https://github.com/user-attachments/files/25094159/import_csv_to_mysql.py))), then run the same checks again. This option uses Python's CSV parser and requires the installation of `mysql-connector-python`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "55c20287",
+ "metadata": {},
+ "source": [
+ "### When to move to Notebook 2\n",
+ "\n",
+ "Move to Notebook 2 only after `total_rows = 7122` and join results are greater than zero."
+ ]
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/02_explore_gh_torrent_tables.ipynb b/02_explore_gh_torrent_tables.ipynb
new file mode 100644
index 0000000..b352660
--- /dev/null
+++ b/02_explore_gh_torrent_tables.ipynb
@@ -0,0 +1,237 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "1b3dd1e0",
+ "metadata": {},
+ "source": [
+ "# Explore GHTorrent Tables for Sentiment Mapping (Notebook 2)\n",
+ "\n",
+ "This notebook helps you understand where sentiment-labeled comments are stored in GHTorrent and how they connect to projects. These checks are for exploration and validation. You do not need to run every query to run the end-to-end workflow."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "57b61bad",
+ "metadata": {},
+ "source": [
+ "### Planned Output\n",
+ "By the end of this notebook, you should have:\n",
+ "1. A clear view of how sentiment comments are split across commit vs PR comment tables\n",
+ "2. A ranked list of projects with sentiment-labeled commit comments\n",
+ "3. A ranked list of projects with sentiment-labeled PR comments\n",
+ "4. A global summary of comments reachable from canonical repos vs forks"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ac09f08e",
+ "metadata": {},
+ "source": [
+ "### Check 1: How sentiment comments are distributed\n",
+ "\n",
+ "Use these queries to see how many sentiment comments are in commit comments, PR comments, and both tables.\n",
+ "\n",
+ "Count sentiment comments in `commit_comments` (expected: 4317):\n",
+ "\n",
+ "```sql\n",
+ "SELECT COUNT(*)\n",
+ "FROM comment_sentiment s\n",
+ "INNER JOIN commit_comments cc ON s.ID = cc.comment_id;\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Count sentiment comments in `pull_request_comments` (expected: 2890):\n",
+ "\n",
+ "```sql\n",
+ "SELECT COUNT(*)\n",
+ "FROM comment_sentiment s\n",
+ "INNER JOIN pull_request_comments prc ON s.ID = prc.comment_id;\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Count overlap that appears in both tables (expected: 85):\n",
+ "\n",
+ "```sql\n",
+ "SELECT COUNT(*) AS both_tables\n",
+ "FROM comment_sentiment s\n",
+ "INNER JOIN commit_comments cc ON s.ID = cc.comment_id\n",
+ "INNER JOIN pull_request_comments prc ON s.ID = prc.comment_id;\n",
+ "```\n",
+ "\n",
+ "Quick interpretation:\n",
+ "- Commit-only = 4317 - 85 = 4232\n",
+ "- PR-only = 2890 - 85 = 2805\n",
+ "- Both = 85\n",
+ "- Total unique comments = 7122"
+ ]
+ },
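+ {
+ "cell_type": "markdown",
+ "id": "d2f8b1a7",
+ "metadata": {},
+ "source": [
+ "The interpretation above is plain inclusion-exclusion; a quick sketch using the expected counts:\n",
+ "\n",
+ "```python\n",
+ "# Expected counts from the three queries above\n",
+ "in_commit_comments = 4317\n",
+ "in_pr_comments = 2890\n",
+ "in_both = 85\n",
+ "\n",
+ "commit_only = in_commit_comments - in_both\n",
+ "pr_only = in_pr_comments - in_both\n",
+ "total_unique = commit_only + pr_only + in_both  # inclusion-exclusion\n",
+ "\n",
+ "print(commit_only, pr_only, total_unique)  # 4232 2805 7122\n",
+ "```\n",
+ "\n",
+ "If your query results differ, recompute these numbers before trusting the 7122 total."
+ ]
+ },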
+ {
+ "cell_type": "markdown",
+ "id": "f6cee0f8",
+ "metadata": {},
+ "source": [
+ "### Check 2: Projects with the most sentiment-labeled commit comments\n",
+ "\n",
+ "Use this to rank projects by number of labeled commit comments.\n",
+ "\n",
+ "```sql\n",
+ "SELECT p.id, p.name, p.url, COUNT(DISTINCT s.ID) AS labeled_comment_count\n",
+ "FROM projects p\n",
+ "INNER JOIN commits c ON p.id = c.project_id\n",
+ "INNER JOIN commit_comments cc ON c.id = cc.commit_id\n",
+ "INNER JOIN comment_sentiment s ON cc.comment_id = s.ID\n",
+ "GROUP BY p.id, p.name, p.url\n",
+ "ORDER BY labeled_comment_count DESC;\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Use this to inspect example rows for one project (replace `{owner}` and `{repo}`):\n",
+ "\n",
+ "```sql\n",
+ "SELECT c.sha, p.url, p.name, s.ID AS comment_id, s.Text AS comment_text\n",
+ "FROM commits c\n",
+ "INNER JOIN projects p ON c.project_id = p.id\n",
+ "INNER JOIN commit_comments cc ON c.id = cc.commit_id\n",
+ "INNER JOIN comment_sentiment s ON cc.comment_id = s.ID\n",
+ "WHERE p.url = 'https://api.github.com/repos/{owner}/{repo}';\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a37a5ebc",
+ "metadata": {},
+ "source": [
+ "### Check 3: Projects with the most sentiment-labeled PR comments\n",
+ "\n",
+ "Use this to rank projects by number of labeled PR comments.\n",
+ "\n",
+ "```sql\n",
+ "SELECT p.id, p.name, p.url, COUNT(DISTINCT s.ID) AS labeled_comment_count\n",
+ "FROM projects p\n",
+ "INNER JOIN pull_requests pr ON p.id = pr.base_repo_id\n",
+ "INNER JOIN pull_request_comments prc ON pr.id = prc.pull_request_id\n",
+ "INNER JOIN comment_sentiment s ON prc.comment_id = s.ID\n",
+ "GROUP BY p.id, p.name, p.url\n",
+ "ORDER BY labeled_comment_count DESC;\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a6a740e8",
+ "metadata": {},
+ "source": [
+ "### Check 4: Canonical repo vs fork accessibility summary\n",
+ "\n",
+ "This query estimates how many sentiment comments are reachable from canonical repos vs only from forks.\n",
+ "\n",
+ "```sql\n",
+ "WITH RECURSIVE project_root AS (\n",
+ " SELECT p.id AS project_id, p.id AS root_id\n",
+ " FROM projects p\n",
+ " WHERE p.forked_from IS NULL\n",
+ " UNION ALL\n",
+ " SELECT c.id AS project_id, pr.root_id\n",
+ " FROM projects c\n",
+ " JOIN project_root pr ON c.forked_from = pr.project_id\n",
+ "),\n",
+ "comment_project_rows AS (\n",
+ " SELECT cs.ID AS comment_id, c.project_id, 'commit_comment' AS source_tag\n",
+ " FROM comment_sentiment cs\n",
+ " JOIN commit_comments cc ON cs.ID = cc.comment_id\n",
+ " JOIN commits c ON cc.commit_id = c.id\n",
+ "\n",
+ " UNION ALL\n",
+ "\n",
+ " SELECT cs.ID AS comment_id, pr.base_repo_id AS project_id, 'pr_comment' AS source_tag\n",
+ " FROM comment_sentiment cs\n",
+ " JOIN pull_request_comments prc ON cs.ID = prc.comment_id\n",
+ " JOIN pull_requests pr ON prc.pull_request_id = pr.id\n",
+ "\n",
+ " UNION ALL\n",
+ "\n",
+ " SELECT cs.ID AS comment_id, pr.head_repo_id AS project_id, 'pr_comment' AS source_tag\n",
+ " FROM comment_sentiment cs\n",
+ " JOIN pull_request_comments prc ON cs.ID = prc.comment_id\n",
+ " JOIN pull_requests pr ON prc.pull_request_id = pr.id\n",
+ "),\n",
+ "labeled AS (\n",
+ " SELECT\n",
+ " cpr.comment_id,\n",
+ " cpr.source_tag,\n",
+ " pr.root_id,\n",
+ " (cpr.project_id = pr.root_id) AS is_canonical\n",
+ " FROM comment_project_rows cpr\n",
+ " JOIN project_root pr ON pr.project_id = cpr.project_id\n",
+ "),\n",
+ "comment_flags AS (\n",
+ " SELECT\n",
+ " root_id,\n",
+ " source_tag,\n",
+ " comment_id,\n",
+ " MAX(CASE WHEN is_canonical THEN 1 ELSE 0 END) AS has_canonical,\n",
+ " MAX(CASE WHEN NOT is_canonical THEN 1 ELSE 0 END) AS has_fork\n",
+ " FROM labeled\n",
+ " GROUP BY root_id, source_tag, comment_id\n",
+ "),\n",
+ "global_counts AS (\n",
+ " SELECT\n",
+ " COUNT(*) AS mapped_comment_ids,\n",
+ " SUM(CASE WHEN has_canonical = 1 THEN 1 ELSE 0 END) AS canonical_accessible,\n",
+ " SUM(CASE WHEN has_fork = 1 THEN 1 ELSE 0 END) AS fork_accessible,\n",
+ " SUM(CASE WHEN has_canonical = 1 AND has_fork = 0 THEN 1 ELSE 0 END) AS canonical_only,\n",
+ " SUM(CASE WHEN has_canonical = 0 AND has_fork = 1 THEN 1 ELSE 0 END) AS fork_only,\n",
+ " SUM(CASE WHEN has_canonical = 1 AND has_fork = 1 THEN 1 ELSE 0 END) AS both_sides\n",
+ " FROM comment_flags\n",
+ ")\n",
+ "SELECT\n",
+ " mapped_comment_ids,\n",
+ " canonical_accessible,\n",
+ " fork_accessible,\n",
+ " canonical_only,\n",
+ " fork_only,\n",
+ " both_sides,\n",
+ " ROUND(100 * fork_only / NULLIF(mapped_comment_ids, 0), 2) AS fork_only_pct,\n",
+ " ROUND(100 * canonical_only / NULLIF(mapped_comment_ids, 0), 2) AS canonical_only_pct,\n",
+ " ROUND(100 * (canonical_only + both_sides) / NULLIF(mapped_comment_ids, 0), 2) AS canonical_reachable_pct\n",
+ "FROM global_counts;\n",
+ "```\n",
+ "\n",
+ "Expected values from prior runs:\n",
+ "- `canonical_only`: 4555\n",
+ "- `fork_only`: 569\n",
+ "- `both_sides`: 2083\n",
+ "- Canonical reachable rate: about 93.2%"
+ ]
+ },
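+ {
+ "cell_type": "markdown",
+ "id": "e9c3a5b2",
+ "metadata": {},
+ "source": [
+ "The recursive CTE above walks `forked_from` links until it reaches each fork network's root. A standalone sketch of the same idea, using a tiny hypothetical project table (the ids and fork links are made up for illustration):\n",
+ "\n",
+ "```python\n",
+ "# Hypothetical projects: id -> forked_from (None = canonical root)\n",
+ "forked_from = {1: None, 2: 1, 3: 2, 4: None}\n",
+ "\n",
+ "def root_of(project_id):\n",
+ "    # Follow forked_from links upward until a project with no parent,\n",
+ "    # mirroring what the recursive project_root CTE computes per row\n",
+ "    while forked_from[project_id] is not None:\n",
+ "        project_id = forked_from[project_id]\n",
+ "    return project_id\n",
+ "\n",
+ "roots = {pid: root_of(pid) for pid in forked_from}\n",
+ "print(roots)  # {1: 1, 2: 1, 3: 1, 4: 4}\n",
+ "```\n",
+ "\n",
+ "A comment is \"canonical accessible\" when at least one of its project rows resolves to its own root id, and \"fork only\" when none do."
+ ]
+ },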
+ {
+ "cell_type": "markdown",
+ "id": "52c9ee7e",
+ "metadata": {},
+ "source": [
+ "### When to move on to Notebook 3\n",
+ "\n",
+ "You can move to Notebook 3 when all of these are true:\n",
+ "\n",
+ "1. Check 1 totals are consistent (commit + PR - overlap = 7122).\n",
+ "2. Check 2 returns project rows for commit-comment mappings (not empty).\n",
+ "3. Check 3 returns project rows for PR-comment mappings (not empty).\n",
+ "4. Check 4 runs successfully and shows non-zero canonical reachability.\n",
+ "\n",
+ "If any check is empty or fails, fix the data/join issue first before moving on."
+ ]
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/03_scale_config_files.ipynb b/03_scale_config_files.ipynb
new file mode 100644
index 0000000..2c82d59
--- /dev/null
+++ b/03_scale_config_files.ipynb
@@ -0,0 +1,403 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "6fa72e9d",
+ "metadata": {},
+ "source": [
+ "# Scale and Automate Config Generation (Notebook 3)\n",
+ "\n",
+ "This notebook generates Kaiaulu config files for each main project repo in the GHTorrent database.\n",
+ "\n",
+ "**What this notebook does:**\n",
+ "1. Queries MySQL/GHTorrent to identify canonical repos with sentiment-labeled comments\n",
+ "2. Generates a `.yml` config file per repo (using `trinitycore.yml` as a template) and writes them to Kaiaulu's `conf/` directory\n",
+ "\n",
+ "**What comes next** — once configs are written, use these Kaiaulu vignettes to download and parse comments:\n",
+ "- `vignettes/download_github_events.Rmd` → commit comments\n",
+ "- `vignettes/download_github_pull_request_comments.Rmd` → PR inline comments"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a18a63e8",
+ "metadata": {},
+ "source": [
+ "### Planned Output\n",
+ "\n",
+ "1. One `.yml` config file per main project repo in the GHTorrent database, written to Kaiaulu's `conf/` directory."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "622cb929",
+ "metadata": {},
+ "source": [
+ "### Step 1: Import Dependencies"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1bc36cfe",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import subprocess\n",
+ "from pathlib import Path\n",
+ "\n",
+ "import pandas as pd\n",
+ "import yaml\n",
+ "from sqlalchemy import create_engine, text"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c2d1ae1f",
+ "metadata": {},
+ "source": [
+ "### Step 2: Set Paths and Configuration\n",
+ "\n",
+ "Update the variables below before running:\n",
+ "- **`KAIAULU_REPO`** — path to your local Kaiaulu repo\n",
+ "- **`MYSQL_DB`** / **`MYSQL_PASSWORD`** — your database credentials\n",
+ "- **`MAX_REPOS`** — set to an integer to limit the number of repos processed, or `None` to process all\n",
+ "- **`WRITE_CONFIGS`** — set to `False` to do a dry run without writing any files"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "20fb9b60",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Paths\n",
+ "KAIAULU_REPO = (Path(\".\").resolve() / \"..\" / \"kaiaulu\").resolve()\n",
+ "\n",
+ "# Kaiaulu-owned inputs/outputs\n",
+ "CONF_DIR = KAIAULU_REPO / \"conf\"\n",
+ "TEMPLATE_PATH = CONF_DIR / \"trinitycore.yml\"\n",
+ "\n",
+ "# Repo selection cap (None = all main project repos)\n",
+ "MAX_REPOS = None\n",
+ "\n",
+ "# MySQL connection (override with env vars if needed)\n",
+ "MYSQL_HOST = os.getenv(\"MYSQL_HOST\", \"localhost\")\n",
+ "MYSQL_PORT = int(os.getenv(\"MYSQL_PORT\", \"3306\"))\n",
+ "MYSQL_DB = os.getenv(\"MYSQL_DB\", \"ADD_DB_NAME_HERE\")\n",
+ "MYSQL_USER = os.getenv(\"MYSQL_USER\", \"root\")\n",
+ "MYSQL_PASSWORD = os.getenv(\"MYSQL_PASSWORD\", \"ADD_PASSWORD_HERE\")\n",
+ "\n",
+ "# Toggle writing config files to Kaiaulu conf/\n",
+ "WRITE_CONFIGS = True"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "aa923139",
+ "metadata": {},
+ "source": [
+ "### Step 3: Query Canonical Repos from GHTorrent\n",
+ "\n",
+ "Queries MySQL to find main (non-fork) repos that have at least one sentiment-labeled comment (commit or PR). Results are loaded into `repos`.\n",
+ "\n",
+ "Expected output (~82 repos):\n",
+ "\n",
+ "| | owner | repo |\n",
+ "|---|---|---|\n",
+ "| 0 | akka | akka |\n",
+ "| 1 | antirez | redis |\n",
+ "| 2 | ariya | phantomjs |\n",
+ "| 3 | automapper | automapper |\n",
+ "| 4 | bartaz | impress.js |"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5641db76",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "canonical repos found: 82\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " owner | \n",
+ " repo | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " akka | \n",
+ " akka | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " antirez | \n",
+ " redis | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " ariya | \n",
+ " phantomjs | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " automapper | \n",
+ " automapper | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " bartaz | \n",
+ " impress.js | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " owner repo\n",
+ "0 akka akka\n",
+ "1 antirez redis\n",
+ "2 ariya phantomjs\n",
+ "3 automapper automapper\n",
+ "4 bartaz impress.js"
+ ]
+ },
+ "execution_count": 69,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Query canonical repos that have sentiment-labeled comments\n",
+ "engine = create_engine(\n",
+ " f'mysql+pymysql://{MYSQL_USER}:{MYSQL_PASSWORD}@{MYSQL_HOST}:{MYSQL_PORT}/{MYSQL_DB}'\n",
+ " )\n",
+ "\n",
+ "sql = \"\"\"\n",
+ "WITH RECURSIVE project_root AS (\n",
+ " SELECT p.id AS project_id, p.id AS root_id\n",
+ " FROM projects p\n",
+ " WHERE p.forked_from IS NULL\n",
+ " UNION ALL\n",
+ " SELECT c.id AS project_id, pr.root_id\n",
+ " FROM projects c\n",
+ " JOIN project_root pr ON c.forked_from = pr.project_id\n",
+ "),\n",
+ "comment_project_rows AS (\n",
+ " SELECT cs.ID AS comment_id, c.project_id\n",
+ " FROM comment_sentiment cs\n",
+ " JOIN commit_comments cc ON cs.ID = cc.comment_id\n",
+ " JOIN commits c ON cc.commit_id = c.id\n",
+ " UNION ALL\n",
+ " SELECT cs.ID AS comment_id, pr.base_repo_id AS project_id\n",
+ " FROM comment_sentiment cs\n",
+ " JOIN pull_request_comments prc ON cs.ID = prc.comment_id\n",
+ " JOIN pull_requests pr ON prc.pull_request_id = pr.id\n",
+ " UNION ALL\n",
+ " SELECT cs.ID AS comment_id, pr.head_repo_id AS project_id\n",
+ " FROM comment_sentiment cs\n",
+ " JOIN pull_request_comments prc ON cs.ID = prc.comment_id\n",
+ " JOIN pull_requests pr ON prc.pull_request_id = pr.id\n",
+ ")\n",
+ "SELECT DISTINCT LOWER(u.login) AS owner, LOWER(p.name) AS repo\n",
+ "FROM comment_project_rows cpr\n",
+ "JOIN project_root pr ON pr.project_id = cpr.project_id\n",
+ "JOIN projects p ON p.id = pr.root_id\n",
+ "JOIN users u ON u.id = p.owner_id\n",
+ "ORDER BY owner, repo\n",
+ "\"\"\"\n",
+ "\n",
+ "repos = pd.read_sql(text(sql), con=engine)\n",
+ "print('repos found:', len(repos))\n",
+ "repos.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "46104941",
+ "metadata": {},
+ "source": [
+ "### Step 4: Generate and Write Config Files\n",
+ "\n",
+ "Builds a `.yml` config file for each repo using `trinitycore.yml` as a template and writes it to Kaiaulu's `conf/` directory.\n",
+ "\n",
+ "Each config follows this structure:\n",
+ "```yaml\n",
+ "project:\n",
+ " website: https://github.com/{owner}/{repo}\n",
+ "issue_tracker:\n",
+ " github:\n",
+ " project_key_1:\n",
+ " owner: {owner}\n",
+ " repo: {repo}\n",
+ " issue_or_pr_comment: rawdata/github/{owner}/{repo}/issue_or_pr_comment/\n",
+ " issue_event: rawdata/github/{owner}/{repo}/issue_event/\n",
+ " commit: rawdata/github/{owner}/{repo}/commit/\n",
+ " commit_comments: rawdata/github/{owner}/{repo}/commit_comments/\n",
+ " pr_comments: rawdata/github/{owner}/{repo}/pr_comments/\n",
+ "```\n",
+ "\n",
+ "Expected output: a list of written `.yml` filenames, e.g. `['akka.yml', 'redis.yml', ...]`"
+ ]
+ },
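+ {
+ "cell_type": "markdown",
+ "id": "f4a7d8c9",
+ "metadata": {},
+ "source": [
+ "A minimal standalone sketch of the per-repo mapping described above, using a plain dict instead of the `trinitycore.yml` template (the `akka/akka` values come from the query output in Step 3; everything else mirrors the structure shown):\n",
+ "\n",
+ "```python\n",
+ "def sketch_conf(owner, repo):\n",
+ "    # Mirrors the config structure above; paths stay relative to the backend cwd\n",
+ "    base = f\"rawdata/github/{owner}/{repo}\"\n",
+ "    return {\n",
+ "        \"project\": {\"website\": f\"https://github.com/{owner}/{repo}\"},\n",
+ "        \"issue_tracker\": {\"github\": {\"project_key_1\": {\n",
+ "            \"owner\": owner,\n",
+ "            \"repo\": repo,\n",
+ "            \"issue_or_pr_comment\": f\"{base}/issue_or_pr_comment/\",\n",
+ "            \"issue_event\": f\"{base}/issue_event/\",\n",
+ "            \"commit\": f\"{base}/commit/\",\n",
+ "            \"commit_comments\": f\"{base}/commit_comments/\",\n",
+ "            \"pr_comments\": f\"{base}/pr_comments/\",\n",
+ "        }}},\n",
+ "    }\n",
+ "\n",
+ "conf = sketch_conf(\"akka\", \"akka\")\n",
+ "print(conf[\"issue_tracker\"][\"github\"][\"project_key_1\"][\"pr_comments\"])\n",
+ "# rawdata/github/akka/akka/pr_comments/\n",
+ "```\n",
+ "\n",
+ "The code cell below does the same thing, except it starts from the template file so unrelated template keys are preserved."
+ ]
+ },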
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7a926ed1",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "repos selected for config generation: 82\n",
+ "written configs: ['akka.yml', 'redis.yml', 'phantomjs.yml', 'automapper.yml', 'impress.js.yml', 'bitcoin.yml', 'boto.yml', 'craftbukkit.yml', 'cakephp.yml', 'compass.yml', 'clojure.yml', 'slim.yml', 'diaspora.yml', 'django-cms.yml', 'django.yml', 'django-debug-toolbar.yml', 'elasticsearch.yml', 'codeigniter.yml', 'facebook-android-sdk.yml', 'folly.yml', 'hiphop-php.yml', 'php-sdk.yml', 'tornado.yml', 'thinkup.yml', 'android.yml', 'gitlabhq.yml', 'html5-boilerplate.yml', 'devtools.yml', 'chosen.yml', 'sparkleshare.yml', 'octopress.yml', 'actionbarsherlock.yml', 'blueprint-css.yml', 'http-parser.yml', 'libuv.yml', 'node.yml', 'jquery.yml', 'requests.yml', 'beanstalkd.yml', 'libgit2.yml', 'ccv.yml', 'mangos.yml', 'd3.yml', 'memcached.yml', 'sick-beard.yml', 'flask.yml', 'jekyll.yml', 'mongo.yml', 'mono.yml', 'plupload.yml', 'three.js.yml', 'homebrew.yml', 'nancy.yml', 'storm.yml', 'netty.yml', 'openframeworks.yml', 'devise.yml', 'rails.yml', 'reddit.yml', 'restsharp.yml', 'kestrel.yml', 'shiny.yml', 'miniprofiler.yml', 'sbt.yml', 'scala.yml', 'scalatra.yml', 'phpunit.yml', 'servicestack.yml', 'signalr.yml', 'symfony.yml', 'paperclip.yml', 'trinitycore.yml', 'finagle.yml', 'flockdb.yml', 'gizzard.yml', 'zipkin.yml', 'redcarpet.yml', 'xbmc.yml', 'symfony.yml', 'knitr.yml', 'zf2.yml', 'foundation.yml']\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Build YAML configs for 82 project repos using trinitycore.yml as the base template\n",
+ "header_lines = [\n",
+ " \"# -*- yaml -*-\",\n",
+ " \"# https://github.com/sailuh/kaiaulu\",\n",
+ " \"#\",\n",
+ " \"# Copying and distribution of this file, with or without modification,\",\n",
+ " \"# are permitted in any medium without royalty provided the copyright\",\n",
+ " \"# notice and this notice are preserved. This file is offered as-is,\",\n",
+ " \"# without any warranty.\",\n",
+ " \"\",\n",
+ " \"# Project Configuration File #\",\n",
+ " \"#\",\n",
+ " \"# To perform analysis on open source projects, you need to manually\",\n",
+ " \"# collect some information from the project's website. As there is\",\n",
+ " \"# no standardized website format, this file serves to distill\",\n",
+ " \"# important data source information so it can be reused by others\",\n",
+ " \"# and understood by Kaiaulu.\",\n",
+ " \"#\",\n",
+ " \"# Please check https://github.com/sailuh/kaiaulu/tree/master/conf to\",\n",
+ " \"# see if a project configuration file already exists. Otherwise, we\",\n",
+ " \"# would appreciate if you share your curated file with us by sending a\",\n",
+ " \"# Pull Request: https://github.com/sailuh/kaiaulu/pulls\",\n",
+ " \"#\",\n",
+ " \"# Note, you do NOT need to specify this entire file to conduct analysis.\",\n",
+ " \"# Each R Notebook uses a different portion of this file. To know what\",\n",
+ " \"# information is used, see the project configuration file section at\",\n",
+ " \"# the start of each R Notebook.\",\n",
+ " \"#\",\n",
+ " \"# Please comment unused parameters instead of deleting them for clarity.\",\n",
+ " \"# If you have questions, please open a discussion:\",\n",
+ " \"# https://github.com/sailuh/kaiaulu/discussions\",\n",
+ " \"\",\n",
+ "]\n",
+ "\n",
+ "def build_conf(template, owner, repo):\n",
+ " conf = template.copy()\n",
+ " conf.setdefault(\"project\", {})\n",
+ " conf[\"project\"][\"website\"] = f\"https://github.com/{owner}/{repo}\"\n",
+ "\n",
+ " conf.setdefault(\"issue_tracker\", {})\n",
+ " conf[\"issue_tracker\"].setdefault(\"github\", {})\n",
+ " conf[\"issue_tracker\"][\"github\"].setdefault(\"project_key_1\", {})\n",
+ " conf[\"issue_tracker\"][\"github\"][\"project_key_1\"][\"owner\"] = owner\n",
+ " conf[\"issue_tracker\"][\"github\"][\"project_key_1\"][\"repo\"] = repo\n",
+ "\n",
+ " # Keep relative paths so data lands under backend cwd (sentiment_github_dataset)\n",
+ " base_path = f\"rawdata/github/{owner}/{repo}\"\n",
+ " conf[\"issue_tracker\"][\"github\"][\"project_key_1\"][\"issue_or_pr_comment\"] = f\"{base_path}/issue_or_pr_comment/\"\n",
+ " conf[\"issue_tracker\"][\"github\"][\"project_key_1\"][\"issue_event\"] = f\"{base_path}/issue_event/\"\n",
+ " conf[\"issue_tracker\"][\"github\"][\"project_key_1\"][\"commit\"] = f\"{base_path}/commit/\"\n",
+ " conf[\"issue_tracker\"][\"github\"][\"project_key_1\"][\"commit_comments\"] = f\"{base_path}/commit_comments/\"\n",
+ " conf[\"issue_tracker\"][\"github\"][\"project_key_1\"][\"pr_comments\"] = f\"{base_path}/pr_comments/\"\n",
+ " return conf\n",
+ "\n",
+ "with open(TEMPLATE_PATH, \"r\", encoding=\"utf-8\") as f:\n",
+ " template_conf = yaml.safe_load(f)\n",
+ "\n",
+ "if MAX_REPOS is None:\n",
+ " pilot = repos.copy()\n",
+ "else:\n",
+ " pilot = repos.head(MAX_REPOS)\n",
+ "\n",
+ "print(f\"repos selected for config generation: {len(pilot)}\")\n",
+ "\n",
+ "written = []\n",
+ "for row in pilot.itertuples(index=False):\n",
+ " owner = row.owner\n",
+ " repo = row.repo\n",
+ " target_path = CONF_DIR / f\"{repo}.yml\"\n",
+ " conf = build_conf(template_conf, owner, repo)\n",
+ " yaml_body = yaml.safe_dump(conf, sort_keys=False)\n",
+ " if WRITE_CONFIGS:\n",
+ " with open(target_path, \"w\", encoding=\"utf-8\") as out:\n",
+ " out.write(\"\\n\".join(header_lines))\n",
+ " out.write(\"\\n\")\n",
+ " out.write(yaml_body)\n",
+ " written.append(target_path.name)\n",
+ "print(\"written configs:\", written)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "oiibwri0k6",
+ "metadata": {},
+ "source": [
+ "### When to Move On to Notebook 4\n",
+ "\n",
+ "Move to Notebook 4 after all of the following are true:\n",
+ "\n",
+ "1. The 82 `.yml` files generated from Step 4 exist in Kaiaulu's `conf/` directory.\n",
+ "4. Spot-check a few configs to confirm the `owner`, `repo`, and `rawdata/` paths are populated correctly and follow the formatting indicated in Step 4."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": ".venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/04_download_commit_comments.ipynb b/04_download_commit_comments.ipynb
new file mode 100644
index 0000000..b26da40
--- /dev/null
+++ b/04_download_commit_comments.ipynb
@@ -0,0 +1,162 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "c9604aaa",
+ "metadata": {},
+ "source": [
+ "# Download Commit Comments with Kaiaulu (Notebook 4)\n",
+ "\n",
+ "This notebook shows how to download GitHub commit comments using Kaiaulu’s `download_github_events.Rmd` notebook in the `/vignettes` folder."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a8369dad",
+ "metadata": {},
+ "source": [
+ "### Planned Output\n",
+ "\n",
+ "1. A parsed commit-comments CSV saved to `rawdata/github/{owner}/{repo}/{owner}_{repo}_commit_comments.csv` in Kaiaulu"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b5ad8e21",
+ "metadata": {},
+ "source": [
+ "### Step 1: Confirm your working directory\n",
+ "\n",
+ "1. Open the Kaiaulu project in RStudio.\n",
+ "2. Run `getwd()` in the R console to check your current working directory.\n",
+ "3. If the directory is not Kaiaulu, set it with `setwd()` (for example, `setwd(\"~/Desktop/github/kaiaulu\")`)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "049737cf",
+ "metadata": {},
+ "source": [
+ "### Step 2: Create a personal access token\n",
+ "\n",
+ "This workflow makes many GitHub API requests, so you need a personal access token.\n",
+ "\n",
+ "Follow the [GitHub documentation](https://docs.github.com/en/free-pro-team@latest/github/authenticating-to-github/creating-a-personal-access-token#:~:text=Creating%20a%20token.%201%20Verify%20your%20email%20address%2C,able%20to%20see%20the%20token%20again.%20More%20items) and create a **classic** token:\n",
+ "\n",
+ "1. Go to **GitHub → Settings → Developer settings → Personal access tokens → Tokens (classic)**.\n",
+ "2. Select **Generate new token (classic)**.\n",
+ "3. Add a note (for example, \"Download GitHub commit + PR comments via Kaiaulu\").\n",
+ "4. Enable the `public_repo` scope for public repositories.\n",
+ "5. Generate the token, then copy and store it securely.\n",
+ "\n",
+ "Save the token in `~/.ssh/github_token` on your local machine."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "976db067",
+ "metadata": {},
+ "source": [
+ "### Step 3: Run `download_github_events.Rmd` chunks in RStudio\n",
+ "\n",
+ "Run the following chunks in **RStudio**. These chunks should already exist in `download_github_events.Rmd`.\n",
+ "\n",
+ "### Chunk 1: Set up dependencies\n",
+ "\n",
+ "---\n",
+ "```{r warning=FALSE,message=FALSE}\n",
+ "rm(list = ls())\n",
+ "require(kaiaulu)\n",
+ "require(data.table)\n",
+ "require(jsonlite)\n",
+ "require(knitr)\n",
+ "```\n",
+ "---\n",
+ "\n",
+ "### Chunk 2: Set required parameters\n",
+ "\n",
+ "Replace `kaiaulu.yml` with the `.yml` file for the project you want to process. You created these files in Step 4 of `03_scale_config_files.ipynb`.\n",
+ "\n",
+ "---\n",
+ "```{r}\n",
+ "conf <- parse_config(\"../conf/kaiaulu.yml\")\n",
+ "owner <- get_github_owner(conf, \"project_key_1\") # Has to match github organization (e.g. github.com/sailuh)\n",
+ "repo <- get_github_repo(conf, \"project_key_1\") # Has to match github repository (e.g. github.com/sailuh/perceive)\n",
+ "save_path_issue_or_pr_comments <- path.expand(get_github_issue_or_pr_comment_path(conf, \"project_key_1\"))\n",
+ "save_path_issue_event <- get_github_issue_event_path(conf, \"project_key_1\")\n",
+ "save_path_commit <- get_github_commit_path(conf, \"project_key_1\")\n",
+ "save_path_commit_comments <- get_github_commit_comment_path(conf, \"project_key_1\")\n",
+ "\n",
+ "# your file github_token contains the GitHub token API obtained in the steps above\n",
+ "token <- scan(\"~/.ssh/github_token\",what=\"character\",quiet=TRUE)\n",
+ "```\n",
+ "---\n",
+ "\n",
+ "### Chunk 3: Download Commit Comments\n",
+ "\n",
+ "This downloads commit-comment JSON files into `rawdata` in your current working directory. The runtime depends on how many comments the project has.\n",
+ "\n",
+ "**IMPORTANT:** This chunk uses `gh_next()` to fetch paginated results and expects `gh` version 1.2.0. If you see a `gh_next()` paging bug (for example, repeated writes to the same page), downgrade to `gh` 1.2.0.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "```{r Collect all project commit comments, eval = FALSE}\n",
+ "dir.create(save_path_commit_comments, recursive = TRUE, showWarnings = FALSE)\n",
+ "gh_response <- github_api_project_commit_comments(owner,repo,token)\n",
+ "github_api_iterate_pages(token,gh_response,save_path_commit_comments,prefix=\"commit_comments\")\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "### Chunk 4: Parse Commit Comments\n",
+ "\n",
+ "After all JSON files are downloaded, run the **Parsing Raw Data to Csv** chunk for commit comments. You should see a table named `all_commit_comments` in your R environment with columns such as `comment_id`, `commit_id`, `author_login`, `author_id`, `line`, `created_at`, and `updated_at`.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "```{r}\n",
+ "all_commit_comments <- lapply(list.files(save_path_commit_comments,full.names = TRUE),read_json)\n",
+ "all_commit_comments <- lapply(all_commit_comments,github_parse_project_commit_comments)\n",
+ "all_commit_comments <- rbindlist(all_commit_comments,fill=TRUE)\n",
+ "\n",
+ "kable(head(all_commit_comments))\n",
+ "\n",
+ "# Save the data table for commit comments as a CSV\n",
+ "out_csv <- file.path(dirname(save_path_commit_comments), paste0(owner, \"_\", repo, \"_commit_comments.csv\"))\n",
+ "data.table::fwrite(all_commit_comments, out_csv)\n",
+ "cat(\"Saved:\", out_csv, \"\\n\")\n",
+ "```\n",
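+ "\n",
+ "As a quick sanity check, you can read the CSV back and confirm the row count and key columns (a sketch using the objects defined in the chunk above):\n",
+ "\n",
+ "```{r eval = FALSE}\n",
+ "check <- data.table::fread(out_csv)\n",
+ "nrow(check) # should match nrow(all_commit_comments)\n",
+ "stopifnot(c(\"comment_id\", \"commit_id\", \"author_login\") %in% names(check))\n",
+ "```\n",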
+ "\n",
+ "---\n",
+ "\n",
+ "### Final Output\n",
+ "\n",
+ "Final output path:\n",
+ "`rawdata/github/{owner}/{repo}/{owner}_{repo}_commit_comments.csv`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "99981420",
+ "metadata": {},
+ "source": [
+ "### When to move on to Notebook 5\n",
+ "\n",
+ "Move to Notebook 5 after all of the following are true:\n",
+ "\n",
+ "1. The commit-comment JSON files have been downloaded successfully.\n",
+ "2. The parsed table `all_commit_comments` looks correct in RStudio.\n",
+ "3. The CSV file exists at:\n",
+ " `rawdata/github/{owner}/{repo}/{owner}_{repo}_commit_comments.csv`\n",
+ "4. Spot-check a few rows to confirm key fields (such as `comment_id`, `commit_id`, and `author_login`) are populated as expected."
+ ]
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/05_download_PR_inline_comments.ipynb b/05_download_PR_inline_comments.ipynb
new file mode 100644
index 0000000..3cbcbfe
--- /dev/null
+++ b/05_download_PR_inline_comments.ipynb
@@ -0,0 +1,186 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "79eaeb30",
+ "metadata": {},
+ "source": [
+ "# Download PR Inline Comments with Kaiaulu (Notebook 5)\n",
+ "\n",
+ "This notebook shows how to download pull request inline comments using Kaiaulu’s `download_github_pull_request_comments.Rmd` notebook in the `/vignettes` folder."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "edcea498",
+ "metadata": {},
+ "source": [
+ "### Planned Output\n",
+ "\n",
+ "1. A parsed PR inline-comments CSV saved to `rawdata/github/{owner}/{repo}/{owner}_{repo}_pr_inline_comments.csv` in Kaiaulu"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4f30bc9f",
+ "metadata": {},
+ "source": [
+ "Before starting, complete Steps 1 and 2 in `04_download_commit_comments.ipynb` (confirm working directory and create a GitHub personal access token).\n",
+ "\n",
+ "### Step 1: Run download_github_pull_request_comments.Rmd chunks in RStudio\n",
+ "\n",
+ "Run the following chunks in **RStudio**. These chunks should already exist in `download_github_pull_request_comments.Rmd`.\n",
+ "\n",
+ "### Chunk 1: Set up dependencies\n",
+ "\n",
+ "---\n",
+ "\n",
+ "```{r warning=FALSE,message=FALSE}\n",
+ "rm(list = ls())\n",
+ "require(kaiaulu)\n",
+ "require(data.table)\n",
+ "require(jsonlite)\n",
+ "require(knitr)\n",
+ "require(magrittr)\n",
+ "require(gt)\n",
+ "require(lubridate)\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "### Chunk 2: Set required parameters\n",
+ "\n",
+ "Replace `kaiaulu.yml` with the `.yml` file for the project you want to process. You created these files in Step 4 of Notebook 3 (`03_scale_config_files.ipynb`).\n",
+ "\n",
+ "---\n",
+ "\n",
+ "```{r warning=FALSE}\n",
+ "conf <- parse_config(\"../conf/kaiaulu.yml\")\n",
+ "owner <- get_github_owner(conf, \"project_key_1\") # Must match the GitHub organization (e.g. github.com/sailuh)\n",
+ "repo <- get_github_repo(conf, \"project_key_1\") # Must match the GitHub repository (e.g. github.com/sailuh/perceive)\n",
+ "\n",
+ "# Paths where the raw data will be saved.\n",
+ "save_path_pull_request <- get_github_pull_request_path(conf, \"project_key_1\")\n",
+ "save_path_pr_comments <- get_github_pr_comments_path(conf, \"project_key_1\")\n",
+ "save_path_issue_or_pr_comments <- get_github_issue_or_pr_comment_path(conf, \"project_key_1\")\n",
+ "save_path_pr_reviews <- get_github_pr_review_path(conf, \"project_key_1\")\n",
+ "save_path_pr_commits <- get_github_pr_commits_path(conf, \"project_key_1\")\n",
+ "save_path_pr_files <- get_github_pr_files_path(conf, \"project_key_1\")\n",
+ "\n",
+ "# Create all folder directories\n",
+ "#create_file_directory(conf)\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "### Chunk 3: Personal Access Token\n",
+ "\n",
+ "Point to the GitHub token created in Step 2 of Notebook 4.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "```{r Scan GitHub Token}\n",
+ "# the github_token file (a text file) contains the GitHub API token created in Notebook 4\n",
+ "token <- scan(\"~/.ssh/github_token\",what=\"character\",quiet=TRUE)\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "### Chunk 4: Download Pull Request In-Line Code Comments\n",
+ "\n",
+ "This chunk downloads PR inline-comment JSON files into `rawdata` in your current working directory. The runtime depends on how many comments the project has.\n",
+ "\n",
+ "**IMPORTANT:** This chunk uses `gh_next()` to fetch paginated results and expects `gh` version 1.2.0. If you see a `gh_next()` paging bug (for example, repeated writes to the same page), downgrade to `gh` 1.2.0.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "```{r Collect Comments from Pull Requests, eval = FALSE}\n",
+ "dir.create(save_path_pr_comments, recursive = TRUE, showWarnings = FALSE)\n",
+ "gh_response <- github_api_project_pull_request_inline_comments_refresh(owner, repo, token, save_path_pr_comments)\n",
+ "github_api_iterate_pages(token, gh_response, save_path_pr_comments, prefix=\"pr_comments\")\n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "### Chunk 5: Parse PR Inline Comments\n",
+ "\n",
+ "After all JSON files are downloaded, run the parse chunk for PR inline comments. You should see a table named `inline_comments` in your R environment with columns such as `review_id`, `comment_id`, `html_url`, `created_at`, `updated_at`, `comment_user_login`, `author_association`, `file_path`, `start_line`, `line`, `original_start_line`, `original_line`, `position`, `diff_hunk`, `body`, and `commit_id`.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "```{r Parse Comments from Pull Requests}\n",
+ "inline_comments <- lapply(list.files(save_path_pr_comments, full.names = TRUE), read_json)\n",
+ "inline_comments <- lapply(inline_comments, github_parse_project_pull_request_inline_comments)\n",
+ "inline_comments <- rbindlist(inline_comments, fill = TRUE)\n",
+ "head(inline_comments,2) %>%\n",
+ " gt(auto_align = FALSE) \n",
+ "```\n",
+ "\n",
+ "---\n",
+ "\n",
+ "If `fwrite` complains about list/`NULL` columns (common for line/position fields), copy this chunk and run it right after the parse chunk:\n",
+ "\n",
+ "```{r Create CSV for Parsed Comments}\n",
+ "as_char_or_na <- function(x) {\n",
+ " if (is.null(x) || length(x) == 0) return(NA_character_)\n",
+ " if (is.list(x)) {\n",
+ " return(vapply(x, function(e) {\n",
+ " if (is.null(e) || length(e) == 0) NA_character_ else as.character(e[[1]])\n",
+ " }, character(1)))\n",
+ " }\n",
+ " as.character(x)\n",
+ "}\n",
+ "as_int_or_na <- function(x) {\n",
+ " if (is.null(x) || length(x) == 0) return(NA_integer_)\n",
+ " if (is.list(x)) {\n",
+ " return(vapply(x, function(e) {\n",
+ " if (is.null(e) || length(e) == 0) NA_integer_ else suppressWarnings(as.integer(e[[1]]))\n",
+ " }, integer(1)))\n",
+ " }\n",
+ " suppressWarnings(as.integer(x))\n",
+ "}\n",
+ "\n",
+ "for (nm in intersect(c(\"file_path\",\"diff_hunk\",\"body\",\"html_url\",\"created_at\",\"updated_at\",\"comment_user_login\",\"author_association\",\"commit_id\"), names(inline_comments))) {\n",
+ " if (is.list(inline_comments[[nm]])) inline_comments[[nm]] <- as_char_or_na(inline_comments[[nm]])\n",
+ "}\n",
+ "for (nm in intersect(c(\"review_id\",\"comment_id\",\"start_line\",\"line\",\"original_start_line\",\"original_line\",\"position\"), names(inline_comments))) {\n",
+ " if (is.list(inline_comments[[nm]])) inline_comments[[nm]] <- as_int_or_na(inline_comments[[nm]])\n",
+ "}\n",
+ "\n",
+ "out_csv <- file.path(dirname(save_path_pr_comments), paste0(owner, \"_\", repo, \"_pr_inline_comments.csv\"))\n",
+ "data.table::fwrite(inline_comments, out_csv)\n",
+ "cat(\"Saved:\", out_csv, \"\\n\")\n",
+ "```\n",
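+ "\n",
+ "Once sentiment labels are available (for example, from the `comment_sentiment` table loaded in Notebook 1), `comment_id` can serve as the join key. A minimal sketch, assuming a `sentiment` data.table with `comment_id` and `polarity` columns (`polarity` is a hypothetical column name):\n",
+ "\n",
+ "```{r eval = FALSE}\n",
+ "# Left join: keep every inline comment, attach a label where one exists\n",
+ "labeled <- merge(inline_comments, sentiment[, .(comment_id, polarity)],\n",
+ "                 by = \"comment_id\", all.x = TRUE)\n",
+ "```\n",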
+ "\n",
+ "### Final Output\n",
+ "\n",
+ "Final output path:\n",
+ "`rawdata/github/{owner}/{repo}/{owner}_{repo}_pr_inline_comments.csv`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1e3807c9",
+ "metadata": {},
+ "source": [
+ "### Next Steps\n",
+ "\n",
+ "1. Run Notebooks 4 and 5 for each project configuration (`.yml`) you want to process.\n",
+ "2. Confirm that each run generates the expected commit-comment and PR inline-comment CSV outputs.\n",
+ "3. Use `comment_id` as the join key to transfer sentiment labels to both commit comments and PR inline comments."
+ ]
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}