https://github.com/sailuh/sentiment_classifier
Purpose
Explore the GHTorrent 2004 schema to identify all tables and columns that can provide contextual information about the sentiment-labeled comments from the GitHub Gold Standard dataset.
Background
This work connects the GitHub Gold Standard sentiment dataset to the GHTorrent 2004 database to add contextual information (e.g., project data, user data, timestamps). The contextual data will enable Kaiaulu to re-download 2004-2025 project data for temporal expansion.
Related: See Kaiaulu issue #226 for the comment schema structure that Kaiaulu expects.
Current Schema
What we're starting with:
- GitHub Gold Standard dataset loaded into MySQL as the comment_sentiment table (per SQL queries provided by @carlosparadis in Discussion Post #6)
- Has 7,122 sentiment-labeled comments
- Columns: ID (comment ID), Polarity (sentiment), Text (comments)
- Key connection: comment_sentiment.ID maps to commit_comments.comment_id in the GHTorrent database
Here are the first few rows from the GitHub Gold Standard dataset:

What we're connecting to:
- GHTorrent has project/user/timestamp data spread across multiple tables
- Current JOIN path: comment_sentiment → commit_comments → commits → projects
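The JOIN path can be sketched as follows. This is a minimal illustration, not the project's actual query: it uses an in-memory SQLite database as a stand-in for the MySQL instance, every row is made up, and every column name other than comment_sentiment.ID, Polarity, Text, and commit_comments.comment_id is an assumption about the GHTorrent schema that should be checked against the real tables.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL instance; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comment_sentiment (ID INTEGER, Polarity TEXT, Text TEXT);
-- Column names below (other than comment_id) are assumed, not confirmed.
CREATE TABLE commit_comments (id INTEGER, commit_id INTEGER, user_id INTEGER,
                              comment_id INTEGER, created_at TEXT);
CREATE TABLE commits (id INTEGER, sha TEXT, project_id INTEGER);
CREATE TABLE projects (id INTEGER, url TEXT, name TEXT);

INSERT INTO comment_sentiment VALUES (101, 'positive', 'Nice fix!');
INSERT INTO commit_comments VALUES (1, 10, 7, 101, '2013-01-01 00:00:00');
INSERT INTO commits VALUES (10, 'abc123', 500);
INSERT INTO projects VALUES (500, 'https://api.github.com/repos/example/demo', 'demo');
""")

# comment_sentiment -> commit_comments -> commits -> projects
JOIN_PATH = """
SELECT cs.ID, cs.Polarity, cc.created_at, c.sha, p.url
FROM comment_sentiment AS cs
JOIN commit_comments AS cc ON cc.comment_id = cs.ID
JOIN commits AS c ON c.id = cc.commit_id
JOIN projects AS p ON p.id = c.project_id
"""
rows = conn.execute(JOIN_PATH).fetchall()
print(rows)
```

Each hop adds one piece of context: the comment timestamp from commit_comments, the commit hash from commits, and the project URL from projects.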
This is the GHTorrent schema structure (retrieved here):

What we need to produce:
- Kaiaulu-compatible format (see issue #226)
- Contextual dataset with enough info for Kaiaulu to re-download project data
Process
- Examine table structures, relationships, and sample data in the GHTorrent 2004 database
- Focus on tables that can link to comment data (projects, users, commits, issues, pull requests)
- Analyze join behavior (INNER vs LEFT JOIN)
- Document findings in the sentiment_github_dataset repository
- Iterate with @carlosparadis to determine which columns are most relevant
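One way to analyze join behavior is to compare row counts under INNER and LEFT JOIN and list the labeled comments that find no match. The sketch below again assumes an in-memory SQLite stand-in with made-up rows. Only comment_sentiment.ID and commit_comments.comment_id come from the description above; the other columns are illustrative.

```python
import sqlite3

# In-memory SQLite stand-in; all rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comment_sentiment (ID INTEGER, Polarity TEXT, Text TEXT);
CREATE TABLE commit_comments (id INTEGER, commit_id INTEGER, comment_id INTEGER);
INSERT INTO comment_sentiment VALUES
    (101, 'positive', 'Nice fix!'),
    (102, 'negative', 'This breaks the build.'),
    (103, 'neutral', 'Updated the docs.');
-- Comment 102 deliberately has no matching row in commit_comments.
INSERT INTO commit_comments VALUES (1, 10, 101), (2, 11, 103);
""")

# INNER JOIN keeps only labeled comments that have a GHTorrent match.
inner_count = conn.execute("""
    SELECT COUNT(*) FROM comment_sentiment AS cs
    JOIN commit_comments AS cc ON cc.comment_id = cs.ID
""").fetchone()[0]

# LEFT JOIN keeps every labeled comment, matched or not.
left_count = conn.execute("""
    SELECT COUNT(*) FROM comment_sentiment AS cs
    LEFT JOIN commit_comments AS cc ON cc.comment_id = cs.ID
""").fetchone()[0]

# Labeled comments that the INNER JOIN would silently drop.
unmatched_ids = [row[0] for row in conn.execute("""
    SELECT cs.ID FROM comment_sentiment AS cs
    LEFT JOIN commit_comments AS cc ON cc.comment_id = cs.ID
    WHERE cc.comment_id IS NULL
    ORDER BY cs.ID
""")]

print(inner_count, left_count, unmatched_ids)
```

A gap between the two counts shows how many labeled comments lose their context at that hop. A LEFT JOIN count above the number of labeled comments would instead point to duplicate comment_id values on the GHTorrent side.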
Task List
- Comment tables (commit_comments, issue_comments, pull_request_comments)
- Project tables (projects, repositories)
- User tables (users) and their relationship to comments