A comprehensive system for scraping, scoring, and ranking GitHub contributors across multiple repositories with social media enrichment capabilities.
- Multi-Repository Tracking: Track contribution activity across any number of public GitHub repositories
- Weighted Scoring: Configurable scoring system for commits, PRs, reviews, issues, and comments
- Dual Leaderboards: Per-repository rankings AND global master leaderboard aggregating all repos
- Profile Enrichment: Automatic discovery of contributor social media profiles and contact information
- BigQuery Integration: Efficient historical data retrieval from GitHub Archive (2011-present)
- Real-time Monitoring Dashboard: Live logs, job status, global & per-repo leaderboards
- Background Processing: Celery workers for async data ingestion and processing
- Pagination: All leaderboards support pagination to handle large datasets
See docs/pipeline.md for detailed Mermaid diagrams of the full data pipeline.
| Component | Technology |
|---|---|
| API | Python FastAPI |
| Database | PostgreSQL 15+ |
| Cache/Queue | Redis |
| Job Processing | Celery |
| Historical Data | Google BigQuery (GitHub Archive) |
| Real-time Data | GitHub REST API |
| Monitoring | WebSocket-based live dashboard |
- Python 3.11+
- Docker (for PostgreSQL and Redis)
- GitHub Personal Access Token
- Google Cloud Project with BigQuery access
- Clone the repository:

```bash
git clone https://github.com/MichaelTheMay/github-contributor-leaderboard.git
cd github-contributor-leaderboard
```

- Create a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -e ".[dev]"
```

- Set up environment variables:

```bash
cp .env.example .env
# Edit .env with your configuration
```

- Start PostgreSQL and Redis:

```bash
docker run -d --name postgres-leaderboard \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=leaderboard \
  -p 5433:5432 postgres:15-alpine

docker run -d --name redis-leaderboard \
  -p 6379:6379 redis:7-alpine
```

- Initialize the database:

```bash
python scripts/init_db.py
```

- Start the API server:

```bash
uvicorn src.api.app:app --reload
```

- Open the dashboard: Navigate to http://localhost:8000 in your browser.

```bash
cd docker
docker-compose up -d
```

For fully automated operation via GitHub Actions:
- Set up a cloud PostgreSQL database (see docs/cloud-database-setup.md)
- Configure GitHub Secrets:
  - `DATABASE_URL` - Cloud PostgreSQL connection string
  - `BIGQUERY_PROJECT` - Google Cloud project ID
  - `BIGQUERY_CREDENTIALS` - Service account JSON content
  - `GH_PAT` - GitHub Personal Access Token
The scheduled pipeline runs daily at 2 AM UTC and automatically:
- Refreshes stale repositories
- Recalculates leaderboards
- Enriches top 50 unenriched contributors
Manual actions available via Actions > Scheduled Pipeline > Run workflow:
- `full-pipeline` - Complete daily pipeline
- `enrich-top-contributors` - Enrich N users on demand
- `enrichment-backfill` - Backfill top N% of contributors
| Variable | Description | Required |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | Yes |
| `REDIS_URL` | Redis connection string | Yes |
| `GITHUB_TOKEN` | GitHub Personal Access Token | Yes |
| `BIGQUERY_PROJECT` | Google Cloud project ID | Yes |
| `GOOGLE_APPLICATION_CREDENTIALS` | Path to GCP service account JSON | Yes |
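A minimal `.env` might look like the following. All values are placeholders; the PostgreSQL port matches the Docker command in Quick Start, and the exact connection-string scheme (e.g. an async driver suffix) may differ, so treat `.env.example` as the authority:

```dotenv
DATABASE_URL=postgresql://postgres:postgres@localhost:5433/leaderboard
REDIS_URL=redis://localhost:6379/0
GITHUB_TOKEN=ghp_your_token_here
BIGQUERY_PROJECT=my-gcp-project
GOOGLE_APPLICATION_CREDENTIALS=secrets/bigquery-service-account.json
```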
Create a Personal Access Token with these scopes:
- `repo` - Full control of private repositories
- `read:user` - Read user profile data
- `read:org` - Read organization membership
- Create a Google Cloud project
- Enable the BigQuery API
- Create a service account with roles:
- BigQuery User
- BigQuery Job User
- Download the service account JSON key
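The steps above can also be scripted with standard `gcloud` commands. This is a sketch: the project ID (`my-gcp-project`), service-account name (`leaderboard-bq`), and key path are placeholders you should adjust:

```bash
# Enable BigQuery and create a dedicated service account (names are placeholders)
gcloud services enable bigquery.googleapis.com --project my-gcp-project
gcloud iam service-accounts create leaderboard-bq --project my-gcp-project

# Grant the two roles listed above
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:leaderboard-bq@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.user"
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:leaderboard-bq@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"

# Download the JSON key into the gitignored secrets/ directory
gcloud iam service-accounts keys create secrets/bigquery-service-account.json \
  --iam-account=leaderboard-bq@my-gcp-project.iam.gserviceaccount.com
```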
The monitoring dashboard provides real-time visibility into the system:
- Live Statistics: Active jobs, repository count, processed events, errors
- Job Monitoring: View all scrape jobs with status, progress, and timing
- Repository List: All tracked repositories with status indicators
- Global Leaderboard: View top contributors across all repositories
- Live Logs: WebSocket-powered real-time log streaming with filtering
The pipeline control center lets you manage the entire data pipeline:
- Add Repositories: Track new GitHub repositories
- Scrape Data: Fetch contribution data from BigQuery
- Calculate Scores: Apply scoring formula to all entries
- Aggregate Global: Build master leaderboard from all repos
- Enrichment: Discover contributor social media profiles
- Budget Tracking: Monitor BigQuery usage and costs
- Navigate to http://localhost:8000/dashboard/pipeline
- Add a repository (e.g., `pytorch/pytorch`)
- Click Scrape to fetch historical data from BigQuery
- Click Aggregate Global to build the master leaderboard
- Go to Contributor Enrichment section
- Click Top 0.1% to enrich top contributors
- View enriched data in the interactive table
- Debug: Verbose debugging information
- Info: General operational messages
- Warning: Potential issues that don't stop operation
- Error: Failures that require attention
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/v1/repositories` | Add a repository to track |
| GET | `/api/v1/repositories` | List all tracked repositories |
| GET | `/api/v1/repositories/{owner}/{name}` | Get repository details |
| DELETE | `/api/v1/repositories/{owner}/{name}` | Remove repository |
| POST | `/api/v1/repositories/{owner}/{name}/refresh` | Trigger data refresh |
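For example, adding a repository to track from Python. The JSON body shape (`{"owner": ..., "name": ...}`) is an assumption, not confirmed by the source; check the Pydantic models under `src/api/schemas/` for the real field names:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

# Hypothetical request body; field names are a guess at the schema.
payload = json.dumps({"owner": "pytorch", "name": "pytorch"}).encode()

req = urllib.request.Request(
    f"{BASE}/api/v1/repositories",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With the API server running, send it:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```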
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/v1/leaderboard/global` | Global leaderboard |
| GET | `/api/v1/leaderboard/{owner}/{name}` | Repository leaderboard |
| GET | `/api/v1/leaderboard/compare` | Compare multiple repositories |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/v1/contributors/{username}` | Full contributor profile |
| GET | `/api/v1/contributors/{username}/activity` | Contribution history |
| GET | `/api/v1/contributors/{username}/repositories` | Per-repo breakdown |
| POST | `/api/v1/contributors/{username}/enrich` | Trigger enrichment |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/v1/config/scoring` | Get scoring weights |
| PUT | `/api/v1/config/scoring` | Update scoring weights |
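An illustrative body for `PUT /api/v1/config/scoring`. The key names below are guesses mirroring the Default Scoring table; the authoritative field names live in the API's Pydantic schemas:

```json
{
  "release_published": 30,
  "pr_merged": 25,
  "pr_reviewed": 20,
  "pr_opened": 15,
  "commit": 10,
  "issue_opened": 8,
  "issue_closed": 5,
  "pr_review_comment": 5,
  "comment": 3
}
```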
| Method | Endpoint | Description |
|---|---|---|
| GET | `/dashboard` | Monitoring dashboard UI |
| GET | `/dashboard/stats` | Current system statistics |
| GET | `/dashboard/errors` | Recent error list |
| WS | `/ws/logs` | WebSocket for live logs |
Default scoring weights:
| Event Type | Points | Description |
|---|---|---|
| Release Published | 30 | Major milestone |
| PR Merged | 25 | Code successfully integrated |
| PR Reviewed | 20 | Code review contribution |
| PR Opened | 15 | Initiative to contribute |
| Commit | 10 + line bonus | Direct code contribution |
| Issue Opened | 8 | Bug reports, feature requests |
| Issue Closed | 5 | Resolution of issues |
| PR Review Comment | 5 | Detailed feedback |
| Comment | 3 | Discussion participation |
Line bonus: +0.1 per line added, +0.05 per line deleted
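As a concrete sketch of how a single event's score combines the base weight with the line bonus (a minimal reimplementation of the table above, not the project's actual scoring code):

```python
# Default weights from the scoring table above.
WEIGHTS = {
    "release_published": 30,
    "pr_merged": 25,
    "pr_reviewed": 20,
    "pr_opened": 15,
    "commit": 10,
    "issue_opened": 8,
    "issue_closed": 5,
    "pr_review_comment": 5,
    "comment": 3,
}

def score_event(event_type: str, lines_added: int = 0, lines_deleted: int = 0) -> float:
    """Base weight for the event, plus the line bonus for commits."""
    score = float(WEIGHTS[event_type])
    if event_type == "commit":
        # Line bonus: +0.1 per line added, +0.05 per line deleted.
        score += 0.1 * lines_added + 0.05 * lines_deleted
    return score

print(score_event("commit", lines_added=100, lines_deleted=40))  # 10 + 10 + 2 = 22.0
```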
```
github-contributor-leaderboard/
├── src/
│   ├── api/              # FastAPI application
│   │   ├── routes/       # API endpoints + dashboard
│   │   ├── schemas/      # Pydantic models
│   │   └── templates/    # Dashboard HTML
│   ├── core/             # Configuration
│   ├── db/               # Database layer
│   │   ├── models/       # SQLAlchemy models
│   │   └── repositories/ # Data access
│   ├── services/         # Business logic
│   ├── workers/          # Celery tasks
│   └── enrichment/       # Profile enrichment
├── tests/
│   ├── unit/
│   └── integration/
├── migrations/           # Alembic migrations
├── docker/               # Docker configuration
├── scripts/              # Utility scripts
└── secrets/              # Credentials (gitignored)
```
Start Celery workers for background processing:
```bash
# Start worker
celery -A src.workers.celery_app worker --loglevel=info

# Start scheduler (for periodic tasks)
celery -A src.workers.celery_app beat --loglevel=info
```

- `scrape_repository`: Full historical data ingestion from BigQuery
- `incremental_scrape`: Fetch new events since last scrape
- `recalculate_leaderboard`: Recompute rankings after new data
- `enrich_contributor`: Gather social media profiles
```bash
# Run tests
pytest

# Linting
ruff check .

# Type checking
mypy src

# Format code
ruff format .
```

```bash
# Create new migration
alembic revision --autogenerate -m "description"

# Apply migrations
alembic upgrade head

# Rollback
alembic downgrade -1
```

The primary data source for historical contribution data. Contains all public GitHub events since 2011.
- Dataset: `githubarchive.day.*`
- Cost: ~$5-20 per large query
- Delay: ~1 hour from real-time
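A sketch of the kind of query the ingestion step might issue against GitHub Archive. The table wildcard and column names (`type`, `actor.login`, `repo.name`) follow the public `githubarchive` schema; the date range and repository are placeholders, and this is not necessarily the query the project actually runs:

```sql
-- Count public events per actor for one repository over January 2024.
SELECT
  actor.login AS username,
  type AS event_type,
  COUNT(*) AS events
FROM `githubarchive.day.2024*`
WHERE _TABLE_SUFFIX BETWEEN '0101' AND '0131'
  AND repo.name = 'pytorch/pytorch'
GROUP BY username, event_type
ORDER BY events DESC
```

Note that wildcard queries over `githubarchive.day.*` scan every matching daily table, which is where the per-query cost above comes from; constraining `_TABLE_SUFFIX` keeps the scan (and the bill) bounded.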
Used for:
- Repository validation and metadata
- User profile information
- Real-time data when needed
- Profile README parsing for enrichment
Rate Limit: 5,000 requests/hour with authentication
The Pipeline Control dashboard at `/dashboard/pipeline` provides a unified interface for managing the entire data pipeline:
Enrich contributor profiles with social media and contact information:
- Single User Enrichment: Enrich a specific GitHub user by username
- Batch Enrichment: Enrich top N contributors by global score
- Quick Actions: One-click buttons for top 0.1%, 0.5%, 1%, 5%
- Progress Tracking: Real-time stats showing enrichment coverage
The enrichment system extracts data from 30+ platforms:
| Category | Platforms |
|---|---|
| Social Media | Twitter/X, LinkedIn, Bluesky, Mastodon, Threads, Instagram, Facebook |
| Developer | GitHub Sponsors, GitLab, npm, PyPI, CodePen, Dribbble, Behance |
| Video/Streaming | YouTube, Twitch, TikTok |
| Community | Reddit, Hacker News, Stack Overflow, Discord, Telegram, Keybase |
| Professional | Personal websites, Substack, Medium, dev.to, Kaggle |
- GitHub Profile API: Email, Twitter, company, website, bio, location
- Profile README: Badges, links, shields.io badges, social icons
- Fallback APIs (optional): LinkedIn enrichment via Proxycurl
The dashboard shows real-time coverage statistics:
- Total users enriched vs. unenriched
- Breakdown by status (complete, partial, pending, failed)
- Count of each social platform found
- Last 24h activity metrics
Export enriched contributor data to CSV with all social columns for external analysis.
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests and linting
- Submit a pull request
