Coding Platform - Architecture & Design

This document describes the system architecture, design decisions, and technical details of the ThemisDB Coding Platform.

📋 Table of Contents

  1. System Overview
  2. Component Architecture
  3. Data Flow
  4. Database Schema
  5. API Design
  6. Embedding Pipeline
  7. Scalability
  8. Security

1. System Overview

High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Client Layer                         │
│  ┌──────────────────┐  ┌──────────────┐  ┌───────────────┐ │
│  │  Tkinter Desktop │  │VSCode Ext    │  │  Web UI       │ │
│  │      App         │  │              │  │  (Future)     │ │
│  └────────┬─────────┘  └──────┬───────┘  └───────┬───────┘ │
└───────────┼────────────────────┼──────────────────┼─────────┘
            │                    │                  │
            └────────────────────┼──────────────────┘
                                 │ HTTP REST API
                                 │
┌────────────────────────────────▼─────────────────────────────┐
│                    Application Layer                         │
│  ┌──────────────────────────────────────────────────────┐   │
│  │             API Server (Flask/FastAPI)               │   │
│  │  - Authentication & Authorization                    │   │
│  │  - Request Validation                                │   │
│  │  - Rate Limiting                                     │   │
│  └──────────┬───────────────────────────────────────────┘   │
│             │                                                │
│  ┌──────────▼───────────┐  ┌─────────────────────────┐     │
│  │  Business Logic      │  │   Background Workers    │     │
│  │  - Snippet Manager   │  │  - Scraping Jobs        │     │
│  │  - Project Manager   │  │  - Embedding Generation │     │
│  │  - Search Engine     │  │  - Index Maintenance    │     │
│  └──────────┬───────────┘  └──────────┬──────────────┘     │
└─────────────┼───────────────────────────┼──────────────────┘
              │                           │
              └───────────┬───────────────┘
                          │
┌─────────────────────────▼──────────────────────────────────┐
│                    Data Layer                              │
│  ┌──────────────────────────────────────────────────────┐ │
│  │                  ThemisDB                            │ │
│  │  ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │ │
│  │  │ Vector Model │ │Document Model│ │ Graph Model │ │ │
│  │  │ - Embeddings │ │ - Snippets   │ │ - Relations │ │ │
│  │  │ - Similarity │ │ - Projects   │ │ - Deps      │ │ │
│  │  └──────────────┘ └──────────────┘ └─────────────┘ │ │
│  └──────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────┘

Technology Stack

Frontend:

  • Tkinter (Desktop GUI)
  • TypeScript + VSCode Extension API

Backend:

  • Python 3.8+
  • Flask/FastAPI (API Server)
  • Celery (Background Tasks)

Database:

  • ThemisDB (Multi-Model)
    • Vector Model (Code Embeddings)
    • Document Model (Snippets, Docs)
    • Graph Model (Dependencies, Relations)

ML/AI:

  • sentence-transformers (Embeddings)
  • microsoft/codebert-base (Code Embeddings)
  • tree-sitter (Code Parsing)

Web Scraping:

  • requests (HTTP)
  • beautifulsoup4 (HTML Parsing)
  • PyGithub (GitHub API)
  • selenium (JS-rendered pages)

2. Component Architecture

Desktop Application

main.py
├── UI Layer (ui/)
│   ├── MainWindow
│   │   ├── MenuBar
│   │   ├── Toolbar
│   │   └── TabWidget
│   ├── SnippetPanel
│   │   ├── SnippetList
│   │   ├── SnippetEditor
│   │   └── SnippetDetails
│   ├── ProjectPanel
│   │   ├── ProjectTree
│   │   ├── FileEditor
│   │   └── ProjectSettings
│   ├── SearchPanel
│   │   ├── SearchBar
│   │   ├── FilterPanel
│   │   └── ResultsList
│   └── ScraperPanel
│       ├── JobList
│       ├── JobConfig
│       └── JobProgress
│
├── Business Logic
│   ├── SnippetManager
│   ├── ProjectManager
│   ├── SearchEngine
│   └── ScraperManager
│
└── Data Access (themis_client.py)
    └── ThemisDBClient

VSCode Extension

extension.ts
├── Activation
│   ├── Configuration
│   ├── API Client
│   └── Providers Registration
│
├── Providers
│   ├── SnippetTreeProvider
│   │   └── TreeView Data
│   ├── CodeLensProvider
│   │   └── Similar Code Hints
│   ├── CompletionProvider
│   │   └── Auto-Suggestions
│   └── HoverProvider
│       └── Snippet Preview
│
├── Commands
│   ├── searchSnippets()
│   ├── insertSnippet()
│   ├── saveSnippet()
│   └── findSimilar()
│
└── API Client
    └── REST API Communication

Web Scraper

web_scraper.py
├── BaseScraper (Abstract)
│   ├── fetch_with_retry()
│   ├── parse_content()
│   ├── extract_code()
│   └── create_snippet()
│
├── GitHubScraper
│   ├── scrape_repository()
│   ├── scrape_gists()
│   └── search_and_scrape()
│
├── StackOverflowScraper
│   ├── scrape_questions()
│   ├── extract_code_from_answers()
│   └── scrape_user_answers()
│
└── DocsCrawler
    ├── crawl_documentation()
    ├── extract_code_examples()
    └── build_doc_graph()
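
The `fetch_with_retry()` method on `BaseScraper` is only named in the tree above. A minimal sketch of one way it could behave, assuming exponential backoff between attempts, follows. The parameter names (`max_attempts`, `base_delay`, `sleep`) and the standalone-function form are illustrative assumptions, not the project's actual signature.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or max_attempts is reached."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"fetch failed after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the helper testable without real delays; a scraper would typically wrap `requests.get(url)` in the `fetch` callable.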

Code Indexer

code_indexer.py
├── EmbeddingGenerator
│   ├── generate_embedding()
│   ├── batch_generate()
│   └── update_model()
│
├── SimilarityEngine
│   ├── find_similar()
│   ├── compute_similarity()
│   └── rank_results()
│
└── Deduplicator
    ├── detect_duplicates()
    ├── merge_similar()
    └── hash_code()
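
The `Deduplicator` entries above suggest hash-based duplicate detection. The sketch below shows one plausible reading: hash the whitespace-normalized code and group snippets that share a digest. The choice of SHA-256 and the function signatures are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List


def hash_code(code: str) -> str:
    """Hash code after collapsing whitespace, so formatting changes do not matter."""
    normalized = " ".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_duplicates(snippets: Dict[str, str]) -> List[List[str]]:
    """Group snippet ids whose normalized code hashes to the same digest."""
    groups: Dict[str, List[str]] = {}
    for snippet_id, code in snippets.items():
        groups.setdefault(hash_code(code), []).append(snippet_id)
    # Only groups with more than one member are duplicates
    return [ids for ids in groups.values() if len(ids) > 1]
```

Exact hashing only catches copies that differ in formatting; near-duplicates would still need the embedding-based `merge_similar()` path.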

3. Data Flow

Snippet Creation Flow

User Input
    │
    ▼
┌────────────────┐
│  UI Component  │
└────────┬───────┘
         │ validate()
         ▼
┌────────────────┐
│ SnippetManager │
└────────┬───────┘
         │ create_snippet()
         ▼
┌────────────────┐
│ CodeIndexer    │ ──→ generate_embedding()
└────────┬───────┘
         │ snippet + embedding
         ▼
┌────────────────┐
│ ThemisDB Client│ ──→ POST /api/v1/snippets
└────────┬───────┘
         │
         ▼
┌────────────────┐
│  ThemisDB      │
│  - Document    │ ──→ Store metadata, code
│  - Vector      │ ──→ Store embedding, build index
└────────┬───────┘
         │
         ▼
    Success
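
The creation flow above can be sketched as a single function: validate, embed, then write the document and the vector. The in-memory dictionaries stand in for the ThemisDB document and vector models, and the `embed` callable stands in for the Code Indexer; all names here are hypothetical.

```python
from typing import Callable, Dict, List
from uuid import uuid4


def create_snippet(
    title: str,
    code: str,
    language: str,
    embed: Callable[[str], List[float]],
    documents: Dict[str, dict],
    vectors: Dict[str, List[float]],
) -> str:
    """Validate input, generate an embedding, and store both parts."""
    # Step 1: validate before doing any expensive work
    if not title.strip():
        raise ValueError("title must not be empty")
    if not code.strip():
        raise ValueError("code must not be empty")

    # Step 2: generate the embedding (Code Indexer in the diagram)
    embedding = embed(code)

    # Step 3: store metadata and code in the document model,
    # and the embedding in the vector model, under the same id
    snippet_id = str(uuid4())
    documents[snippet_id] = {
        "id": snippet_id,
        "title": title,
        "code": code,
        "language": language,
    }
    vectors[snippet_id] = embedding
    return snippet_id
```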

Semantic Search Flow

User Query
    │
    ▼
┌────────────────┐
│  Search Panel  │
└────────┬───────┘
         │ query string
         ▼
┌────────────────┐
│ CodeIndexer    │ ──→ generate_query_embedding()
└────────┬───────┘
         │ query embedding
         ▼
┌────────────────┐
│ ThemisDB Client│ ──→ POST /api/v1/snippets/search
└────────┬───────┘
         │
         ▼
┌────────────────┐
│  ThemisDB      │
│  Vector Search │ ──→ HNSW / FAISS similarity
└────────┬───────┘
         │ ranked results
         ▼
┌────────────────┐
│  Search Panel  │ ──→ Display results
└────────────────┘

Web Scraping Flow

Scraping Job Config
    │
    ▼
┌────────────────┐
│ Scraper Panel  │ ──→ create_job()
└────────┬───────┘
         │
         ▼
┌────────────────┐
│ ScraperManager │ ──→ dispatch_to_worker()
└────────┬───────┘
         │
         ▼
┌────────────────┐
│ Background     │
│ Worker (Celery)│
└────────┬───────┘
         │
         ├──→ fetch_content()
         ├──→ parse_code()
         ├──→ detect_language()
         ├──→ generate_embedding()
         ├──→ check_duplicate()
         └──→ store_snippet()
              │
              ▼
         ┌────────────────┐
         │  ThemisDB      │
         └────────┬───────┘
                  │
                  ▼
         Update Job Status
              │
              ▼
         Notify UI

4. Database Schema

ThemisDB Collections

Snippets (Document Model)

{
  "collection": "snippets",
  "schema": {
    "id": "uuid",
    "title": "string",
    "description": "string",
    "code": "text",
    "language": "string",
    "framework": "string (optional)",
    "tags": ["string"],
    "metadata": {
      "author": "string",
      "source_url": "string",
      "source_type": "enum(github, stackoverflow, custom)",
      "license": "string",
      "stars": "integer",
      "created_at": "timestamp",
      "updated_at": "timestamp"
    },
    "stats": {
      "views": "integer",
      "copies": "integer",
      "likes": "integer"
    }
  },
  "indexes": [
    {"field": "language"},
    {"field": "framework"},
    {"field": "tags"},
    {"field": "metadata.created_at"}
  ]
}

Embeddings (Vector Model)

{
  "collection": "embeddings",
  "schema": {
    "snippet_id": "uuid (foreign key)",
    "vector": "float[] (768 dimensions)",
    "model": "string (e.g., codebert-base)",
    "created_at": "timestamp"
  },
  "vector_index": {
    "type": "HNSW",
    "dimensions": 768,
    "distance_metric": "cosine"
  }
}

Projects (Document Model)

{
  "collection": "projects",
  "schema": {
    "id": "uuid",
    "name": "string",
    "description": "string",
    "language": "string",
    "framework": "string",
    "files": [{
      "path": "string",
      "content": "text",
      "language": "string",
      "size": "integer"
    }],
    "structure": {
      "type": "tree",
      "root": "TreeNode"
    },
    "dependencies": ["string"],
    "readme": "text",
    "tags": ["string"],
    "metadata": {
      "source_url": "string",
      "source_type": "string",
      "stars": "integer",
      "forks": "integer"
    }
  }
}

Relations (Graph Model)

{
  "collection": "relations",
  "edges": [
    {
      "from": "snippet_id",
      "to": "snippet_id",
      "type": "similar_to",
      "weight": "float (similarity score)"
    },
    {
      "from": "snippet_id",
      "to": "project_id",
      "type": "part_of"
    },
    {
      "from": "project_id",
      "to": "project_id",
      "type": "depends_on"
    }
  ]
}
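
One use of the `depends_on` edges is resolving everything a project depends on, directly or indirectly. The sketch below does this with a breadth-first traversal over edge tuples held in memory. Representing edges as `(from, to, type)` tuples is an assumption for illustration; ThemisDB's own graph query interface is not shown in this document.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (from, to, type)


def transitive_dependencies(project_id: str, edges: Iterable[Edge]) -> List[str]:
    """Return all projects reachable via depends_on edges, in BFS order."""
    adjacency: Dict[str, List[str]] = {}
    for src, dst, edge_type in edges:
        if edge_type == "depends_on":
            adjacency.setdefault(src, []).append(dst)

    seen: Set[str] = {project_id}  # guards against dependency cycles
    order: List[str] = []
    queue = deque([project_id])
    while queue:
        current = queue.popleft()
        for dep in adjacency.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order
```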

5. API Design

RESTful API

Base URL: http://localhost:8080/api/v1

Snippets

# Search snippets (semantic)
POST /snippets/search
Content-Type: application/json
{
  "query": "async HTTP request",
  "language": "python",
  "limit": 10,
  "filters": {
    "framework": "aiohttp",
    "min_stars": 10
  }
}

# Get snippet
GET /snippets/:id

# Create snippet
POST /snippets
{
  "title": "...",
  "code": "...",
  "language": "python",
  "tags": ["..."]
}

# Update snippet
PUT /snippets/:id

# Delete snippet
DELETE /snippets/:id

# Find similar
POST /snippets/similar
{
  "code": "...",
  "language": "python",
  "limit": 5
}
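
For illustration, a client could build the semantic search request like this using only the standard library. The helper name and its parameters are hypothetical; the URL, method, and body fields come from the examples above.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://localhost:8080/api/v1"


def build_search_request(
    query: str,
    language: str,
    limit: int = 10,
    filters: Optional[dict] = None,
    token: Optional[str] = None,
) -> Request:
    """Build (but do not send) a POST /snippets/search request."""
    body = {"query": query, "language": language, "limit": limit}
    if filters:
        body["filters"] = filters
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return Request(
        f"{BASE_URL}/snippets/search",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen()` once an API server is running.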

Projects

# List projects
GET /projects?language=python&framework=fastapi

# Get project
GET /projects/:id

# Create project
POST /projects

# Import from GitHub
POST /projects/import
{
  "source": "github",
  "url": "https://github.com/user/repo"
}

Scraping Jobs

# Create job
POST /scraping/jobs
{
  "type": "github_repo",
  "url": "...",
  "config": {...}
}

# Get job status
GET /scraping/jobs/:id

# List jobs
GET /scraping/jobs?status=completed

# Cancel job
DELETE /scraping/jobs/:id

Response Format

Success:

{
  "status": "success",
  "data": {...},
  "meta": {
    "total": 100,
    "page": 1,
    "per_page": 10
  }
}

Error:

{
  "status": "error",
  "error": {
    "code": "INVALID_INPUT",
    "message": "Language must be specified",
    "details": {...}
  }
}
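
Both envelopes are simple enough to produce from two small helpers, sketched here. The function names are hypothetical, and omitting `meta` for non-paginated responses is an assumption; the field names follow the examples above.

```python
from typing import Any, Dict, Optional


def success_response(
    data: Any,
    total: Optional[int] = None,
    page: int = 1,
    per_page: int = 10,
) -> Dict[str, Any]:
    """Wrap data in the success envelope; add meta only for paginated lists."""
    body: Dict[str, Any] = {"status": "success", "data": data}
    if total is not None:
        body["meta"] = {"total": total, "page": page, "per_page": per_page}
    return body


def error_response(
    code: str,
    message: str,
    details: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Wrap an error code and message in the error envelope."""
    return {
        "status": "error",
        "error": {"code": code, "message": message, "details": details or {}},
    }
```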

6. Embedding Pipeline

Model Selection

# Primary Model: CodeBERT
MODEL = "microsoft/codebert-base"
DIMENSIONS = 768  # hidden size of codebert-base

# Alternative Models:
# - "microsoft/graphcodebert-base" (better for code structure)
# - "sentence-transformers/all-MiniLM-L6-v2" (faster, smaller)

Embedding Generation

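import numpy as np
from sentence_transformers import SentenceTransformer

class EmbeddingGenerator:
    def __init__(self):
        # codebert-base is not a native sentence-transformers model;
        # the library wraps it with mean pooling when loaded this way
        self.model = SentenceTransformer('microsoft/codebert-base')
    
    def generate(self, code: str, language: str) -> np.ndarray:
        # Preprocess code
        code = self.normalize_code(code, language)
        
        # Generate embedding
        embedding = self.model.encode(code)
        
        # Normalize (unit vector); leave a zero vector unchanged
        norm = np.linalg.norm(embedding)
        return embedding / norm if norm > 0 else embedding
    
    def normalize_code(self, code: str, language: str) -> str:
        # Remove comments
        code = self.remove_comments(code, language)
        
        # Normalize whitespace
        code = ' '.join(code.split())
        
        # Optional: Extract function signatures, variable names
        # for better semantic understanding
        
        return code
    
    def remove_comments(self, code: str, language: str) -> str:
        # Naive placeholder: drops full-line comments only. A real
        # implementation would use a parser such as tree-sitter.
        prefix = '#' if language in ('python', 'ruby') else '//'
        lines = [line for line in code.splitlines()
                 if not line.strip().startswith(prefix)]
        return '\n'.join(lines)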

Similarity Search

def find_similar(query_embedding: np.ndarray, limit: int = 10):
    # ThemisDB Vector Search; client is assumed to be a ThemisDBClient instance
    results = client.vector_search(
        collection="embeddings",
        query_vector=query_embedding,
        limit=limit,
        distance_metric="cosine"
    )
    
    # Results ranked by similarity
    return results
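
To make the ranking concrete, here is what a cosine-similarity search computes when done by brute force over in-memory vectors. This is a reference sketch only: the HNSW index inside ThemisDB returns approximately the same ordering without scanning every vector. Function names are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    limit: int = 10,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best matches."""
    scored = [(sid, cosine_similarity(query, vec)) for sid, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

Because the generator normalizes embeddings to unit length, the cosine similarity of two stored vectors reduces to their dot product.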

7. Scalability

Horizontal Scaling

API Server:

Load Balancer (nginx)
    │
    ├──→ API Server 1
    ├──→ API Server 2
    └──→ API Server 3

Background Workers:

Celery Workers (auto-scale)
    ├──→ Worker 1 (scraping)
    ├──→ Worker 2 (scraping)
    ├──→ Worker 3 (embedding)
    └──→ Worker 4 (embedding)

Database Sharding (ThemisDB):

Snippets by Language:
    Shard 1: Python, Ruby
    Shard 2: JavaScript, TypeScript
    Shard 3: Java, Kotlin, Scala
    Shard 4: C++, Rust, Go
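
The shard assignment above amounts to a lookup table. A small routing sketch follows; the hash-based fallback for languages not listed in the table is an assumption, since the document does not say where other languages go.

```python
import hashlib

# Mapping taken from the shard layout above (keys lowercased)
SHARD_BY_LANGUAGE = {
    "python": 1, "ruby": 1,
    "javascript": 2, "typescript": 2,
    "java": 3, "kotlin": 3, "scala": 3,
    "c++": 4, "rust": 4, "go": 4,
}


def shard_for(language: str, shard_count: int = 4) -> int:
    """Return the 1-based shard number for a snippet's language."""
    key = language.strip().lower()
    if key in SHARD_BY_LANGUAGE:
        return SHARD_BY_LANGUAGE[key]
    # Assumed fallback: a stable hash so unlisted languages always
    # land on the same shard across processes and restarts
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % shard_count + 1
```

A cryptographic hash is used instead of the built-in `hash()` because Python randomizes string hashes per process.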

Caching Strategy

Multi-Layer Cache:

Request
    │
    ▼
┌─────────────────┐
│  Redis Cache    │ ◄─── Most frequent queries (TTL: 5 min)
└─────────┬───────┘
          │ (miss)
          ▼
┌─────────────────┐
│  Local Cache    │ ◄─── Recent results (LRU, max 1000)
└─────────┬───────┘
          │ (miss)
          ▼
┌─────────────────┐
│  ThemisDB       │ ◄─── Source of truth
└─────────────────┘
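
The lookup pattern in the diagram (try each layer, fall through on a miss, load from the source of truth last) can be sketched generically. `LRUCache` models the local layer with the stated limit of 1000 entries; a Redis-backed layer would expose the same `get`/`set` pair and handle the TTL itself. Class and function names are illustrative.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, List, Optional


class LRUCache:
    """Small least-recently-used cache with a fixed capacity."""

    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict least recently used


def layered_get(key: Hashable, layers: List[Any], load: Callable[[Hashable], Any]) -> Any:
    """Check layers in order; on a hit, backfill the layers that missed."""
    for index, layer in enumerate(layers):
        value = layer.get(key)
        if value is not None:
            for earlier in layers[:index]:
                earlier.set(key, value)
            return value
    value = load(key)  # source of truth (ThemisDB in the diagram)
    for layer in layers:
        layer.set(key, value)
    return value
```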

Performance Optimization

Embeddings:

  • Batch generation (32-64 snippets at once)
  • GPU acceleration when available
  • Model quantization for faster inference

Vector Search:

  • HNSW index for approximate nearest-neighbor search in roughly O(log N) time
  • Pre-filtering by language/framework
  • Parallel search across shards
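
Batch generation mostly comes down to slicing the input into fixed-size chunks before calling the model. A minimal helper, with an assumed default of 32 taken from the lower end of the range above:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 32) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Each yielded list can be passed to the embedding model's batch interface in one call, which is where GPU acceleration pays off.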

8. Security

Authentication & Authorization

# JWT-based authentication
Authorization: Bearer <jwt_token>

# Token payload
{
  "user_id": "uuid",
  "email": "user@example.com",
  "roles": ["user", "admin"],
  "exp": 1640000000
}
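
To show what verifying such a token involves, the sketch below signs and checks a JWT using only the standard library. It assumes HS256 (HMAC-SHA256) signing, which this document does not specify; a production service would normally use a maintained JWT library rather than hand-rolled code.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that JWT encoding strips
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(payload: dict, secret: bytes) -> str:
    """Create a compact HS256 JWT for the given payload."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [
        _b64url(json.dumps(header, separators=(",", ":")).encode("utf-8")),
        _b64url(json.dumps(payload, separators=(",", ":")).encode("utf-8")),
    ]
    signing_input = ".".join(parts)
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the payload if the signature and expiry are valid, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking signature bytes via timing
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if payload.get("exp", 0) <= current:
        raise ValueError("token expired")
    return payload
```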

Access Control

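from dataclasses import dataclass
from typing import Literal, Optional
from uuid import UUID

# Snippet visibility
@dataclass
class Snippet:
    visibility: Literal["public", "private", "team"]
    owner_id: UUID
    team_id: Optional[UUID] = None

# Access rules (user is assumed to expose id and team_id)
def can_access(user, snippet: Snippet) -> bool:
    if snippet.visibility == "public":
        return True
    if snippet.visibility == "private":
        return user.id == snippet.owner_id
    if snippet.visibility == "team":
        # A missing team_id must never match another missing team_id
        return snippet.team_id is not None and user.team_id == snippet.team_id
    # Unknown visibility values are denied by default
    return False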

Input Validation

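from typing import List

# Pydantic v1-style validators; Pydantic v2 renames the decorator to field_validator
from pydantic import BaseModel, validator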

class SnippetCreate(BaseModel):
    title: str
    code: str
    language: str
    tags: List[str]
    
    @validator('code')
    def code_not_empty(cls, v):
        if not v.strip():
            raise ValueError('Code cannot be empty')
        return v
    
    @validator('language')
    def valid_language(cls, v):
        allowed = ['python', 'javascript', 'java', ...]
        if v not in allowed:
            raise ValueError(f'Language must be one of {allowed}')
        return v

Rate Limiting

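from flask import Flask, request
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)

def rate_limit_key():
    # Key on the bearer token; fall back to the client IP so that
    # unauthenticated requests do not all share a single bucket
    return request.headers.get('Authorization') or get_remote_address()

limiter = Limiter(
    key_func=rate_limit_key,
    app=app,
    default_limits=["100 per hour"]
)

@app.route('/api/v1/snippets/search', methods=['POST'])
@limiter.limit("20 per minute")
def search_snippets():
    pass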

Security Best Practices

  • XSS Prevention: Sanitize all user input and escape it when rendering
  • SQL Injection: Use parameterized queries
  • CSRF: Use CSRF tokens for state-changing operations
  • Secrets Management: Use environment variables, never hardcode
  • HTTPS: Enforce TLS for all API communication
  • Logging: Log all access attempts, audit trail

📚 Design Decisions

Why Multi-Model Database (ThemisDB)?

  1. Vector Model: Essential for semantic code search
  2. Document Model: Flexible schema for diverse code structures
  3. Graph Model: Natural fit for dependencies and relationships

Why CodeBERT for Embeddings?

  • Pre-trained on code (6 programming languages)
  • Better understanding of code semantics vs general text models
  • Good balance between quality and speed

Why Tkinter for Desktop App?

  • Standard library (no external dependencies)
  • Cross-platform (Windows, macOS, Linux)
  • Simple and lightweight
  • Sufficient for proof-of-concept

Future Improvements

  • Web Interface: React/Vue.js for broader accessibility
  • Real-time Collaboration: WebSocket for live code sharing
  • AI Code Generation: LLM integration for code completion
  • Plugin System: Extensible architecture for custom scrapers

Questions? GitHub Discussions