Bot Detector Research Project

Academic research project for classifying automated HTTP clients (bots, LLMs, crawlers) vs real browsers using transport-level fingerprinting.

Version: 1.1.1 | Changelog | Methodology

Performance Highlights

Localhost (10s, 50 concurrent connections; Go server only, no nginx in front):

| Mode | RPS | RPM | Latency avg |
|------|-----|-----|-------------|
| HTTP (no TLS) | ~11,550 | ~693K | ~4.3 ms |
| HTTPS (TLS fingerprinting, JA3/JA4/JA4H) | ~8,210 | ~493K | ~6.1 ms |

Over the network (patched nginx + TLS termination + X-FP-* at edge, 10s, 50 concurrent, HTTPS):

| Endpoint | RPS | RPM | Latency avg |
|----------|-----|-----|-------------|
| GET / (classify) | ~2,640 | ~158K | ~18.9 ms |
| GET /health | ~4,242 | ~255K | ~11.8 ms |

Summary (10s, c=50, same host): localhost runs at 8–11K RPS with 4–6 ms average latency. Over the network (nginx/TLS): classify ~2.6K RPS at ~18.9 ms, health ~4.2K RPS at ~11.8 ms, so classification adds ~7 ms on top of transport. With Redis and on-the-fly request metrics on the same host: ~1,620 RPS at 30.9 ms — versus classify without Redis, that is a ~39% throughput drop and ~12 ms extra per request on endpoints that build request_metrics.

Project Goal

Create a single HTTP endpoint that classifies clients as browser or bot based exclusively on:

  • TLS handshake patterns (JA3/JA4 fingerprinting)
  • HTTP/2 negotiation behavior
  • Header structure and semantics
  • Request patterns

No JavaScript challenges, no rate limiting — pure network fingerprinting.

Current Status

Phase 1 [COMPLETED] — TLS + HTTP fingerprinting: ClientHello capture, JA3/JA4/JA4H, TLS and HTTP signals in scoring, JA4H↔HTTP consistency (evasion detection), HTTPS server mode.

Phase 2 — HTTP/2: H2 fingerprint consumed from proxy (X-FP-H2); SETTINGS/PRIORITY/window come from nginx modules at the edge (e.g. nginx-http2-fingerprint) and are used in classification when present. No low-level H2 parsing in Go. Planned: H2/H3 ratio tracking. See docs/nginx.md and Methodology → Phase 2.

Phase 3 — Inconsistency detection: spatial (JA4H vs HTTP, TLS/HTTP version mismatch) in place. Planned: temporal inconsistency (same client, changing FPs), header–UA validation. See Methodology → Phase 3.

Redis & behavioural (Appendix L, M) — Optional Redis (REDIS_URL): (1) Challenge store — nonce→User-Agent stored in Redis so multiple instances share state; (2) Behavioural metrics — request counts and timestamps per IP and per __ch_nonce (sliding window). When behavioral_edges are set in scoring config, the classifier adds bot score for rate and inter-arrival conditions (Appendix M). /debug returns request_metrics for the current request. See docs/deploy/README.md, Methodology Appendix L, config/README.md.

See CHANGELOG.md for detailed release notes.

Architecture

Direct TLS (Go terminates HTTPS):

client → TLS listener (Go) → fingerprint collector → classifier → response

Via nginx (TLS termination at edge, fingerprint via headers):

client → nginx (TLS + JA3 + H2 fingerprint) → proxy_pass → Go (HTTP :8080, X-FP-* headers) → collector → classifier → response

See docs/nginx.md and Methodology Appendix F.

Tech Stack

  • Core: Go (HTTP/2 server, TLS fingerprinting, classification)
  • Analytics: Python (log analysis, pattern extraction). Request log statistics: tools/python/request_log_stats.py aggregates JSONL logs: top-N by path, method, IP, user agent, JA3/JA4/JA4H, headers; bot/browser breakdown; scoring-signal prevalence; optional significance filter (√N). See tools/python/README.md and Methodology Appendix J.
  • Logging: Structured JSON logs per day (logs/requests_YYYYMMDD.jsonl) for research analysis

Project Structure

.
├── cmd/
│   └── server/          # HTTP server entry point
├── internal/
│   ├── config/          # Scoring config loader (JSON → classifier + fingerprint)
│   ├── fingerprint/     # TLS/HTTP signal collection
│   ├── classifier/      # Rule-based classification
│   ├── logger/          # Structured JSON logging
│   ├── metrics/         # Behavioral metrics (Redis: per IP, per __ch_nonce; Appendix L)
│   └── server/          # HTTP handlers (challenge store, Redis wiring)
├── config/              # Scoring config (scoring.json, scoring.default.json, README)
├── tests/
│   ├── integration/     # Automated client tests
│   ├── unit/            # Unit tests
│   └── testdata/        # Test stubs (e.g. ja4db_fixture.json, reference_*.json)
├── tools/
│   ├── benchmark/       # HTTP benchmark tool
│   ├── python/          # Analytics tools
│   └── shell/           # Integration test scripts
├── internal/fingerprint/data/  # JA4 DB path (ja4db.json downloaded on first start if missing)
├── logs/                # JSON traffic logs (requests_YYYYMMDD.jsonl per day)
└── docs/                # Research documentation

Classification Signals

TLS Level

  • Full ClientHello capture via custom TLS listener
  • JA3/JA4 fingerprint hashing
  • ALPN negotiation (h2, http/1.1)
  • Cipher suite count and complexity (15+ suggests browser)
  • TLS extensions count (10+ suggests browser)
  • Supported versions, signature schemes, elliptic curve groups
  • Session ticket and early data support
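The count-based TLS signals above translate directly into scoring rules. A hedged sketch — the 15+/10+ thresholds come from the list above, but the point values and names here are made up for illustration; the real weights live in config/scoring.json:

```go
package main

import "fmt"

// clientHelloSummary holds the coarse TLS signals listed above.
// Illustrative field names.
type clientHelloSummary struct {
	CipherSuites int
	Extensions   int
	HasALPN      bool
	HasSNI       bool
}

// tlsScore adds browser points for rich ClientHellos and bot points
// for missing SNI/ALPN (the absence signals). Points are placeholders.
func tlsScore(ch clientHelloSummary) (browser, bot int) {
	if ch.CipherSuites >= 15 {
		browser += 2 // browsers offer large, varied suite lists
	}
	if ch.Extensions >= 10 {
		browser += 2 // browsers send many extensions (GREASE etc.)
	}
	if !ch.HasSNI {
		bot += 2 // scripted clients often omit SNI
	}
	if !ch.HasALPN {
		bot += 1
	}
	return browser, bot
}

func main() {
	b, bo := tlsScore(clientHelloSummary{CipherSuites: 16, Extensions: 18, HasALPN: true, HasSNI: true})
	fmt.Println(b, bo)
}
```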

HTTP Level

  • HTTP/2 vs HTTP/1.1; HTTP/2 fingerprint (SETTINGS, PRIORITY, window) when provided by proxy
  • JA4H fingerprinting (HTTP fingerprint from JA4+ family)
  • Header order and structure; browser-specific headers (sec-fetch-*, accept-language); header count and entropy
  • Cross-signal consistency: JA4H vs HTTP; TLS vs User-Agent (known library/browser JA3/JA4); H2 vs JA4 (ALPN); TLS ALPN vs HTTP version (direct TLS)
  • Absence signals (direct TLS only): missing SNI or ALPN when TLS is available scores toward bot; optional browser bonus when no smoking-gun bot signals fire (see config/README.md)

Research Workflow

  1. Collect: Run server, generate traffic (curl, browsers, LLM tools)
  2. Log: All requests logged as structured JSON to daily files (logs/requests_YYYYMMDD.jsonl)
  3. Analyze: Run request_log_stats.py on JSONL logs for top-N by path/method/IP/fingerprint and scoring-signal prevalence; see Methodology Appendix J
  4. Iterate: Update classification heuristics based on findings
  5. Test: Automated integration tests validate behavior

Getting Started

Prerequisites

  • Go 1.22+ — download installers for Windows, macOS, Linux, or install via a package manager (e.g. winget install GoLang.Go, brew install go, apt install golang-go). Ensure go is on your PATH.
  • Go tools directory in PATH — add $HOME/go/bin (default when Go is installed in the usual way). Required so task and golangci-lint are found after go install. Using this explicit path avoids errors when go cannot read the current directory (e.g. after sudo su). Do not install the task or taskwarrior apt/snap packages (they are different programs).
  • TLS certificate and key (for HTTPS mode)

Installation

# Clone repository
git clone https://github.com/muliwe/go-client-classifier.git
cd go-client-classifier

# Install dependencies and dev tools
go mod tidy
go install github.com/go-task/task/v3/cmd/task@latest
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest

# Ensure Go bin is on PATH (required for `task` and `golangci-lint`)
# Use explicit path so it works even when current directory has permission issues (e.g. after sudo su)
export PATH=$PATH:$HOME/go/bin

# To make it permanent, add the same line to your shell profile and reload:
echo 'export PATH=$PATH:$HOME/go/bin' >> ~/.bashrc && source ~/.bashrc   # bash

TLS Certificate Setup

For TLS fingerprinting to work, the server must run in HTTPS mode. Place your certificate and key in the certs/ directory:

certs/
├── server.crt
└── server.key

Note: The certs/ directory is in .gitignore — certificates are not committed to the repository.

To generate a self-signed certificate for local development:

# Create certs directory
mkdir certs

# Generate self-signed certificate (valid for 1 year)
openssl req -x509 -newkey rsa:4096 -keyout certs/server.key -out certs/server.crt \
  -days 365 -nodes -subj "/CN=localhost"

Add the certificate to your system's trusted certificates for browser testing without warnings.

Using Let's Encrypt (certbot) — for a public hostname with HTTPS:

# Install certbot (Ubuntu/Debian)
sudo apt install certbot

# Obtain a certificate (standalone mode: port 80 must be free for the challenge)
sudo certbot certonly --standalone -d your-domain.example.com

# Certbot stores certs under /etc/letsencrypt/live/<domain>/
# Point the server at them via env or symlink into certs/:
#   TLS_CERT=/etc/letsencrypt/live/your-domain.example.com/fullchain.pem
#   TLS_KEY=/etc/letsencrypt/live/your-domain.example.com/privkey.pem
# Or copy/symlink into project certs/ (ensure deploy user can read; certbot files are root-readable):
sudo cp /etc/letsencrypt/live/your-domain.example.com/fullchain.pem certs/server.crt
sudo cp /etc/letsencrypt/live/your-domain.example.com/privkey.pem certs/server.key
sudo chown $(whoami) certs/server.crt certs/server.key

Renewal: certbot can renew via sudo certbot renew (e.g. from cron or systemd timer). After renewal, restart the Go server so it reloads the certs.

Development

# Build binary
task build

# Run server (HTTP mode, no TLS fingerprinting)
task run

# Run server with HTTPS (required for TLS fingerprinting)
task run:tls

# Run tests
task test

# Run linter
task lint

# Format code
task fmt

# Run all checks (fmt, lint, test)
task check

# List all available tasks
task --list

Build

# Build binary to bin/server
task build

# Or manually
go build -o bin/server ./cmd/server

# Run the binary
./bin/server

Optional: JA4 dictionary (deploy)

The server uses a JA4 fingerprint database (ja4db.com) for TLS vs User-Agent consistency. If the file is absent, the server downloads it itself on first use (saved to internal/fingerprint/data/ja4db.json when running from repo root). No manual step is required for basic runs.

For deployment, you can optionally download the dictionary manually (e.g. to avoid first-request latency or when the host has no outbound HTTPS):

# From repo root; creates internal/fingerprint/data/ja4db.json
curl -o internal/fingerprint/data/ja4db.json "https://ja4db.com/api/read/"

Or with PowerShell:

Invoke-WebRequest -Uri "https://ja4db.com/api/read/" -OutFile "internal/fingerprint/data/ja4db.json" -UseBasicParsing

Ensure the directory exists (mkdir -p internal/fingerprint/data or New-Item -ItemType Directory -Force -Path internal/fingerprint/data). Override path with env JA4DB_PATH if you place the file elsewhere.

Testing

# Run all tests
task test

# Run tests (short mode)
task test:short

# Test with curl (HTTP mode)
curl http://localhost:8080/

# Test with curl (HTTPS mode)
curl https://localhost:8443/

# Test health endpoint
curl http://localhost:8080/health
curl https://localhost:8443/health

Integration Tests

Run integration tests against a running server using curl:

# HTTP mode
task run                    # Start server (terminal 1)
task integration            # Run tests (terminal 2)

# HTTPS mode (TLS fingerprinting)
task run:tls                # Start HTTPS server (terminal 1)
task integration:tls        # Run tests with --insecure (terminal 2)

# Custom base URL
task integration BASE_URL=http://localhost:3000
task integration:tls BASE_URL=https://localhost:8443

Benchmark

Run HTTP performance benchmark against a running server. You can pass a URL to test different routes (e.g. /, /health, /debug).

# Start server
task run:tls                # HTTPS mode (terminal 1)

# Run benchmark (terminal 2)
task bench:tls              # Default URL: https://localhost:8443/, 10s, 10 concurrent

# Pass URL to test a specific path (variable or positional after --)
task bench:tls URL=https://localhost:8443/debug
task bench:tls -- https://localhost:8443/health

# Custom duration and concurrency
task bench:tls DURATION=30s CONCURRENCY=50

# HTTP mode (default URL: http://localhost:8080/)
task bench
task bench URL=http://localhost:8080/health DURATION=10s CONCURRENCY=10
task bench -- http://localhost:8080/

Benchmark output includes RPS, RPM, and latency statistics (avg/min/max).

The integration tests automatically detect the OS and use:

  • tools/shell/integration_test.ps1 for Windows (PowerShell)
  • tools/shell/integration_test.sh for Unix (Linux/macOS)

Tests verify:

  • GET /health — health check endpoint returns {"status":"ok"}
  • GET / — classify endpoint returns classification
  • GET /debug — debug endpoint returns fingerprint data
  • curl is correctly detected as bot

Endpoints

| Endpoint | Description |
|----------|-------------|
| GET / | Classify client as browser or bot |
| GET /health | Health check |
| GET /debug | Debug info with full fingerprint (dev only) |

Example API response (GET /):

{
  "classification": "browser",
  "confidence": "0.95",
  "message": "You appear to be using a browser",
  "request_id": "uuid",
  "timestamp": "2026-02-18T12:00:00Z",
  "version": "0.10.0"
}

(confidence is a string with 2 decimal places to avoid float instability.)
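That fixed-precision formatting is one line in Go. The helper name below is illustrative, not the project's real function:

```go
package main

import (
	"fmt"
	"strconv"
)

// formatConfidence renders a score as a fixed two-decimal string, so
// API clients never see float noise like 0.9500000000000001.
func formatConfidence(c float64) string {
	return strconv.FormatFloat(c, 'f', 2, 64)
}

func main() {
	fmt.Println(formatConfidence(0.95), formatConfidence(1.0/3.0))
}
```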

Log Format

Each request is logged as one JSON line (JSONL) with full fingerprint data. Log files are written by day in UTC: logs/requests_YYYYMMDD.jsonl (e.g. logs/requests_20260217.jsonl). The server rotates to a new file automatically when the date changes.

{
  "timestamp": "2026-02-12T12:40:35Z",
  "request_id": "uuid",
  "classification": "browser",
  "confidence": 0.99,
  "fingerprint": {
    "tls": {
      "version": "TLS 1.3",
      "cipher_suites_count": 16,
      "extensions_count": 18,
      "ja3_hash": "9b0d79d10808bc0e509b4789f870a650",
      "ja4_hash": "t13d1516h2_8daaf6152771_d8a2da3f94cd",
      "supported_groups": ["GREASE", "x25519", "secp256r1", "secp384r1"]
    },
    "http": {
      "version": "HTTP/2.0",
      "header_count": 14
    }
  },
  "signals": {
    "browser_score": 18,
    "bot_score": 0,
    "score_breakdown": "BROWSER[http2(+2) sec-fetch(+3) ...] BOT[]"
  },
  "score": 18
}

Production deploy

You can run the service on Ubuntu as a systemd unit: one process listens on both HTTP and HTTPS, and restarts on failure or after a reboot.

1. Build the Linux binary

On your dev machine or in CI:

task build:prod

The binary will be at bin/server. Copy it to the server (e.g. /opt/go-client-classifier/).

2. Certificates

Place the certificate and key in the app directory, for example:

/opt/go-client-classifier/
├── server          # binary
├── certs/
│   ├── server.crt
│   └── server.key
└── logs/           # created automatically

3. systemd unit file

Create /etc/systemd/system/go-client-classifier.service:

[Unit]
Description=Go Client Classifier (bot detector)
After=network.target

[Service]
Type=simple
User=deploy
Group=deploy
WorkingDirectory=/opt/go-client-classifier
ExecStart=/opt/go-client-classifier/server
Restart=always
RestartSec=5

# HTTP :8080, HTTPS :8443
Environment=PORT=8080
Environment=TLS_PORT=8443
Environment=TLS_CERT=/opt/go-client-classifier/certs/server.crt
Environment=TLS_KEY=/opt/go-client-classifier/certs/server.key

# Optional: Redis — challenge store + behavioural metrics (Appendix L). If unset, challenge store is in-memory and metrics are not collected.
# Environment=REDIS_URL=redis://127.0.0.1:6379/0
# Environment=CHALLENGE_TTL_SEC=120

# Optional: enable PROXY protocol on TLS port (when nginx stream uses proxy_protocol on → real client IP in logs)
# Environment=PROXY_PROTOCOL=1

# Raise open-file limit (default 1024 can cause SSL/connection failures under load)
LimitNOFILE=65535

# Optional: disable request logging, only health/debug
# Environment=DEBUG=false

[Install]
WantedBy=multi-user.target

Redis (optional) — If you use REDIS_URL, ensure Redis is running. Check and install:

# Check: expect PONG
redis-cli ping

# Install if missing (Debian/Ubuntu)
command -v redis-server >/dev/null 2>&1 || { sudo apt-get update && sudo apt-get install -y redis-server; }
# Or RHEL/Rocky/Fedora
command -v redis-server >/dev/null 2>&1 || { sudo dnf install -y redis && sudo systemctl enable --now redis; }

# After install: ensure it runs and responds (service name: redis on RHEL/Ubuntu 24+, redis-server on older Debian/Ubuntu)
sudo systemctl enable --now redis 2>/dev/null || sudo systemctl enable --now redis-server
redis-cli ping

Replace User=deploy and Group=deploy with the user and group that should run the service. Ensure that user can read the binary, certs/, and write to logs/ (e.g. chown -R deploy:deploy /opt/go-client-classifier).

Alternatively, put variables in a file: create /opt/go-client-classifier/.env (or environment.conf) and add EnvironmentFile=/opt/go-client-classifier/.env to the unit.

4. Enable and start

sudo systemctl daemon-reload
sudo systemctl enable go-client-classifier
sudo systemctl start go-client-classifier
sudo systemctl status go-client-classifier

Verify: curl http://localhost:8080/health and curl -k https://localhost:8443/health.

Viewing logs in real time

  • Service output (stdout/stderr: startup message, per-request console line, errors):
    journalctl -u go-client-classifier -f
  • Request log file (JSONL, one line per classify request):
    tail -f /opt/go-client-classifier/logs/requests_$(date +%Y%m%d).jsonl
    Or from the app directory: tail -f logs/requests_*.jsonl (follows all matching files; the newest is today’s).

Note: Any request that hits the classify handler (including non-root paths like /not-known) is classified and written to the JSONL and console logs; only GET / returns 200 JSON, other paths return 404. GET /health and GET /debug are handled by other handlers and are not logged. If the log stays empty, check journalctl -u go-client-classifier -f for the "Logs:" path at startup and any "Error logging result" messages.

Environment variables

| Variable | Description | Example |
|----------|-------------|---------|
| PORT | HTTP port | 8080 |
| TLS_PORT | HTTPS port (when using TLS) | 8443 |
| PROXY_PROTOCOL | PROXY protocol on TLS (if nginx has proxy_protocol on) | true |
| TLS_CERT | Path to certificate file | certs/server.crt |
| TLS_KEY | Path to key file | certs/server.key |
| DEBUG | Enable /debug endpoint | true / false |
| SCORING_CONFIG | Path to scoring JSON (points, thresholds, classifier) | config/scoring.json |

If only TLS_CERT and TLS_KEY are set (no TLS_PORT), the service runs in HTTPS-only mode on PORT.

Scoring config — All scoring points, thresholds, classifier weight and confidence parameters are read from a single JSON file at startup. Path: SCORING_CONFIG or default config/scoring.json. If the file is missing or invalid, built-in defaults are used. Tuning (e.g. reducing false bots for incognito) is done via the config without code changes. See config/README.md for the schema, smoking guns (+3), strong/weak bot signals, and zero-point (easily spoofable) signals; config/scoring.default.json is the reference default.

Research Questions

  1. Can transport-level signals reliably distinguish browsers from automation?
  2. Which signals are most predictive?
  3. How do sophisticated bots (headless Chrome) behave?
  4. What are the false positive/negative rates?

Pre-commit Hooks

The project uses git pre-commit hooks for code quality:

  • Format check (go fmt)
  • Linter (golangci-lint)
  • Tests (go test)

Hooks run automatically before each commit.

Documentation

  • CHANGELOG.md — version history and release notes
  • config/README.md — scoring config schema, smoking guns, weak/zero signals, thresholds
  • docs/METHODOLOGY.md — research methodology, signals, scoring algorithm, references; Appendix J — request log statistics and collection methodology
  • docs/nginx.md — nginx setup for TLS termination, HTTP/2 fingerprint (X-FP-H2), JA3 (X-FP-JA3); Go consumes headers and uses H2/JA3 in cross-validation (Appendix G)
  • tools/python/README.md — Python tools: request_log_stats (aggregate JSONL), antibot_test

License

MIT (Academic Research)

Authors

Research project for academic purposes.