Python tools

Helper Python scripts for the go-client-classifier project (antibot bypass tests, request-log statistics, dashboard payload, and manual assessment sampling).

Requirements

Poetry (dependencies are declared in tools/python/pyproject.toml).

Environment setup

From the repo root:

cd tools/python
poetry install

Or from any directory:

poetry install --directory tools/python

After this step, the environment with all dependencies is ready to use.

Running scripts

Run all commands from tools/python after poetry install, or via poetry run:

cd tools/python
poetry run python antibot_test.py

Or activate the shell and run scripts as usual:

cd tools/python
poetry shell
python antibot_test.py

Linters (pre-commit)

From tools/python you can run the same checks as pre-commit (task check, trailing whitespace, end-of-file-fixer, check-yaml, check-added-large-files, black, isort, mypy, autoflake, ruff) over the whole repo:

cd tools/python
poetry run lint

This runs pre-commit from the repo root, so the result is identical to running pre-commit run --all-files there. The black and isort configuration lives in the repo-root pyproject.toml and in tools/python/pyproject.toml; the two files share line_length and the isort profile = "black" so the tools don’t conflict.
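A shared configuration of this kind might look like the following fragment (the values are illustrative; the actual settings live in the two pyproject.toml files mentioned above):

```toml
# Illustrative fragment: keep black and isort agreeing on line length
# and import style in both pyproject.toml files (values are examples).
[tool.black]
line-length = 100

[tool.isort]
profile = "black"
line_length = 100
```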

Contents

  • antibot_test.py — antibot detection bypass check via curl_cffi (TLS/HTTP2 fingerprint as Chrome/Safari). Dependency: curl-cffi.

  • build_dashboard_payload.py — builds the dashboard JSON consumed by the TS dashboard (tools/ts/dashboard). Reads JSONL request logs in the same format as request_log_stats.py (classification, timestamp, signals.score_breakdown, optional request_metrics for behavioural signals). Output: windows (hour, day, week, month, all); a timeline of 60 fixed bars whose granularity (10 s, 1 min, or 10 min) is chosen by the median total per bar: if the median falls below the threshold at 10 s, the script coarsens to 1 min and then 10 min so bars stay visible when traffic is sparse; signals (transport signals from score_breakdown plus behavioural ones: req_per_min, gap_median, gap_std_mean, gap_mean_median); and timeline_bucket_sec and timeline_window_sec for the UI. Records without a timestamp are skipped.

    poetry run python build_dashboard_payload.py "logs/**/requests_*.jsonl"
    poetry run python build_dashboard_payload.py -o dashboard.json "logs/**/*.jsonl"
    poetry run python build_dashboard_payload.py --timeline-minutes 10 "logs/**/requests_*.jsonl"
    poetry run python build_dashboard_payload.py --progress "logs/**/requests_*.jsonl"

    Options: -o / --output — output file (default: stdout); --timeline-minutes — timeline window in minutes for the initial 10 s granularity (default: 10); --progress — show tqdm progress bar (default: simple stderr log “Reading N file(s)…” / “Read M records.”).
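The granularity selection described above can be sketched roughly as follows (the threshold value and helper name are illustrative assumptions, not the script’s actual API):

```python
# Illustrative sketch of the timeline granularity selection: start at
# 10 s buckets and coarsen to 1 min, then 10 min, whenever the median
# total per bar falls below a visibility threshold. The threshold and
# function name are assumptions, not build_dashboard_payload.py's API.
from statistics import median

CANDIDATE_BUCKETS_SEC = [10, 60, 600]  # 10 s, 1 min, 10 min
MIN_MEDIAN_PER_BAR = 3                 # hypothetical visibility threshold
BARS = 60                              # fixed number of timeline bars

def pick_bucket_sec(timestamps: list[float], now: float) -> int:
    """Return the bucket size whose 60-bar timeline stays visible."""
    for bucket in CANDIDATE_BUCKETS_SEC:
        window = bucket * BARS
        counts = [0] * BARS
        for ts in timestamps:
            age = now - ts
            if 0 <= age < window:
                counts[BARS - 1 - int(age // bucket)] += 1
        if median(counts) >= MIN_MEDIAN_PER_BAR:
            return bucket
    # Sparse traffic: fall back to the coarsest granularity.
    return CANDIDATE_BUCKETS_SEC[-1]
```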

  • request_log_stats.py — statistics over request logs (JSONL) for the bot-detection methodology: top-N by fields (path, method, IP, user_agent, accept, JA3/JA4/JA4H, headers), bot/browser split, scoring-signal prevalence, and a global summary (unique IPs/URLs). Metrics in the spirit of Cloudflare Signals Intelligence, with an optional significance filter (√N). Takes delivery channels into account (docs/nginx.md) and interprets signals uniformly behind a proxy (signals.is_http2, fingerprint.tls). Details: docs/METHODOLOGY.md, Appendix J (Request log statistics and collection methodology).

    Run (from tools/python or repo root):

    poetry run python request_log_stats.py -n 20 "logs/**/requests_*.jsonl"
    poetry run python request_log_stats.py -n 10 -o report.txt --format text "logs/**/*.jsonl"
    poetry run python request_log_stats.py --format json -o stats.json "logs/**/requests_*.jsonl"

    Options: -n / --top — number of top values per field (default 15); -o — output file; --format text|json; --sort count|discriminative; --exclude-stress-tests — exclude go-http-client; --no-significance-filter — disable the significance filter (√N). Record format: one JSON object per line (example: tests/testdata/reference_browser.json).
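A √N significance filter of the kind mentioned above can be sketched as follows (an illustration; the exact rule in request_log_stats.py may differ):

```python
# Sketch of a sqrt(N) significance filter: keep only field values whose
# count reaches the square root of the total record count, so rare
# values don't clutter the top-N tables. An illustration only; the
# script's exact rule may differ.
import math
from collections import Counter

def significant_top(values: list[str], top_n: int) -> list[tuple[str, int]]:
    counts = Counter(values)
    total = sum(counts.values())
    threshold = math.sqrt(total)
    kept = [(v, c) for v, c in counts.most_common() if c >= threshold]
    return kept[:top_n]
```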

  • request_log_stats_by_class.py — same statistics as above but by group: all (optionally excluding stress tests), bot, browser. Input: one or more globs for JSON (single array of objects) or JSONL (one object per line). Output: file or stdout, text or JSON. Reuses aggregation and formatting from request_log_stats.py.

    poetry run python request_log_stats_by_class.py -n 15 "logs/**/requests_*.jsonl"
    poetry run python request_log_stats_by_class.py -o report.txt "logs/**/*.json" "logs/**/*.jsonl"
    poetry run python request_log_stats_by_class.py --format json -o stats.json "logs/requests.jsonl"
    poetry run python request_log_stats_by_class.py --exclude-stress-tests "logs/**/requests_*.jsonl"

    Options: same as request_log_stats.py plus --no-progress. Text output: three sections (ALL, BOT, BROWSER). JSON output: {"all": {...}, "bot": {...}, "browser": {...}}.
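The grouping described above can be sketched as follows (a simplified illustration; the real script reuses request_log_stats.py internals rather than this exact helper):

```python
# Simplified sketch of the all/bot/browser grouping: every record goes
# into "all", and records with a bot or browser classification also go
# into their class group. Illustration only.
def split_by_class(records: list[dict]) -> dict[str, list[dict]]:
    groups: dict[str, list[dict]] = {"all": [], "bot": [], "browser": []}
    for rec in records:
        groups["all"].append(rec)
        cls = rec.get("classification")
        if cls in ("bot", "browser"):
            groups[cls].append(rec)
    return groups
```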

  • behavioral_bars.py — splits values of four behavioural metrics (request_rate_per_min, inter_arrival_median_sec, inter_arrival_std_per_mean, inter_arrival_mean_median_ratio) from request_metrics.ip_derived into 99 percentile bars (p01–p99). For each bar it outputs: total row count, rows classified as browser, as bot, bot−browser, and (bot−browser)/(bot+browser). Optionally generates and saves charts per parameter with the current edge threshold marked (METHODOLOGY Appendix M).

    poetry run python behavioral_bars.py "logs/requests.jsonl"
    poetry run python behavioral_bars.py -o report.json --charts-dir ./charts "logs/**/*.jsonl"
    poetry run python behavioral_bars.py --p-from 5 --p-to 95 "logs/requests.jsonl"
    poetry run python behavioral_bars.py --req-per-min 2.0 --gap-median-sec 4.0 "logs/**/requests_*.jsonl"

    Options: -o — output JSON (default: stdout); --charts-dir — directory for PNG charts; --p-from, --p-to — percentile bar range, 1-based (default: 1 and 99, i.e. p01–p99); --no-progress; --req-per-min, --gap-median-sec, --gap-std-mean, --gap-mean-median — edge thresholds for display on charts.
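The per-bar statistic described above can be sketched like this (the bar-edge convention and interpolation are assumptions; behavioral_bars.py may bin differently):

```python
# Sketch of percentile bars over a behavioural metric: split values into
# 99 bars between consecutive percentile edges (p0..p99, linear
# interpolation) and report bot/browser counts plus their normalized
# difference per bar. Edge convention is an assumption.
def linear_percentile(sorted_vals: list[float], p: float) -> float:
    """Percentile with linear interpolation between closest ranks."""
    k = (len(sorted_vals) - 1) * p / 100.0
    f = int(k)
    c = min(f + 1, len(sorted_vals) - 1)
    return sorted_vals[f] + (sorted_vals[c] - sorted_vals[f]) * (k - f)

def percentile_bars(values: list[float], labels: list[str]) -> list[dict]:
    sv = sorted(values)
    edges = [linear_percentile(sv, p) for p in range(100)]  # p0..p99
    bars = []
    for lo, hi in zip(edges, edges[1:]):  # 99 half-open intervals
        bot = brow = total = 0
        for v, lab in zip(values, labels):
            if lo <= v < hi:
                total += 1
                bot += lab == "bot"
                brow += lab == "browser"
        denom = bot + brow
        bars.append({
            "total": total, "bot": bot, "browser": brow,
            "diff": bot - brow,
            "diff_norm": (bot - brow) / denom if denom else 0.0,
        })
    return bars
```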

  • sample_assessment.py — builds a random representative sample from request JSONL for manual FP/FN assessment. Excludes IPs in the top-10 and bottom-10 by request count and IPs with fewer than 2 requests; randomly selects 100 bot-labeled IPs and 100 browser-labeled IPs (configurable), then for each IP outputs the first 10 requests by time with: time, delta (ms) from previous request, classification, url, client, cookies, referrer. Console output is human-readable; -o writes full JSON. See METHODOLOGY Appendix O.

    poetry run python sample_assessment.py "logs/requests.jsonl"
    poetry run python sample_assessment.py -o sample.json --json "logs/requests.jsonl"
    poetry run python sample_assessment.py --bot-n 50 --browser-n 50 --seed 42 "logs/**/requests_*.jsonl"

    Options: -o / --output — write full result JSON; --json — print JSON to stdout; --bot-n, --browser-n — number of IPs to sample per class (default 100 each); --seed — random seed for reproducibility.
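The IP filtering described above can be sketched as follows (an illustration of the exclusion rule; the script’s actual implementation may differ):

```python
# Sketch of the sampling eligibility rule: drop IPs with fewer than 2
# requests, then drop the top-10 and bottom-10 remaining IPs by request
# count. Illustration only; sample_assessment.py may differ in detail.
from collections import Counter

def eligible_ips(ip_counts: Counter) -> set[str]:
    """Return IPs eligible for sampling after the exclusions."""
    multi = {ip: c for ip, c in ip_counts.items() if c >= 2}
    ranked = sorted(multi, key=multi.get, reverse=True)
    excluded = set(ranked[:10]) | set(ranked[-10:])
    return set(ranked) - excluded
```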

Dashboard payload (build_dashboard_payload.py)

The script is the recommended way to produce dashboard.json for the TS dashboard. Place the output in tools/ts/dashboard/public/dashboard.json for local dev, or serve it from the same origin (or set VITE_DASHBOARD_JSON_URL at build time). Cron example:

cd tools/python
poetry run python build_dashboard_payload.py -o /var/www/dashboard.json "logs/**/requests_*.jsonl"

Dependencies

Managed via Poetry, see pyproject.toml. Main ones: curl-cffi, pandas, numpy, matplotlib (for behavioral_bars.py charts).

Adding a new dependency:

cd tools/python
poetry add <package>

Updating the lock file after editing pyproject.toml:

poetry lock
poetry install