DLP Reference

Complete reference for Grob's Data Loss Prevention (DLP) engine, covering all configuration options, detection rules, actions, and runtime behavior.

Configuration

All DLP settings live under the [dlp] table in grob.toml.

Top-level settings

[dlp]
enabled = true         # Master switch (default: false)
scan_input = true      # Scan outgoing requests (default: true when enabled)
scan_output = true     # Scan incoming responses (default: true when enabled)
no_builtins = false    # Disable all 25 built-in secret rules (default: false)
rules_file = ""        # Path to external TOML rules file (optional)
enable_sessions = false # Per-API-key DLP isolation (default: false)

Secret rules

Secret rules use regex patterns with a prefix gate for fast rejection. The prefix triggers pattern evaluation; only text containing the prefix is tested against the regex.

[[dlp.secrets]]
name = "internal_token"     # Human-readable identifier
prefix = "itk_"             # Literal prefix that triggers evaluation
pattern = "itk_[A-Za-z0-9]{40}"  # Regex to match the full secret
action = "canary"           # canary | redact | log
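
The prefix gate can be pictured as a cheap substring check placed in front of the regex. The following is a minimal Python sketch of that idea, not Grob's implementation; `SecretRule` and `scan` are hypothetical names introduced only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretRule:
    """Hypothetical stand-in for one [[dlp.secrets]] entry."""
    name: str
    prefix: str
    pattern: str
    action: str = "canary"


def scan(text: str, rules: list[SecretRule]) -> list[tuple[str, str]]:
    """Return (rule name, matched text) pairs, skipping rules whose prefix is absent."""
    hits = []
    for rule in rules:
        # Prefix gate: a literal substring test is far cheaper than a regex search.
        if rule.prefix not in text:
            continue
        for match in re.finditer(rule.pattern, text):
            hits.append((rule.name, match.group(0)))
    return hits


rule = SecretRule("internal_token", "itk_", r"itk_[A-Za-z0-9]{40}")
```

Text that contains the prefix but not a full match (for example `itk_short`) passes the gate and is then rejected by the regex.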

Actions:

Action	Behavior
`canary`	Replaces with a syntactically valid fake containing `~CANARY` and a monotonic ID. Format-preserving: maintains prefix and length. Canaries cannot re-match the original pattern (the `~` character breaks `[A-Za-z0-9]` patterns).
`redact`	Alias for `canary` on secret rules. A canary token is generated rather than a plain `[REDACTED]` (except when the streaming canary circuit breaker trips, see Streaming path) because format preservation is strictly better for secrets: it avoids breaking the surrounding context and enables downstream exfiltration detection.
`log`	Logs the detection without modifying the text.
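
To make the format-preserving property concrete, here is a hedged sketch of a canary generator. The exact layout Grob uses is not documented here, so the ID placement and the `0` filler character are assumptions, and `make_canary` is a hypothetical name.

```python
import itertools
import re

# Monotonic ID source (process-wide in this sketch).
_counter = itertools.count(1)


def make_canary(secret: str, prefix: str) -> str:
    """Build a same-length fake that keeps the prefix and embeds ~CANARY plus an ID."""
    marker = f"~CANARY{next(_counter)}"
    body_len = len(secret) - len(prefix)
    # Assumption: pad with a filler character, or truncate if the body is very short.
    body = (marker + "0" * body_len)[:body_len]
    return prefix + body


secret = "itk_" + "A" * 40
canary = make_canary(secret, "itk_")
```

Because `~` sits right after the prefix, the canary keeps the original length but can no longer satisfy `itk_[A-Za-z0-9]{40}`.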

Custom prefix rules

Use custom prefix rules for vendor-specific tokens where you know the prefix and total length but do not need a full regex.

[[dlp.custom_prefixes]]
name = "vault_token"
prefix = "v1.AA"
length = 32           # Expected total length including prefix
action = "canary"

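The engine auto-generates a regex: {escaped_prefix}[A-Za-z0-9]{remaining_length}, where remaining_length is length minus the prefix length. For the rule above, the 5-character prefix and length = 32 give v1\.AA[A-Za-z0-9]{27}.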
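
The regex construction can be reproduced in a few lines. This is an illustrative sketch; `prefix_rule_regex` is a hypothetical helper, and rejecting a `length` that does not exceed the prefix is an assumption about sensible validation.

```python
import re


def prefix_rule_regex(prefix: str, length: int) -> str:
    """Build {escaped_prefix}[A-Za-z0-9]{remaining_length} for a custom prefix rule."""
    remaining = length - len(prefix)
    if remaining <= 0:
        # Assumption: a total length no longer than the prefix is a config error.
        raise ValueError("length must exceed the prefix length")
    return f"{re.escape(prefix)}[A-Za-z0-9]{{{remaining}}}"


pattern = prefix_rule_regex("v1.AA", 32)
```

Escaping matters here: without it, the `.` in `v1.AA` would match any character.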

Name anonymization

[[dlp.names]]
term = "Thales"        # Exact term to detect (case-insensitive)
action = "pseudonym"   # pseudonym | redact | log

Actions:

Action	Behavior
`pseudonym`	Replaces with an HMAC-derived pseudonym (e.g. `Cobalt-Falcon-a3f2`). The mapping is reversed on the response path. Set the `GROB_DLP_SECRET` environment variable to keep pseudonyms stable across restarts.
`redact`	Replaces with `[NAME]`.
`log`	Logs the detection without modification.

Pseudonyms are generated from 64 adjectives and 64 nouns (4096 combinations) plus a hex suffix for collision resistance. The forward mapping uses Aho-Corasick for O(n) multi-pattern matching.
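
A sketch of how an HMAC-derived pseudonym of this shape could be produced. The hash function (HMAC-SHA256), the way digest bytes select words, and the shortened word lists are all assumptions; only the adjective-noun-hex format, the case-insensitive matching, and the `GROB_DLP_SECRET` fallback behavior come from this reference.

```python
import hashlib
import hmac
import os
import secrets

# Hypothetical, shortened word lists; the engine is described as using 64 of each.
ADJECTIVES = ["Cobalt", "Amber", "Silent", "Crimson"]
NOUNS = ["Falcon", "Harbor", "Comet", "Willow"]


def load_key() -> bytes:
    """Use GROB_DLP_SECRET when set, otherwise a random per-process key."""
    env = os.environ.get("GROB_DLP_SECRET")
    return env.encode() if env else secrets.token_bytes(32)


def pseudonym(term: str, key: bytes) -> str:
    """Derive an Adjective-Noun-hex pseudonym from an HMAC of the lowercased term."""
    digest = hmac.new(key, term.lower().encode(), hashlib.sha256).digest()
    adjective = ADJECTIVES[digest[0] % len(ADJECTIVES)]
    noun = NOUNS[digest[1] % len(NOUNS)]
    suffix = digest[2:4].hex()  # 4 hex characters, as in "a3f2"
    return f"{adjective}-{noun}-{suffix}"
```

With a fixed key the same term always maps to the same pseudonym, which is what makes the response-path reversal and cross-restart stability possible.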

PII detection

[dlp.pii]
credit_cards = true    # Luhn-validated card numbers (default: true)
iban = true            # ISO 13616 mod-97 validated IBANs (default: true)
bic = false            # BIC/SWIFT codes (default: false, higher false-positive rate)
action = "redact"      # redact | log

Redaction labels: [CARD REDACTED], [IBAN REDACTED], [BIC REDACTED].
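
The two checksum validations named in the config comments (Luhn for card numbers, ISO 13616 mod-97 for IBANs) are standard algorithms. The sketch below shows them in Python; the function names and the length bounds used to reject obviously wrong input are assumptions, not Grob's actual limits.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # assumption: shorter runs are not treated as cards
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """ISO 13616 mod-97: move the first 4 chars to the end, map A-Z to 10-35."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1
```

Validating the checksum before redacting is what keeps arbitrary 16-digit numbers or uppercase strings from being flagged as cards or IBANs.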

Entropy analysis

[dlp.entropy]
enabled = false        # Default: disabled
action = "log"         # log | alert

Uses a Sequential Probability Ratio Test (SPRT) on Shannon entropy. Typical entropy ranges and the detection threshold:

Natural English: ~3.5-4.5 bits/byte
Base64: ~5.5-6.0 bits/byte
Random: ~7.5-8.0 bits/byte
Detection threshold: 5.5 bits/byte

Entropy analysis runs asynchronously after stream completion and never blocks the response path.
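
For reference, the bits-per-byte figures above are Shannon entropy over the byte distribution of a sample. The sketch below computes that quantity and compares it with the 5.5 threshold; it deliberately does not implement the SPRT decision logic, and the function names are hypothetical.

```python
import math
from collections import Counter

DETECTION_THRESHOLD = 5.5  # bits/byte, from the list above


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_high_entropy(token: str) -> bool:
    """Single-sample comparison against the threshold (no sequential test)."""
    return shannon_entropy(token.encode()) >= DETECTION_THRESHOLD
```

Entropy cannot exceed the base-2 logarithm of the number of distinct byte values in the sample, so a sample needs at least 46 distinct byte values to reach 5.5 bits/byte. Short tokens therefore never cross the threshold on their own, which is one reason to accumulate evidence across a stream.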

URL exfiltration detection

[dlp.url_exfil]
enabled = false                     # Default: disabled
action = "log"                      # redact | log | block
scan_markdown_images = true         # ![](url) patterns
scan_markdown_links = true          # [text](url) patterns
scan_raw_urls = true                # Bare http/https URLs
flag_long_query_params = true       # Query strings > max_query_length
flag_base64_in_path = true          # Base64 segments in URL paths
flag_data_uris = true               # data: URI scheme
max_query_length = 200              # Byte threshold for long query flagging
whitelist_domains = []              # Allowed domains (if set, only these pass)
blacklist_domains = []              # Blocked domains
domain_match_mode = "suffix"        # exact | suffix | glob

Domain matching modes:

Mode	Behavior	Example
`exact`	Hostname must equal the entry	`github.com` matches only `github.com`
`suffix`	Hostname ends with the entry	`github.com` matches `api.github.com`
`glob`	Wildcard patterns	`*.github.com` matches `api.github.com` but not `github.com`

When whitelist_domains is non-empty, any domain NOT in the whitelist is flagged. Whitelist takes precedence over blacklist.
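
The three matching modes and the whitelist rule can be sketched as follows. `domain_matches` and `is_flagged` are hypothetical names. Two points are assumptions: that suffix matching respects label boundaries (so `evilgithub.com` does not match `github.com`), and that "whitelist takes precedence" means the blacklist is not consulted once a whitelist is set.

```python
from fnmatch import fnmatchcase


def domain_matches(host: str, entry: str, mode: str) -> bool:
    """Apply one domain_match_mode to a hostname and a list entry."""
    host, entry = host.lower(), entry.lower()
    if mode == "exact":
        return host == entry
    if mode == "suffix":
        # Assumption: match only on a label boundary.
        return host == entry or host.endswith("." + entry)
    if mode == "glob":
        return fnmatchcase(host, entry)
    raise ValueError(f"unknown domain_match_mode: {mode}")


def is_flagged(host: str, whitelist: list[str], blacklist: list[str],
               mode: str = "suffix") -> bool:
    """Flag hosts outside a non-empty whitelist, else hosts on the blacklist."""
    if whitelist:
        return not any(domain_matches(host, e, mode) for e in whitelist)
    return any(domain_matches(host, e, mode) for e in blacklist)
```

For example, `is_flagged("api.github.com", ["github.com"], ["github.com"])` is false under this reading, because the whitelist match wins.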

Prompt injection detection

[dlp.prompt_injection]
enabled = false                     # Default: disabled
action = "log"                      # redact | log | block
no_builtins = false                 # Disable built-in patterns
custom_patterns = []                # User-defined regex patterns
languages = ["all"]                 # Language filter: ["all"] or ["en", "fr", "zh", ...]

28 languages supported: English, French, German, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Chinese, Japanese, Korean, Turkish, Arabic, Hindi, Vietnamese, Thai, Indonesian, Malay, Swedish, Norwegian, Danish, Finnish, Czech, Romanian, Hungarian, Greek, Esperanto.

Anti-obfuscation pipeline (applied before pattern matching):

Strip invisible Unicode characters (zero-width spaces, joiners, directional marks, BOM)
NFKC normalization
Homoglyph mapping (Cyrillic, Greek, fullwidth, mathematical variants to Latin)
Whitespace collapse
Leet speak decoding (aggressive pass, only if standard pass found nothing)

Text normalization is cached with moka (2048 entries, 5-minute TTL).
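
The pipeline steps can be approximated with the Python standard library. This is a simplified sketch: the homoglyph and leet tables below are tiny hypothetical samples, the function names are invented, and caching is omitted.

```python
import re
import unicodedata

# Zero-width spaces, joiners, directional marks, word joiner, and the BOM.
INVISIBLE = dict.fromkeys(
    [0x200B, 0x200C, 0x200D, 0x200E, 0x200F, 0x2060, 0xFEFF]
    + list(range(0x202A, 0x202F))
)
# Hypothetical sample: a few Cyrillic and Greek look-alikes mapped to Latin.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u03bf": "o", "\u03b9": "i",
})
# Hypothetical sample leet table for the aggressive pass.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})


def normalize(text: str) -> str:
    text = text.translate(INVISIBLE)            # strip invisible characters
    text = unicodedata.normalize("NFKC", text)  # fold fullwidth and math variants
    text = text.translate(HOMOGLYPHS)           # map look-alikes to Latin
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace


def detect(text: str, pattern: re.Pattern) -> bool:
    standard = normalize(text)
    if pattern.search(standard):
        return True
    # Aggressive leet pass, run only when the standard pass found nothing.
    return bool(pattern.search(standard.translate(LEET)))
```

Running the leet pass only as a fallback avoids rewriting ordinary digits in text that already matched, or that is clean, on the standard pass.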

Signed config hot-reload

[dlp.signed_config]
enabled = false
source = "~/.grob/dlp-rules.toml"  # File path or HTTPS URL
poll_interval = "1h"               # Polling interval (e.g. "30m", "6h")
verify_signature = false           # Require ECDSA P-256 signature
public_key_path = ""               # Path to PEM or raw SEC1 public key
detached_sig_suffix = ".sig"       # Suffix for detached signature files

Hot-reloadable fields: url_exfil.whitelist_domains, url_exfil.blacklist_domains, prompt_injection.custom_patterns. Changes are SHA-256 checksummed; unchanged content is skipped.
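
The checksum-based skip can be sketched as a poller that remembers the digest of the last applied content. `RuleReloader` is a hypothetical name; the returned strings mirror the `success` and `unchanged` status labels listed under Metrics, and signature verification and parsing are left out.

```python
import hashlib


class RuleReloader:
    """Hypothetical poller state: remembers the SHA-256 of the last applied content."""

    def __init__(self) -> None:
        self.last_digest = None

    def poll(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest == self.last_digest:
            # Identical bytes: skip verification, parsing, and rule swapping.
            return "unchanged"
        # Signature verification and TOML parsing would happen here before applying.
        self.last_digest = digest
        return "success"
```

Each polling interval that fetches identical content returns early, so an hourly poll of a rarely changing file costs little more than the fetch and one hash.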

External rules file

[dlp]
rules_file = "~/.grob/dlp-rules.toml"

External file format (no [dlp] wrapper):

[[secrets]]
name = "custom_token"
prefix = "ctk_"
pattern = "ctk_[A-Za-z0-9]{32}"
action = "redact"

[[names]]
term = "Acme Corp"
action = "pseudonym"

Session isolation

[dlp]
enable_sessions = true

When enabled, each unique API key (or JWT tenant_id) gets its own NameAnonymizer with different pseudonyms and an independent CanaryGenerator counter. API keys are SHA-256 hashed for session identification. Session engines are created lazily and cached.
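
A minimal sketch of lazy, hash-keyed session state, assuming hypothetical `SessionRegistry` and `SessionState` classes. Only the canary counter is modeled; a real session would also hold its own name anonymizer.

```python
import hashlib
import itertools


class SessionState:
    """Per-session state; here just an independent canary ID counter."""

    def __init__(self) -> None:
        self.canary_ids = itertools.count(1)


class SessionRegistry:
    def __init__(self) -> None:
        self._sessions = {}

    @staticmethod
    def session_id(api_key: str) -> str:
        # The raw key is never used as a map key, only its SHA-256 digest.
        return hashlib.sha256(api_key.encode()).hexdigest()

    def get(self, api_key: str) -> SessionState:
        sid = self.session_id(api_key)
        if sid not in self._sessions:
            # Lazy creation on first use, then cached.
            self._sessions[sid] = SessionState()
        return self._sessions[sid]
```

Two API keys therefore never share a counter, and the registry never stores the keys themselves.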

Built-in rules

25 rules ship by default (disable with no_builtins = true):

Rule	Prefix	Family
`openai_api_key`	`sk-proj-`	llm
`anthropic_api_key`	`sk-ant-api03-`	llm
`huggingface_token`	`hf_`	llm
`perplexity_api_key`	`pplx-`	llm
`gcp_api_key`	`AIza`	generic
`vault_token`	`hvs.`	generic
`stripe_secret_key`	`sk_`	stripe
`stripe_restricted_key`	`rk_live_`	stripe
`sendgrid_api_key`	`SG.`	stripe
`github_pat_v2`	`github_pat_`	github
`github_pat`	`ghp_`	github
`github_oauth`	`gho_`	github
`github_app`	`ghs_`	github
`gitlab_pat`	`glpat-`	gitlab
`npm_token`	`npm_`	generic
`slack_bot_token`	`xoxb-`	generic
`slack_user_token`	`xoxp-`	generic
`aws_access_key`	`AKIA`	aws
`jwt_token`	`eyJ`	jwt
`rsa_private_key`	`-----BEGIN RSA PRIVATE KEY-----`	pem
`openssh_private_key`	`-----BEGIN OPENSSH PRIVATE KEY-----`	pem
`ec_private_key`	`-----BEGIN EC PRIVATE KEY-----`	pem
`generic_private_key`	`-----BEGIN PRIVATE KEY-----`	pem
`postgres_uri`	`postgres://`	database
`mongodb_uri`	`mongodb://`	database

All built-in rules use the redact action, which for secret rules generates a canary token (see Secret rules).

Scanning pipeline

The following diagram illustrates the full DLP scan pipeline from request ingestion to client response.

flowchart TB
    req["Incoming Request"] --> injection["Injection Detector<br/>(28 languages)"]
    injection -->|"detected, action = block"| block["Block request"]
    injection -->|"clean or logged"| names["Name Pseudonymizer<br/>(reversible HMAC mapping)"]
    names --> secrets["Secret Scanner<br/>(25 built-in rules, prefix gate + regex)"]
    secrets -->|"found"| action1{"Action?"}
    action1 -->|"canary or redact"| canary["Replace with canary token"]
    action1 -->|"log"| log["Log + continue"]
    canary --> pii["PII Scanner<br/>(credit card, IBAN, BIC)"]
    log --> pii
    secrets -->|"clean"| pii
    pii --> provider["Forward to provider"]
    provider --> resp["Response"]
    resp --> denames["Name De-anonymizer"]
    denames --> outscan["Secret + PII Scanners"]
    outscan --> url["URL Exfiltration Scanner"]
    url --> client["Return to client"]
    client -.-> entropy["Entropy Detector (SPRT)<br/>async, after stream completion"]

Request path (input)

Prompt injection detection (if enabled, action = block short-circuits)
Name anonymization (real names to pseudonyms)
Secret scanning (DFA prefix gate, then regex)
PII scanning (credit cards, IBAN, BIC with mathematical validation)

Response path (output)

Name de-anonymization (pseudonyms back to real names)
Secret scanning (catches LLM-generated secrets)
PII scanning (catches LLM-generated PII)
URL exfiltration scanning (Markdown images/links, data URIs)

Streaming path

SSE stream chunks are intercepted by the DlpStream adapter:

Zero-copy passthrough: Chunks without content_block_delta events are forwarded unchanged (SIMD-accelerated memchr::memmem check).
Token-length EMA pre-filter: An exponential moving average of per-delta text lengths skips DFA scanning when tokens are long (normal prose). Short BPE fragments trigger scanning.
Canary circuit breaker: After 20 secret detections in one stream, canary generation switches to [REDACTED] to prevent canary flooding.
Cross-chunk detection: The full response is accumulated and scanned at end-of-stream to catch secrets split across SSE deltas.
SPRT buffer: Bounded at 4 KB; accumulated per-stream. The SPRT scanner operates token-by-token (whitespace-split) rather than a raw byte sliding window.
URL exfil block: If a block-action URL exfiltration is detected, the stream is terminated with an event: error SSE event.
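
Cross-chunk detection and the circuit breaker are easiest to see in a small example. The sketch below uses a hypothetical `StreamScanner` class and the `internal_token` pattern from the Secret rules section. The exact point at which the breaker flips (here, once 20 detections have been counted) is an assumption.

```python
import re

PATTERN = re.compile(r"itk_[A-Za-z0-9]{40}")  # example rule from Secret rules
BREAKER_LIMIT = 20  # detections per stream before canaries give way to [REDACTED]


class StreamScanner:
    def __init__(self) -> None:
        self.accumulated = []
        self.detections = 0

    def on_delta(self, text: str) -> int:
        """Scan one delta in isolation and keep it for the end-of-stream pass."""
        self.accumulated.append(text)
        found = len(PATTERN.findall(text))
        self.detections += found
        return found

    def on_end(self) -> list[str]:
        """Scan the joined response; secrets split across deltas only appear here."""
        return PATTERN.findall("".join(self.accumulated))

    def replacement_mode(self) -> str:
        return "canary" if self.detections < BREAKER_LIMIT else "redacted"
```

A secret split as `itk_ZZZZZZ` and the remaining 34 characters is invisible to both per-delta scans but is found by `on_end`, which is the case the end-of-stream pass exists for.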

Performance

Prefix byte filter: O(n) single-byte lookup rejects text without any known prefix start byte (~90% rejection rate on clean text).
Aho-Corasick confirmation: For short texts (<=512 bytes), a multi-pattern AC automaton confirms a full prefix string exists before running regexes (~99% rejection).
Lazy regex compilation: Pattern syntax is validated at startup via regex_syntax::parse (cheap). Full DFA compilation is deferred to the first scan via OnceLock.
Name matching: Aho-Corasick automaton for both forward (anonymize) and reverse (de-anonymize) directions.
PII pre-filter: Byte-level scan for digit/uppercase runs before invoking regex.
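
Two of these ideas, the first-byte filter and deferred regex compilation, translate directly into a short sketch. The names below are hypothetical, and Python's `cached_property` stands in for the `OnceLock` mentioned above. Python has no cheap syntax-only validation step comparable to `regex_syntax::parse`, so that part is not shown.

```python
import re
from functools import cached_property


class LazyRule:
    def __init__(self, prefix: str, pattern: str) -> None:
        self.prefix = prefix
        self.pattern = pattern

    @cached_property
    def regex(self) -> re.Pattern:
        # Compiled on first access only, then cached on the instance.
        return re.compile(self.pattern)


def build_first_byte_table(rules: list[LazyRule]) -> bytearray:
    """256-entry lookup table marking every byte that starts some prefix."""
    table = bytearray(256)
    for rule in rules:
        table[rule.prefix.encode()[0]] = 1
    return table


def may_contain_secret(data: bytes, table: bytearray) -> bool:
    """O(n) pass: false means no rule prefix can start anywhere in the data."""
    return any(table[b] for b in data)


rules = [LazyRule("itk_", r"itk_[A-Za-z0-9]{40}"),
         LazyRule("ctk_", r"ctk_[A-Za-z0-9]{32}")]
table = build_first_byte_table(rules)
```

The byte filter only rules text out. A positive result still needs the full-prefix and regex stages, since any text containing `i` or `c` passes this first check.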

Metrics

All metrics use the grob_dlp_ prefix:

Metric	Type	Labels	Description
`grob_dlp_rules_loaded`	gauge	`type`	Number of loaded rules by type (`secret`, `name`)
`grob_dlp_detections_total`	counter	`type`, `rule`, `action`	Detection events
`grob_dlp_cross_chunk_total`	counter	`rule`	Cross-chunk detections at end-of-stream
`grob_dlp_stream_blocked_total`	counter	—	Streams terminated by URL exfil block
`grob_dlp_circuit_breaker_total`	counter	—	Canary circuit breaker activations
`grob_dlp_hot_reload_total`	counter	`status`	Hot-reload outcomes (`success`, `unchanged`, `failed`, `sig_failed`)
`grob_dlp_signature_verified_total`	counter	`result`	Signature verification results
`grob_dlp_config_hash_info`	gauge	`hash`	Current config hash (first 16 hex chars)

Environment variables

Variable	Purpose
`GROB_DLP_SECRET`	HMAC key for deterministic pseudonym generation. When unset, a random key is generated per process (pseudonyms differ across restarts).
