
Enterprise Attack Simulator

AI-Powered Synthetic Security Log Dataset + Benchmark for Detection Engineering


What This Is

A realistic enterprise security log dataset and reproducible detection benchmark featuring:

  • 7,920,291 total logs over 25 days of continuous enterprise activity
  • Multi-user pivot attack campaign (initial compromise to higher-privilege user to admin account abuse)
  • 133 users across departments with role-based behavior
  • 55 service accounts with authentic high-volume background activity
  • Defense observability logs (EDR / DLP / SIEM / PAM / MFA-style) at SOC volume:
    • 142,184 defense observability events total
    • 342 attacker actions (ground truth)
    • 269 attack-triggered observability events (not de-duplicated; one action can trigger multiple alert types)
  • Row-level ground truth: attacker actions are labeled by attack_id (non-null)

How it's generated: both benign and attack activity are generated by an AI-driven simulation framework. The attacker is guided by a 100+ parameter defensive environment (visibility, controls, enforcement, noise, alert overlap) and optimizes its campaign behavior under that posture.

Built by an ML engineer working in cybersecurity who hit the same wall most teams hit: no realistic, labeled enterprise data for training and validating detection systems.



Background & Motivation

Security teams face a data challenge.

Training detection systems requires labeled attack data, but real breaches are rare, sensitive, and can't be freely shared. Most teams resort to:

  • Static datasets from years ago (DARPA 2000, CICIDS)
  • Lab exercises with clean, compressed attack scenarios
  • Limited red team engagements that can't run continuously

Validating detections is difficult without realistic test data that mirrors actual enterprise environments, complete with background noise, service account activity, and the tool overlap that makes real attacks hard to detect.

This dataset aims to help by providing enterprise-scale, multi-week, high-noise logs with event-level ground truth, plus a standard evaluator so teams can compare detections apples-to-apples.


Scenario at a Glance

This release models an intermediate-skill living-off-the-land campaign with a multi-identity pivot:

| Identity | Context | Attacker Actions |
|---|---|---|
| daniel.davis004 on WS-SAL-0005 | Initial compromised user (Sales) | 136 |
| daniel.thomas070 on WS-IT-0071 | Pivot target (IT, normal account); attacker + legitimate IT activity mixed | 142 |
| daniel.thomas070_admin on WS-IT-0071 | Privilege context (IT, admin account); admin-like attacker actions blended with normal admin work | 23 |

Hosts touched by attacker actions:

| Host | Attacker-Action Events |
|---|---|
| WS-IT-0071 | 165 |
| WS-SAL-0005 | 136 |
| DB-SRV-02 | 24 |
| APP-SRV-02 | 11 |
| WEB-SRV-01 | 3 |
| DC-01 | 3 |

Ground truth: all rows where attack_id is non-null.
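The per-host counts above can be reproduced with a simple filter on attack_id. A minimal sketch, assuming records is the loaded dataset (the illustrative rows below are invented for the example):

```python
from collections import Counter

def attacker_actions_per_host(records):
    """Count attacker-action events (non-null attack_id) per host."""
    return Counter(r["hostname"] for r in records if r.get("attack_id") is not None)

# Illustrative records (values invented for the example):
records = [
    {"hostname": "WS-IT-0071", "attack_id": "ATK_70246"},
    {"hostname": "WS-IT-0071", "attack_id": "ATK_70246"},
    {"hostname": "WS-SAL-0005", "attack_id": "ATK_70246"},
    {"hostname": "DC-01", "attack_id": None},  # benign row, not counted
]

print(attacker_actions_per_host(records))
# Counter({'WS-IT-0071': 2, 'WS-SAL-0005': 1})
```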


Repository Contents

├── data/
│   └── two_day_sample_cyber_simulator_json_format.zip   # 2-day JSON sample (included)
├── notebooks/
│   └── explore_dataset.ipynb                            # dataset exploration
├── docs/
│   └── SCHEMA.md                                        # complete field documentation
├── evaluate.py                                          # benchmark evaluator
└── README.md

Full Dataset (External):

  • JSON Format (HuggingFace): zipped JSONL files split by day, see Quick Start

Quick Start

Option 1: Two-day JSON sample (included in repo)

The repo includes a small, ready-to-use two-day JSON sample at data/two_day_sample_cyber_simulator_json_format.zip.

import json
import zipfile

records = []

with zipfile.ZipFile("data/two_day_sample_cyber_simulator_json_format.zip") as zf:
    for name in zf.namelist():
        if name.endswith(".json"):
            with zf.open(name) as f:
                for line in f:
                    line = line.decode("utf-8").strip()
                    if line:
                        records.append(json.loads(line))

print("Loaded records:", len(records))

# Attacker actions = non-null attack_id
attack = [r for r in records if r.get("attack_id") is not None]
print("Attack records:", len(attack))

Option 2: Full dataset (JSON, split by day — HuggingFace)

curl -L -o data/cyber_simulator_json_format.zip \
  https://huggingface.co/datasets/gregalr/cyber_simulation_json_format/resolve/main/cyber_simulator_json_format.zip
Then load it the same way as the sample:

import json
import zipfile

records = []

with zipfile.ZipFile("data/cyber_simulator_json_format.zip") as zf:
    for name in zf.namelist():
        if name.endswith(".json"):
            with zf.open(name) as f:
                for line in f:
                    line = line.decode("utf-8").strip()
                    if line:
                        records.append(json.loads(line))

print("Loaded records:", len(records))

Tip: For the full dataset, consider streaming into BigQuery / DuckDB / ClickHouse instead of loading all records into memory.
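Short of loading into a database, a generator keeps memory flat by yielding one record at a time. A sketch using the same zip layout, iterating members in sorted order for determinism:

```python
import json
import zipfile

def iter_records(zip_path):
    """Yield one parsed record at a time from the zipped JSONL files,
    in sorted member order, without holding the dataset in memory."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in sorted(zf.namelist()):
            if not name.endswith(".json"):
                continue
            with zf.open(name) as f:
                for line in f:
                    line = line.decode("utf-8").strip()
                    if line:
                        yield json.loads(line)

# Usage: count attacker actions without materializing all records
# n_attack = sum(1 for r in iter_records("data/cyber_simulator_json_format.zip")
#                if r.get("attack_id") is not None)
```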


Evaluate detections (benchmark)

Scoring definition

All benchmark metrics are computed only on Windows attacker-action telemetry:

  • Eligible rows: log_type == "windows_security_event"
  • Positive label: attack_id is present (non-null / not "NA") within eligible rows
  • All other log types (DLP blocks, SIEM alerts, EDR alerts, PAM denials, etc.) are included for context but are not scored
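The scoring rules above can be written as two small predicates. A sketch only: is_eligible and is_positive are illustrative helper names, not part of evaluate.py:

```python
ELIGIBLE_LOG_TYPE = "windows_security_event"

def is_eligible(row):
    """A row is scored only if it is Windows attacker-action telemetry."""
    return row.get("log_type") == ELIGIBLE_LOG_TYPE

def is_positive(row):
    """Within eligible rows, the positive label is a present attack_id."""
    aid = row.get("attack_id")
    return is_eligible(row) and aid is not None and aid != "NA"

# Examples:
print(is_positive({"log_type": "windows_security_event", "attack_id": "ATK_70246"}))  # True
print(is_positive({"log_type": "siem_alert", "attack_id": "ATK_70246"}))              # False (not eligible)
print(is_positive({"log_type": "windows_security_event", "attack_id": "NA"}))         # False
```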

Submission format

Submit a file listing the row_id values you flagged as suspicious:

  • row_id is the 0-based row number in the canonical dataset ordering (header excluded)
  • Do not shuffle or re-sort the dataset before generating row_ids
  • Alerts outside the eligible universe are ignored

Accepted formats:

  • CSV/TSV with a row_id column
  • JSON list of integers: [12, 19, 204, ...]
  • Plain text: one integer row_id per line

Example: create a submission file

Important: when working from the JSON zip, generate row_id using a deterministic order — iterate zip members in sorted(zf.namelist()) order, then iterate lines in file order.

import csv
import json
import zipfile

row_id = 0
alert_row_ids = []

with zipfile.ZipFile("data/cyber_simulator_json_format.zip") as zf:
    for name in sorted(zf.namelist()):
        if name.endswith(".json"):
            with zf.open(name) as f:
                for line in f:
                    line = line.decode("utf-8").strip()
                    if not line:
                        continue  # skip blank lines; they are not rows
                    obj = json.loads(line)
                    cmd = (obj.get("command_line") or "").lower()

                    # Replace this rule with your detector/model
                    if obj.get("log_type") == "windows_security_event" and "invoke-command" in cmd:
                        alert_row_ids.append(row_id)

                    row_id += 1

with open("submission.csv", "w", newline="") as out:
    w = csv.writer(out)
    w.writerow(["row_id"])
    for rid in alert_row_ids:
        w.writerow([rid])

print("Wrote submission.csv with", len(alert_row_ids), "alerts")

Run evaluation

python evaluate.py --data data/cyber_simulator_json_format.zip --pred submission.csv --out metrics.json

The evaluator outputs precision/recall/F1, FP/day, recall by stage/technique, and time-to-detect.
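For intuition, the headline metrics can be sketched from predicted vs. ground-truth row-id sets. Illustrative only: evaluate.py is the authoritative scorer, and detection_metrics is a made-up helper:

```python
def detection_metrics(pred_row_ids, true_row_ids, days):
    """Precision / recall / F1 and false positives per day from row-id sets.
    Illustrative only -- evaluate.py is the authoritative scorer."""
    pred, true = set(pred_row_ids), set(true_row_ids)
    tp = len(pred & true)   # flagged and actually attacker actions
    fp = len(pred - true)   # flagged but benign
    fn = len(true - pred)   # attacker actions missed
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if true else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fp_per_day": fp / days}

m = detection_metrics([1, 2, 3, 10], [2, 3, 4], days=25)
print(m)  # tp=2, fp=2, fn=1 -> precision 0.5, recall ~0.667
```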


Attack Chain: living_off_land_basic

Type: living-off-the-land (PowerShell / remote execution / discovery / lateral movement / exfiltration)

Key realism feature: the attacker pivots from a low-privilege endpoint to IT context, then abuses a separate admin account.

High-level phases:

  1. Initial compromise + foothold (Sales workstation)
  2. Discovery + pivot attempts (IT workstation context)
  3. Admin-account abuse (dual-account realism)
  4. Server access + staging + exfiltration

Attacker actions are labeled via attack_id (non-null) and stage_number (0–15).


Why This Attack Is Hard to Detect

Privilege transitions are the hard part. The pivot from Sales to IT to IT Admin puts attacker behavior inside the legitimate admin "shape."

Tool overlap. PowerShell and remote admin behaviors are normal for IT, so signatures alone fail.

SOC-like alert noise. Defense observability exists at high volume, with overlapping alert types and duplicates.

Living-off-the-land. No malware required: mostly built-in tooling and normal protocols.


Dataset Statistics

| Metric | Value |
|---|---|
| Total logs | 7,920,291 |
| Defense observability logs | 142,184 |
| Attacker actions (attack_id non-null) | 342 (~0.004% of total) |
| Attack-triggered observability events | 269 (not de-duplicated) |
| Duration | 25 days |
| Users | 133 |
| Service accounts | 55 |
| Pivot identities | 3 (Sales user to IT user to IT admin) |
| Hosts touched by attacker | 6 |
| Attack stages | 16 |

Schema Overview

Core Fields

| Field | Description |
|---|---|
| timestamp | ISO 8601: "2025-12-21T01:32:03.000-08:00" |
| log_type | "windows_security_event", "defender_atp_alert", "pam_access_denied", ... |
| user | Human identity |
| account | Security principal (user or svc_*) |
| hostname | Device: "WS-IT-0071", "DB-SRV-02", ... |
| device_type | workstation, database_server, domain_controller, ... |
| location | NYC_HQ, SF_Office, London, Remote_VPN, ... |
| department | Sales, IT, Finance, ... |
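The timestamp format carries an explicit UTC offset, so Python's standard datetime.fromisoformat (3.7+) parses it directly:

```python
from datetime import datetime

ts = "2025-12-21T01:32:03.000-08:00"  # format shown in the schema
dt = datetime.fromisoformat(ts)

print(dt.tzinfo)           # UTC-08:00
print(dt.hour, dt.minute)  # 1 32
```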

Activity Fields

| Field | Description |
|---|---|
| process_name | "powershell.exe", "cmd.exe", ... |
| command_line | Full command with arguments |
| event_type | process_start, network_connection, file_access, ... |
| source_ip | Internal: 10.x.x.x, VPN: 192.168.x.x |
| destination_ip | Internal or external |
| port | 135, 443, 3389, 8443, ... |
| protocol | TCP, UDP, HTTPS, RDP, ... |

Attack Labels

| Field | Description |
|---|---|
| attack_id | "ATK_70246" or null |
| attack_type | MITRE technique (optional per-row) or null |
| stage_number | "0" through "15" or null |

Defense Observability Fields

| Field | Description |
|---|---|
| severity | low, medium, high, critical |
| action_taken | blocked, logged, quarantined, denied |
| vendor | Defense product vendor (if present) |
| detection_confidence | 0.0–1.0 (if present) |
| alert_name | e.g. "Suspicious PowerShell Activity" |
| reason | e.g. "Unauthorized service account access attempt" |

See docs/SCHEMA.md for complete documentation.
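For orientation, a single record combining the field groups above might look like the following. All values are invented for illustration; docs/SCHEMA.md is the authoritative field reference:

```python
# A hypothetical record illustrating the field groups
# (values invented for illustration, not taken from the dataset):
record = {
    "timestamp": "2025-12-21T01:32:03.000-08:00",
    "log_type": "windows_security_event",
    "user": "daniel.thomas070",
    "account": "daniel.thomas070_admin",
    "hostname": "WS-IT-0071",
    "device_type": "workstation",
    "location": "NYC_HQ",
    "department": "IT",
    "process_name": "powershell.exe",
    "command_line": "powershell.exe -NoProfile Invoke-Command ...",
    "event_type": "process_start",
    "attack_id": "ATK_70246",
    "attack_type": None,
    "stage_number": "7",
}

# The benchmark would treat this row as eligible and positive:
print(record["log_type"] == "windows_security_event")                   # True
print(record["attack_id"] is not None and record["attack_id"] != "NA")  # True
```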


How This Dataset Differs

This dataset is built to test detections under realistic enterprise conditions:

  • Multi-user pivot across privilege boundaries (Sales to IT to IT Admin)
  • Dual-account admin modeling (normal + admin accounts)
  • SOC-scale observability noise (alerts/blocks/denials with overlap)
  • Row-level ground truth + evaluator so detectors can be compared reproducibly

Dataset Scope

This is a static, synthetic dataset representing a single high-fidelity campaign. It is designed to be repeatable and shareable, and to support:

  • Detection benchmarking (rules + ML)
  • SOC analyst training and investigation drills
  • Research on alert noise, privilege transitions, and pivot detection

License & Attribution

Project: Phantom Armor — Enterprise Attack Simulator
Author: Greg Rothman
Contact: gregralr@phantomarmor.com

All data is fully synthetic. No real users, systems, or organizations are represented.

Citation

@dataset{phantom_armor_2026,
  author    = {Rothman, Greg},
  title     = {Enterprise Attack Simulator: AI-Powered Synthetic Security Log Dataset and Detection Benchmark},
  year      = {2026},
  publisher = {Phantom Armor},
  url       = {https://github.com/gregdiy/cyber_simulation}
}

Community

  • Issues: Report bugs or request features via GitHub Issues
  • Discussions: Share detection techniques and ask questions
  • Contributions: PRs welcome for notebooks and analysis scripts

Acknowledgments

Motivated by real-world gaps encountered while building ML-based detection systems in enterprise security operations. Thanks to the security research community for public threat intelligence and documentation that informed the modeled tradecraft.
