FakeSmith

Generate realistic fake data that mirrors your real data's shape — safe to share with LLMs, teammates, or in public repos.

FakeSmith is a Python package and CLI that converts real configs, payloads, logs, and datasets into schema-preserving synthetic versions. It targets a real and growing workflow problem: sanitizing developer artifacts before they are shared with an LLM.

When you share code with an AI assistant, you shouldn't have to expose real emails, API keys, card numbers, or user data. FakeSmith lets you describe (or just paste) a sample of your data and instantly get structurally identical but completely fake replacements.

Install

pip install fakesmith

Quick Start

Option 1 — Auto-detect from a sample

from fakesmith import FakeSmith

# Paste a real (or representative) sample — FakeSmith reads its shape
sample = '''[{
    "user_id": "3f2e1a4b-0000-0000-0000-000000000000",
    "email": "john.doe@company.com",
    "phone": "+1-800-555-0199",
    "api_key": "sk-abc123def456ghi789jkl012",
    "amount": 199.99,
    "status": "active",
    "created_at": "2024-01-15T09:30:00"
}]'''

smith = FakeSmith.from_sample(sample)
smith.describe()          # see what was detected
print(smith.to_json(5))   # 5 fake records, same shape
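To make "reads its shape" more concrete, here is a minimal standard-library sketch of how field types could be guessed from sample values. It is not FakeSmith's actual detection logic: the names `guess_type`, `infer_shape`, and `_DETECTORS`, and the regular expressions themselves, are assumptions made for illustration.

```python
import json
import re

# Hypothetical detectors, checked in order; the first match wins.
_DETECTORS = [
    ("UUID", re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")),
    ("EMAIL", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("DATETIME", re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")),
    ("API_KEY", re.compile(r"^sk-[A-Za-z0-9]{16,}$")),
]


def guess_type(value):
    """Return a coarse type label for one sample value."""
    # bool is tested before int because bool is a subclass of int
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, str):
        for label, pattern in _DETECTORS:
            if pattern.match(value):
                return label
        return "STRING"
    return "UNKNOWN"


def infer_shape(sample_json):
    """Map each field of the first record to a guessed type label."""
    records = json.loads(sample_json)
    first = records[0] if isinstance(records, list) else records
    return {key: guess_type(val) for key, val in first.items()}


sample = '''[{
    "user_id": "3f2e1a4b-0000-0000-0000-000000000000",
    "email": "john.doe@company.com",
    "api_key": "sk-abc123def456ghi789jkl012",
    "amount": 199.99,
    "status": "active",
    "created_at": "2024-01-15T09:30:00"
}]'''

shape = infer_shape(sample)
for field, label in shape.items():
    print(f"{field}: {label}")
```

A value-based guess like this cannot tell an enum such as "status" from free text, which is one reason a detected schema is worth inspecting before generating data.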

Option 2 — Manual schema

from fakesmith import FakeSmith, SchemaField, FieldType

smith = FakeSmith([
    SchemaField("user_id",  FieldType.UUID),
    SchemaField("email",    FieldType.EMAIL),
    SchemaField("name",     FieldType.FULL_NAME),
    SchemaField("amount",   FieldType.AMOUNT, min_value=10, max_value=5000),
    SchemaField("status",   FieldType.STATUS, choices=["active", "inactive", "pending"]),
    SchemaField("api_key",  FieldType.API_KEY, prefix="sk-live-"),
])

# Generate deterministic records with a seed
result = smith.generate(10, seed=42)
result.print_summary()  # See which fields were faked
records = result.records # Access the list of dicts
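Seeded generation generally relies on a random number generator that is initialised from the seed, so the same seed replays the same sequence of values. The sketch below shows that idea with the standard library only. It is an assumption about the general technique, not a description of FakeSmith's internals, and `generate_records` is a hypothetical helper.

```python
import random
import uuid


def generate_records(count, seed=None):
    """Build `count` fake records from a private, optionally seeded RNG."""
    # A private Random instance leaves the global random state untouched
    rng = random.Random(seed)
    records = []
    for _ in range(count):
        records.append({
            # Derive the UUID from RNG bits so it is reproducible under a seed
            "user_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "amount": round(rng.uniform(10, 5000), 2),
            "status": rng.choice(["active", "inactive", "pending"]),
        })
    return records


first = generate_records(3, seed=42)
second = generate_records(3, seed=42)
print(first == second)  # same seed, same records
```

Reproducible output is useful for test fixtures and for diffs in code review, because regenerating the data does not change it.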

Option 3 — Quick dict shorthand

smith = FakeSmith.from_dict({
    "id":       FieldType.UUID,
    "email":    FieldType.EMAIL,
    "score":    FieldType.INTEGER,
    "verified": FieldType.BOOLEAN,
})

Output Formats

smith.to_json(10)                          # JSON string
smith.to_csv(10)                           # CSV string
smith.to_sql(10, table_name="users")       # SQL INSERT statements
smith.to_env()                             # .env file format

smith.save_json("fake_users.json", 100)    # save to file
smith.save_csv("fake_users.csv",  100)
smith.save_sql("seed.sql",        100, table_name="users")
smith.save_env(".env.fake")

CLI

# Generate 20 fake records from a JSON sample
fakesmith generate --file real_sample.json --count 20 --format json

# From CSV, output as SQL inserts
fakesmith generate --file data.csv --count 50 --format sql --table transactions

# Deterministic output using a seed
fakesmith generate --file data.json --seed 42 --out fake_data.json

# Sanitize raw text (log lines, configs) in-place
fakesmith sanitize --file server.log --out clean.log --summary

# Inspect detected schema and sensitivity flags
fakesmith describe --file data.json

In-place Sanitization

FakeSmith can scan raw text (log lines, configuration blocks, or emails) and replace PII/secrets in-place without needing a schema.

from fakesmith import sanitize_text

raw_text = "My email is alex@example.com and my key is sk-12345"
result = sanitize_text(raw_text, seed=42)

print(result.sanitized)
# "My email is fake.user@domain.com and my key is sk-a1b2c3d4..."

result.print_summary() # See exactly what was replaced and why
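For readers curious about what schema-free replacement involves, the following is a minimal sketch based on regular expressions and a seeded generator. It is not how `sanitize_text` is implemented; `sanitize_sketch`, `EMAIL_RE`, and `KEY_RE` are hypothetical names, and the two patterns cover only the email and `sk-` key shapes used in the example above.

```python
import random
import re
import string

# Illustrative patterns only; real-world detection needs many more cases
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
KEY_RE = re.compile(r"\bsk-[A-Za-z0-9]+")


def sanitize_sketch(text, seed=None):
    """Replace emails and sk- keys with fake values; return (text, findings)."""
    rng = random.Random(seed)
    findings = []

    def fake_email(match):
        findings.append(("email", match.group(0)))
        user = "".join(rng.choices(string.ascii_lowercase, k=8))
        return f"{user}@example.com"

    def fake_key(match):
        findings.append(("api_key", match.group(0)))
        # Keep the original length so the surrounding text keeps its layout
        body_len = len(match.group(0)) - len("sk-")
        alphabet = string.ascii_lowercase + string.digits
        return "sk-" + "".join(rng.choices(alphabet, k=body_len))

    text = EMAIL_RE.sub(fake_email, text)
    text = KEY_RE.sub(fake_key, text)
    return text, findings


raw = "My email is alex@example.com and my key is sk-12345"
clean, findings = sanitize_sketch(raw, seed=42)
print(clean)
print(findings)
```

Returning the findings alongside the text mirrors the idea of a replacement summary: the caller can audit what was matched without diffing the two strings.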

Run the Samples

Try out FakeSmith on the included sample datasets (JSON, CSV, and .env) using the demo script:

Setup Environment

python3 -m venv venv
source venv/bin/activate
pip install faker pytest

Run the Samples

To run any script in the examples/ folder while working on the source code, you must set PYTHONPATH to the current directory (the repository root):

# Set PYTHONPATH to the root so Python can find the 'fakesmith' package
export PYTHONPATH=$PYTHONPATH:.

# Run the main demo
python3 examples/demo_all.py

# Or run any individual sample
python3 examples/export_to_sql_csv.py
python3 examples/sanitize_logs_in_place.py

Explore the examples/ directory

The examples/ folder contains several targeted scripts illustrating different features (auto-detection, manual schemas, in-place sanitization, etc.).

Override Auto-Detection

smith = FakeSmith.from_sample(
    my_json,
    overrides={
        # Auto-detected "status" as SENTENCE — override to proper STATUS
        "status": SchemaField("status", FieldType.STATUS, choices=["open", "closed", "resolved"]),
        # Keep a realistic amount range
        "balance": SchemaField("balance", FieldType.AMOUNT, min_value=0, max_value=100000),
    }
)

Custom Fields

import random

smith = FakeSmith([
    SchemaField("ref_code", FieldType.CUSTOM,
        generator=lambda: f"REF-{random.randint(10000, 99999)}"
    ),
    SchemaField("tier", FieldType.CUSTOM,
        generator=lambda: random.choice(["bronze", "silver", "gold", "platinum"])
    ),
])

Supported Field Types

Category	Types
Identity	UUID, FULL_NAME, FIRST_NAME, LAST_NAME, USERNAME, EMAIL, PHONE, PASSWORD, PASSWORD_HASH
Location	ADDRESS, CITY, STATE, COUNTRY, ZIP_CODE, LATITUDE, LONGITUDE
Finance	CARD_NUMBER, CARD_EXPIRY, CARD_CVV, BANK_ACCOUNT, IBAN, AMOUNT, CURRENCY
Business	COMPANY, JOB_TITLE, DEPARTMENT, API_KEY, SECRET_TOKEN, JWT_TOKEN, WEBHOOK_URL
Dates	DATETIME, DATE, TIME, DATE_OF_BIRTH, TIMESTAMP
Web & Tech	IP_ADDRESS, IPV6, MAC_ADDRESS, USER_AGENT, URL, DOMAIN, SLUG, JWT_TOKEN
Content	WORD, SENTENCE, PARAGRAPH, TITLE, DESCRIPTION, TAG
Numeric	INTEGER, FLOAT, BOOLEAN, PERCENTAGE
Enums	STATUS, GENDER, CUSTOM

Why FakeSmith?

LLM-safe — no real credentials, PII, or secrets ever leave your machine
Zero config — paste a sample and go
Structurally identical — same field names, same types, realistic values
All formats — JSON, CSV, SQL, .env
Extensible — override any field with a custom generator
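The "structurally identical" claim can be checked independently of the library. The helper below is a hedged sketch and not part of FakeSmith: `same_shape` is a hypothetical function that compares field names, nesting, and value types between a real record and a generated one.

```python
def same_shape(real, fake):
    """True when both values share keys, nesting, and value types."""
    if type(real) is not type(fake):
        return False
    if isinstance(real, dict):
        return real.keys() == fake.keys() and all(
            same_shape(real[key], fake[key]) for key in real
        )
    if isinstance(real, list):
        # Lengths may differ; each fake item is compared to the first real item
        if not real:
            return True
        return all(same_shape(real[0], item) for item in fake)
    # Scalars: matching types is enough, the values are expected to differ
    return True


real_record = {"email": "john.doe@company.com", "amount": 199.99, "tags": ["vip"]}
fake_record = {"email": "someone@example.com", "amount": 42.5, "tags": ["a", "b"]}
print(same_shape(real_record, fake_record))
```

A check like this fits naturally into a test suite, where it can confirm that code written against fake data will also accept the real payloads.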

License

MIT
