Warehouses in Pangolin define where your actual data files are stored (Parquet, Avro, ORC files), separate from the catalog metadata.
It's important to understand the distinction:
| Component | Purpose | Storage | Examples |
|---|---|---|---|
| Backend Storage | Catalog metadata | PostgreSQL, MongoDB, SQLite | Table schemas, partitions, snapshots |
| Warehouse Storage | Actual data files | S3, Azure Blob, GCS | Parquet files, metadata.json |
```
┌─────────────────────────────────────┐
│ Pangolin Catalog (Backend)          │
│ - Table schemas                     │
│ - Partition info                    │
│ - Snapshot metadata                 │
│ Stored in: PostgreSQL/Mongo/SQLite  │
└──────────────┬──────────────────────┘
               │ Points to
               ▼
┌─────────────────────────────────────┐
│ Warehouse (Object Storage)          │
│ - Parquet data files                │
│ - Iceberg metadata files            │
│ - Manifest files                    │
│ Stored in: S3/Azure/GCS             │
└─────────────────────────────────────┘
```
A warehouse in Pangolin is a named configuration that specifies:
- Storage type: S3, Azure Blob Storage, or Google Cloud Storage
- Location: Bucket/container and path prefix
- Credentials: How to authenticate (static credentials or STS/IAM roles)
- Region: Geographic location of storage
Tip
Flat Key Support: As of v0.1.0, `storage_config` supports flat keys (e.g., `"s3.bucket": "mybucket"`) as an alternative to nested objects. This is often easier to pass via CLI or environment-driven scripts.
```json
{
  "name": "production-s3",
  "storage_config": {
    "s3.bucket": "my-company-datalake",
    "s3.region": "us-east-1"
  },
  "vending_strategy": {
    "type": "AwsSts",
    "role_arn": "arn:aws:iam::123456789:role/PangolinDataAccess"
  }
}
```

The catalog configuration includes a warehouse reference:
```json
{
  "name": "analytics",
  "type": "local",
  "warehouse": "production-s3",
  "properties": {}
}
```

Benefits:
- Centralized credential management
- Consistent storage configuration
- Automatic credential vending to clients
- Easier to manage and audit
Client Configuration: Minimal - Pangolin vends credentials automatically
The catalog has no warehouse attached:
```json
{
  "name": "analytics",
  "type": "local",
  "warehouse": null,
  "properties": {}
}
```

Benefits:
- Clients control their own storage access
- Flexible for multi-cloud scenarios
- Useful when clients have their own credentials
Client Configuration: Clients must configure storage themselves
Pangolin uses the `vending_strategy` field to configure credential vending. The `use_sts` field is deprecated but kept for backward compatibility.
Available Strategies:
```json
{
  "name": "prod-s3",
  "storage_config": {
    "bucket": "my-datalake",
    "region": "us-east-1",
    "s3.role-arn": "arn:aws:iam::123456789:role/PangolinDataAccess"
  },
  "vending_strategy": {
    "AwsSts": {
      "role_arn": "arn:aws:iam::123456789:role/PangolinDataAccess"
    }
  }
}
```

Note
STS vending requires the `PANGOLIN_STS_ROLE_ARN` server configuration.
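For example, the role ARN could be exported as an environment variable before starting the Pangolin server (the ARN below is a placeholder, not a real role):

```shell
# Placeholder ARN - replace with the IAM role Pangolin should assume for STS vending
export PANGOLIN_STS_ROLE_ARN="arn:aws:iam::123456789:role/PangolinDataAccess"
```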
```json
{
  "name": "dev-s3",
  "storage_config": {
    "bucket": "my-dev-datalake",
    "region": "us-east-1",
    "s3.access-key-id": "AKIAIOSFODNN7EXAMPLE",
    "s3.secret-access-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  },
  "vending_strategy": "AwsStatic"
}
```

Pros: Simple, works everywhere (good for development)
Cons: Less secure; credentials don't expire
```json
{
  "name": "client-provided",
  "storage_config": {
    "bucket": "my-datalake"
  },
  "vending_strategy": "None"
}
```

Clients must provide their own credentials via environment variables or Spark/Iceberg config.
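With the `None` strategy, a client might export the standard AWS environment variables before running its job (the values below are placeholders):

```shell
# Placeholder credentials - replace with your own
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_REGION="us-east-1"
```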
The `use_sts` boolean field is deprecated. Use `vending_strategy` instead.
Old Format (Deprecated):
```json
{
  "use_sts": true,
  "role_arn": "arn:aws:iam::123:role/Access"
}
```

New Format (Current):

```json
{
  "vending_strategy": {
    "type": "AwsSts",
    "role_arn": "arn:aws:iam::123:role/Access",
    "external_id": null
  }
}
```

| Storage | Status | Best For |
|---|---|---|
| AWS S3 | ✅ Production | Most common, excellent performance |
| Azure Blob | ✅ Production | Azure-native deployments |
| Google Cloud Storage | ✅ Production | GCP-native deployments |
| Local Filesystem | ⚠️ Development | Local development & testing |
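To illustrate the mapping between the deprecated and current formats, here is a small sketch. The `migrate_vending` helper is hypothetical (Pangolin handles backward compatibility internally); it only shows how the fields correspond:

```python
def migrate_vending(config: dict) -> dict:
    """Rewrite a deprecated use_sts/role_arn pair as a vending_strategy.

    Hypothetical helper, shown only to illustrate how the deprecated
    fields map onto the current format.
    """
    if config.pop("use_sts", False):
        config["vending_strategy"] = {
            "type": "AwsSts",
            "role_arn": config.pop("role_arn", None),
            "external_id": None,
        }
    return config

old = {"use_sts": True, "role_arn": "arn:aws:iam::123:role/Access"}
new = migrate_vending(old)
# new["vending_strategy"] == {"type": "AwsSts",
#     "role_arn": "arn:aws:iam::123:role/Access", "external_id": None}
```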
```bash
curl -X POST http://localhost:8080/api/v1/warehouses \
  -H "X-Pangolin-Tenant: my-tenant" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "production-s3",
    "storage_config": {
      "bucket": "my-datalake",
      "region": "us-east-1"
    },
    "vending_strategy": {
      "type": "AwsSts",
      "role_arn": "arn:aws:iam::123456789:role/DataAccess",
      "external_id": null
    }
  }'
```

```bash
curl -X POST http://localhost:8080/api/v1/catalogs \
  -H "X-Pangolin-Tenant: my-tenant" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "analytics",
    "type": "local",
    "warehouse": "production-s3"
  }'
```

```python
from pyiceberg.catalog import load_catalog

# Pangolin vends credentials automatically
catalog = load_catalog(
    "pangolin",
    **{
        "uri": "http://localhost:8080/api/v1/catalogs/analytics",
        "warehouse": "s3://my-datalake/analytics/",
    }
)

# Create table - Pangolin handles storage access
# (assumes `schema` was defined earlier)
catalog.create_table(
    "db.table",
    schema=schema,
)
```

When a catalog has a warehouse attached, Pangolin automatically vends credentials to clients via the `X-Iceberg-Access-Delegation` header.
- PyIceberg: No storage configuration needed
- PySpark: No storage configuration needed
When a catalog has no warehouse, clients must configure storage themselves.
- PyIceberg: Configure S3/Azure/GCS credentials
- PySpark: Configure Hadoop filesystem properties
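For PyIceberg, client-side storage configuration might look like the sketch below. The property names follow PyIceberg's S3 FileIO configuration conventions, and every value is a placeholder:

```python
# Sketch: client-supplied S3 credentials for a catalog with no attached
# warehouse. Property names follow PyIceberg's FileIO configuration;
# all values here are placeholders.
client_props = {
    "uri": "http://localhost:8080/api/v1/catalogs/analytics",
    "s3.region": "us-east-1",
    "s3.access-key-id": "AKIAIOSFODNN7EXAMPLE",   # placeholder
    "s3.secret-access-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",  # placeholder
}

# Passing the properties to PyIceberg (requires a running server):
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("pangolin", **client_props)
```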
See the individual storage guides for details.
- Use STS/IAM Roles: Prefer temporary credentials over static keys
- Least Privilege: Grant minimum required permissions
- Separate Warehouses: Use different warehouses for dev/staging/prod
- Audit Access: Enable CloudTrail/Azure Monitor/GCS audit logs
- Regional Colocation: Place warehouse in same region as compute
- Bucket Naming: Use descriptive, hierarchical names
- Lifecycle Policies: Archive old data to cheaper storage tiers
- Compression: Use Snappy or Zstd for Parquet files
- Naming Convention: `{environment}-{region}-{purpose}`
  - Examples: `prod-us-east-1-analytics`, `dev-eu-west-1-ml`
- Path Structure: `s3://bucket/{catalog}/{namespace}/{table}/`
- Multi-Tenant: Use separate buckets or prefixes per tenant
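These conventions can be captured in a couple of small helpers. The functions below are hypothetical (not part of Pangolin) and simply encode the naming and path patterns above:

```python
def warehouse_name(environment: str, region: str, purpose: str) -> str:
    """Build a bucket name following {environment}-{region}-{purpose}."""
    return f"{environment}-{region}-{purpose}"


def table_path(bucket: str, catalog: str, namespace: str, table: str) -> str:
    """Build a data path following s3://bucket/{catalog}/{namespace}/{table}/."""
    return f"s3://{bucket}/{catalog}/{namespace}/{table}/"


print(warehouse_name("prod", "us-east-1", "analytics"))
# prod-us-east-1-analytics
print(table_path("prod-us-east-1-analytics", "analytics", "db", "events"))
# s3://prod-us-east-1-analytics/analytics/db/events/
```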
Error: Access Denied to s3://my-bucket/path/
Solutions:
- Check IAM role permissions
- Verify bucket policy
- Check STS assume role permissions
- Verify warehouse configuration
Error: No credentials provided
Solutions:
- Ensure catalog has warehouse attached
- Check the warehouse `vending_strategy` (formerly `use_sts`) setting
- Verify the IAM role ARN
- Check Pangolin server has permission to assume role
Error: Slow read/write performance

Solutions:
- Check region - ensure compute and storage are colocated
- Enable S3 Transfer Acceleration
- Use larger instance types for compute
- Check network bandwidth