This guide will help you install and configure the Agentic Data Engineering Platform for healthcare data processing.
| Component | Minimum | Recommended |
|---|---|---|
| Operating System | Linux, macOS, Windows 10+ | Linux (Ubuntu 20.04+) |
| Python | 3.8+ | 3.9+ |
| Memory | 8 GB RAM | 16+ GB RAM |
| Storage | 10 GB free space | 50+ GB free space |
| Network | Internet connection | High-speed broadband |
- Unity Catalog enabled workspace
- Admin permissions for initial setup
- Compute access for cluster creation
- Workspace files access for deploying code
Choose one of the following:
AWS S3
- S3 bucket for raw data storage
- IAM role with S3 read/write permissions
- VPC endpoints (recommended for security)
Azure Data Lake Storage (ADLS)
- ADLS Gen2 storage account
- Service Principal with storage permissions
- Private endpoints (recommended)
Google Cloud Storage (GCS)
- GCS bucket for data storage
- Service Account with storage admin role
- Private Google Access enabled
- Databricks workspace URL and Personal Access Token
- Cloud storage credentials (access keys, service principals, etc.)
- Network access to Databricks APIs and cloud storage
- Admin access to configure Unity Catalog (initial setup only)
# Clone the repository
git clone https://github.com/yourorg/agenticdataengineering.git
cd agenticdataengineering
# Verify structure
ls -laOption A: Using venv (Recommended)
# Create virtual environment
python -m venv venv
# Activate environment
# Linux/macOS:
source venv/bin/activate
# Windows:
venv\Scripts\activate
# Verify Python version
python --version # Should be 3.8+Option B: Using conda
# Create conda environment
conda create -n healthcare-platform python=3.9 -y
conda activate healthcare-platform# Install core dependencies
pip install -r requirements.txt
# Verify installation
python -c "import databricks.sdk; print('✅ Databricks SDK installed')"
python -c "import streamlit; print('✅ Streamlit installed')"# Copy environment template
cp .env.example .env
# Edit with your credentials
nano .env # or your preferred editor# Test CLI
python -m src.cli --help
# Test imports
python -c "from src.agents.quality import QualityEngine; print('✅ Platform ready')"# Clone with specific branch (if needed)
git clone -b main https://github.com/yourorg/agenticdataengineering.git
cd agenticdataengineering
# Check repository status
git status
git log --oneline -5Virtual Environment Best Practices
# Use specific Python version
python3.9 -m venv venv
# Upgrade pip and setuptools
pip install --upgrade pip setuptools wheel
# Install dependencies with version locking
pip install -r requirements.txt --no-deps
pip check # Verify no dependency conflictsEnvironment Variables
# Add to your shell profile (~/.bashrc, ~/.zshrc)
export HEALTHCARE_PLATFORM_HOME="$(pwd)"
export PATH="$HEALTHCARE_PLATFORM_HOME/venv/bin:$PATH"
# Reload shell or run:
source ~/.bashrc # or ~/.zshrcRequired Packages
# Core Databricks integration
pip install databricks-sdk>=0.20.0
# Data processing
pip install pyspark>=3.4.0 delta-spark>=2.4.0
# Web interface
pip install streamlit>=1.28.0 plotly>=5.17.0
# Data quality
pip install great-expectations>=0.17.0 pandera>=0.17.0
# Configuration and utilities
pip install pyyaml>=6.0 python-dotenv>=1.0.0 click>=8.1.0Optional Dependencies
# For advanced features (install if needed)
pip install mlflow>=2.5.0 # ML model tracking
pip install prometheus-client>=0.17.0 # Advanced monitoring
pip install sentry-sdk>=1.38.0 # Error trackingDevelopment Dependencies
# Only for development environments
pip install -r requirements-dev.txt
# Or individually:
pip install pytest>=7.4.0 pytest-cov>=4.1.0 black>=23.0.0 mypy>=1.6.0Create .env file with your specific configuration:
# Copy template
cp .env.example .envRequired Environment Variables
# Databricks Configuration
DATABRICKS_TOKEN=dapi-your-token-here
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_WORKSPACE_ID=your-workspace-id
# Unity Catalog Configuration
DATABRICKS_CATALOG_NAME=healthcare_data
DATABRICKS_SCHEMA_BRONZE=bronze
DATABRICKS_SCHEMA_SILVER=silver
DATABRICKS_SCHEMA_GOLD=goldCloud Storage Configuration
AWS S3
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_REGION=us-east-1
S3_BUCKET_RAW_DATA=healthcare-raw-dataAzure ADLS
AZURE_TENANT_ID=your-tenant-id
AZURE_CLIENT_ID=your-client-id
AZURE_CLIENT_SECRET=your-client-secret
AZURE_STORAGE_ACCOUNT=your-storage-accountGoogle Cloud Storage
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
GCP_PROJECT_ID=your-project-id
GCS_BUCKET=healthcare-data-bucketPersonal Access Token Creation
- Log in to your Databricks workspace
- Go to User Settings → Developer → Access tokens
- Click Generate new token
- Set expiration (recommend 90 days for production)
- Copy token and add to
.envfile
Unity Catalog Setup
# Verify Unity Catalog access
python -c "
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
catalogs = list(w.catalogs.list())
print(f'✅ Unity Catalog access verified. Found {len(catalogs)} catalogs')
"Initialize Unity Catalog Structure
# Initialize the healthcare data catalog
python -m src.cli catalog init
# Verify catalog creation
python -m src.cli catalog health healthcare_dataTest Platform Components
# Test quality engine
python -c "
from src.agents.quality.quality_engine import QualityEngine
print('✅ Quality Engine imported successfully')
"
# Test dashboard
streamlit run src/ui/dashboard.py &
sleep 5
curl -f http://localhost:8501 && echo '✅ Dashboard accessible'# Healthcare Data Platform Configuration
databricks:
host: "${DATABRICKS_HOST}"
token: "${DATABRICKS_TOKEN}"
workspace_id: "${DATABRICKS_WORKSPACE_ID}"
unity_catalog:
catalog_name: "healthcare_data"
default_storage_location: "s3://healthcare-data-platform/catalog/"
pipelines:
medicaid_claims:
source_path: "s3://healthcare-raw-data/medicaid/claims/"
target_table: "healthcare_data.silver.medicaid_claims"
schedule: "0 */6 * * *"
cluster_config:
node_type: "i3.xlarge"
min_workers: 2
max_workers: 10This file is automatically included and contains healthcare-specific validation rules. You can customize thresholds:
# Customize quality thresholds
global_thresholds:
critical_quality_score: 0.60 # Adjust as needed
warning_quality_score: 0.80
target_quality_score: 0.95
# Add custom healthcare rules
healthcare_quality_rules:
custom_member_validation:
member_id_patterns:
- "^[A-Z]{2}[0-9]{9}$" # Your state format# Run comprehensive health check
python -m src.cli status
# Expected output:
# ✅ System Health: 98.5%
# ✅ Data Quality: 96.2%
# ✅ Pipelines Running: 0/0 (none configured yet)
# ✅ Cost Optimization: Ready
# ✅ Self-Healing Actions: 0 today1. Databricks Connectivity
python -c "
from databricks.sdk import WorkspaceClient
from pyspark.sql import SparkSession
# Test workspace connection
w = WorkspaceClient()
clusters = list(w.clusters.list())
print(f'✅ Databricks connected. Found {len(clusters)} clusters')
# Test Unity Catalog
catalogs = list(w.catalogs.list())
print(f'✅ Unity Catalog accessible. Found {len(catalogs)} catalogs')
"2. Storage Access
# Test S3 access (if using AWS)
python -c "
import boto3
s3 = boto3.client('s3')
buckets = s3.list_buckets()['Buckets']
print(f'✅ S3 accessible. Found {len(buckets)} buckets')
"3. Dashboard Launch
# Start dashboard in background
python -m src.cli dashboard --port 8501 &
# Wait and test
sleep 10
curl -f http://localhost:8501/_stcore/health
echo '✅ Dashboard healthy'
# Stop dashboard
pkill -f streamlit# Create test pipeline (optional - requires data)
python -m src.cli pipeline create test_pipeline \
--source-path s3://your-test-bucket/sample-data/ \
--target-table healthcare_data.bronze.test_table
# Check pipeline status
python -m src.cli pipeline metrics test_pipelineSecurity Configuration
# Set restrictive file permissions
chmod 600 .env
chmod -R 755 src/
chmod +x src/cli.py
# Create dedicated user (Linux)
sudo useradd -r -s /bin/false healthcare-platform
sudo chown -R healthcare-platform:healthcare-platform .Service Configuration
# Create systemd service (Linux)
sudo tee /etc/systemd/system/healthcare-platform.service > /dev/null <<EOF
[Unit]
Description=Healthcare Data Platform Dashboard
After=network.target
[Service]
Type=simple
User=healthcare-platform
WorkingDirectory=/opt/healthcare-platform
ExecStart=/opt/healthcare-platform/venv/bin/python -m src.cli dashboard
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
# Enable and start service
sudo systemctl enable healthcare-platform
sudo systemctl start healthcare-platformDevelopment
# Development-specific settings
export ENVIRONMENT=development
export DEBUG_MODE=true
export LOG_LEVEL=DEBUG
export MOCK_EXTERNAL_SERVICES=trueStaging
# Staging environment
export ENVIRONMENT=staging
export DEBUG_MODE=false
export LOG_LEVEL=INFO
export COST_OPTIMIZATION_ENABLED=false # Disable for testingProduction
# Production environment
export ENVIRONMENT=production
export DEBUG_MODE=false
export LOG_LEVEL=WARNING
export HIPAA_COMPLIANT_MODE=true
export COST_OPTIMIZATION_ENABLED=true# docker-compose.yml
version: '3.8'
services:
healthcare-platform:
build: .
ports:
- "8501:8501"
environment:
- DATABRICKS_TOKEN=${DATABRICKS_TOKEN}
- DATABRICKS_HOST=${DATABRICKS_HOST}
volumes:
- ./config:/app/config
- ./logs:/app/logs
restart: unless-stopped# Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY src/ ./src/
COPY config/ ./config/
EXPOSE 8501
CMD ["python", "-m", "src.cli", "dashboard", "--host", "0.0.0.0"]Deploy with Docker
# Build and run
docker-compose up -d
# Check logs
docker-compose logs -f healthcare-platform
# Health check
curl http://localhost:8501/_stcore/health1. Python Version Conflicts
# Issue: "Python version not supported"
# Solution: Use pyenv to manage Python versions
curl https://pyenv.run | bash
pyenv install 3.9.16
pyenv local 3.9.162. Dependency Conflicts
# Issue: Package conflicts during installation
# Solution: Use clean environment
pip cache purge
pip install --no-cache-dir --force-reinstall -r requirements.txt3. Databricks Connection Issues
# Issue: "databricks.sdk authentication failed"
# Solution: Verify token and permissions
export DATABRICKS_TOKEN=your-token
python -c "
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
print(w.current_user.me())
"4. Import Errors
# Issue: "ModuleNotFoundError"
# Solution: Fix Python path
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
# Or add to .bashrc/.zshrc5. Permission Errors
# Issue: Permission denied
# Solution: Fix file permissions
sudo chown -R $USER:$USER .
chmod -R u+rwX .If you encounter issues:
- Check logs: Look in
/tmp/healthcare-pipeline.log - Verify configuration: Run
python -m src.cli status - Test connectivity: Use the verification commands above
- Consult documentation: See Troubleshooting Guide
- Open an issue: GitHub Issues
After successful installation:
- Configuration Guide - Detailed platform configuration
- Quick Start Tutorial - Create your first pipeline
- CLI Reference - Learn all available commands
- Dashboard Guide - Explore monitoring capabilities
Installation Complete! 🎉
Your Agentic Data Engineering Platform is now ready for healthcare data processing.