
Data Pipeline

Menus and scripts for collecting, processing, uploading, and automating analytics on CDN server logs.

Submodules

  • collection — gathers logs from v4 (Apache), v5 (OC4D), v3 (D-Hub), and v6 (OC4D with module paths)
  • process — parses collected logs into CSV summaries using a version-specific processor
  • upload — manual month filtering and S3 upload
  • automation — unattended runs with systemd

End‑to‑end flow

  1. Collect: copies logs into 00_DATA/LOCATION_logs_YYYY_MM_DD and decompresses any .gz archives
  2. Process: writes 00_DATA/00_PROCESSED/RUN/summary.csv using the processor that matches the log format
  3. Finalize + Upload: either manual (upload) or scheduled (automation/runner.sh)
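The collect step above can be sketched roughly as follows. This is an illustrative sketch only, not the actual collection script: the `collect` helper name is made up here, and only the run-directory naming (`00_DATA/LOCATION_logs_YYYY_MM_DD`) and the .gz decompression come from the flow described above.

```python
import gzip
import shutil
from datetime import date
from pathlib import Path

def collect(src_dir: str, location: str, data_root: str = "00_DATA") -> Path:
    """Copy raw logs into a dated run directory, decompressing any .gz files."""
    run_dir = Path(data_root) / f"{location}_logs_{date.today():%Y_%m_%d}"
    run_dir.mkdir(parents=True, exist_ok=True)
    for src in Path(src_dir).iterdir():
        if src.suffix == ".gz":
            # Decompress alongside the plain-text logs, dropping the .gz suffix.
            with gzip.open(src, "rb") as fin, open(run_dir / src.stem, "wb") as fout:
                shutil.copyfileobj(fin, fout)
        else:
            shutil.copy(src, run_dir / src.name)
    return run_dir
```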

Data contracts

  • Input logs (v4): text lines in Apache combined format (access.log*)
  • Input logs (v5): JSON per line with a message field that embeds HTTP request data
  • Input logs (v3): JSON per line with a message field; module paths embed a UUID: /modules/[uuid]/[module-name]/, /uploads/modules/[uuid]/[module-name]/, or /uploads/other-modules/[module-name]/
  • Input logs (v6): JSON per line with a message field; logs are stored under /var/log/oc4d, and module paths follow the same patterns as v3
  • Output CSV (summary.csv) columns (vary by processor) include at least:
    • IP Address, Access Date, Module Viewed, Status Code, Data Saved (GB), Device Used, Browser Used
    • Some processors (e.g., castle.py) also include Access Time and Location Viewed
    • dhub.py and log-v6.py use the same schema as logv2.py, extracting module names from extended module paths
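To make the v3/v6 contract concrete, here is a minimal sketch of pulling a module name out of one JSON log line. The function name, the exact JSON schema, and the regex are assumptions for illustration; only the path patterns and the per-line JSON message field come from the contracts above.

```python
import json
import re
from typing import Optional

# Matches the v3/v6-style module paths listed above:
#   /modules/[uuid]/[module-name]/, /uploads/modules/[uuid]/[module-name]/,
#   or /uploads/other-modules/[module-name]/
MODULE_PATH = re.compile(
    r"/(?:uploads/)?modules/[0-9a-fA-F-]{36}/(?P<name>[^/\s]+)/"
    r"|/uploads/other-modules/(?P<other>[^/\s]+)/"
)

def module_from_line(line: str) -> Optional[str]:
    """Return the module name from one JSON log line, or None if no module path."""
    record = json.loads(line)
    message = record.get("message", "")  # assumed: request data embedded here
    match = MODULE_PATH.search(message)
    if not match:
        return None
    return match.group("name") or match.group("other")
```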

Where to start

  • Use main.sh to drive the whole flow, or run any submodule on its own

See also: scripts/data/automation/README.md for unattended scheduling.