Content management, ingestion, and processing implementation for ThemisDB.
Provides multi-format content ingestion and processing for ThemisDB, handling JSON documents, images, geospatial data, and text extraction with MIME detection and zstd compression.
In scope: Multi-format content ingestion (JSON, images, documents), MIME type detection, text extraction and processing, image metadata extraction, geospatial data processing, zstd compression.
Out of scope: Full-text indexing (handled by search module), vector embedding generation (handled by LLM/RAG modules).
- content_manager.cpp: orchestrates the ingestion pipeline
- content_type.cpp: MIME detection and type classification
- text_processor.cpp: text extraction
- html_processor.cpp: HTML text extraction with boilerplate removal
- image_processor.cpp: image metadata
- pipeline/: processing stage pipeline
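To illustrate how an orchestrator can chain processing stages, here is a minimal sketch. It is not ThemisDB's actual API: `ContentItem`, `Stage`, `Pipeline`, `detect_type`, and `extract_text` are hypothetical names, and the stage logic is deliberately simplistic.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical item passed between stages (not a ThemisDB type).
struct ContentItem {
    std::string mime_type;  // filled in by a detection stage
    std::string raw;        // original payload
    std::string text;       // extracted text, if any
};

// A stage mutates the item and returns false to stop the pipeline.
using Stage = std::function<bool(ContentItem&)>;

class Pipeline {
public:
    // Appends a stage; returns *this so calls can be chained.
    Pipeline& add(Stage stage) {
        stages_.push_back(std::move(stage));
        return *this;
    }

    // Runs stages in insertion order; stops at the first stage that rejects the item.
    bool run(ContentItem& item) const {
        for (const auto& stage : stages_) {
            if (!stage(item)) return false;
        }
        return true;
    }

private:
    std::vector<Stage> stages_;
};

// Example stage (assumption): rejects empty payloads and guesses a type
// from the first byte only.
inline bool detect_type(ContentItem& item) {
    if (item.raw.empty()) return false;
    item.mime_type = (item.raw.front() == '{') ? "application/json" : "text/plain";
    return true;
}

// Example stage (assumption): a placeholder that copies the payload as text.
// A real extractor would dispatch on the detected MIME type.
inline bool extract_text(ContentItem& item) {
    item.text = item.raw;
    return true;
}
```

With these definitions, `Pipeline p; p.add(detect_type).add(extract_text);` builds a two-stage pipeline, and `p.run(item)` returns `false` as soon as a stage rejects the item, leaving later stages unexecuted.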
Maturity: 🟢 Production-Ready — Core content ingestion, PDF (poppler-cpp), Office OOXML/ODF (libzip+pugixml), legacy Office formats (.doc/.xls/.ppt via LibreOffice headless, CON-001), OCR (Tesseract, CON-002/CON-003), streaming ingestion, perceptual deduplication, and embedding pipeline are all operational.
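As a rough illustration of perceptual deduplication, the sketch below computes an average hash over an 8x8 grayscale thumbnail and compares hashes by Hamming distance. This README does not say which perceptual hash the module uses, so the algorithm, the names (`Thumb`, `average_hash`, `hamming_distance`, `is_near_duplicate`), and the default threshold of 5 bits are all assumptions.

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>

// 8x8 grayscale thumbnail, row-major. Decoding and downscaling the source
// image to this size is assumed to happen elsewhere.
using Thumb = std::array<std::uint8_t, 64>;

// Average hash: bit i is set when pixel i is brighter than the mean.
inline std::uint64_t average_hash(const Thumb& px) {
    unsigned sum = 0;
    for (auto v : px) sum += v;
    const unsigned mean = sum / 64;
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < px.size(); ++i) {
        if (px[i] > mean) h |= (std::uint64_t{1} << i);
    }
    return h;
}

// Number of bit positions in which the two hashes differ.
inline int hamming_distance(std::uint64_t a, std::uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}

// Two images count as near-duplicates when few hash bits differ.
// The default threshold is an illustrative assumption.
inline bool is_near_duplicate(std::uint64_t a, std::uint64_t b, int max_bits = 5) {
    return hamming_distance(a, b) <= max_bits;
}
```

Small brightness changes usually leave most bits unchanged, so near-identical images land within a few bits of each other, while unrelated images tend to differ in many positions.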
- Content manager
- Content type detection
- Text processors
- Image processors
- Geo processors
- Content ingestion pipeline
- Multi-format content ingestion (JSON, images, documents)
- MIME type detection
- Text extraction and processing
- Image metadata extraction
- Geospatial data processing
- Content compression (zstd)
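MIME type detection is commonly done by inspecting the leading "magic" bytes of a payload. The following is a minimal sketch of that technique, not the logic in content_type.cpp: `starts_with` and `sniff_mime` are hypothetical helpers, and the signature table covers only a few well-known formats.

```cpp
#include <cstddef>
#include <string>

// Returns true when `data` begins with the byte sequence `magic`.
inline bool starts_with(const std::string& data, const std::string& magic) {
    return data.size() >= magic.size() &&
           data.compare(0, magic.size(), magic) == 0;
}

// Hypothetical helper: guess a MIME type from the first bytes of a payload.
inline std::string sniff_mime(const std::string& data) {
    if (starts_with(data, "\x89PNG\r\n\x1a\n")) return "image/png";
    if (starts_with(data, "\xFF\xD8\xFF")) return "image/jpeg";
    if (starts_with(data, "%PDF-")) return "application/pdf";
    // ZIP local file header; OOXML and ODF files are ZIP containers, so
    // telling them apart would require inspecting the archive contents.
    if (starts_with(data, "PK\x03\x04")) return "application/zip";

    // Heuristic: the first non-whitespace character suggests JSON.
    const std::size_t i = data.find_first_not_of(" \t\r\n");
    if (i != std::string::npos && (data[i] == '{' || data[i] == '[')) {
        return "application/json";
    }
    return "application/octet-stream";  // unknown: generic binary fallback
}
```

Checking binary signatures before the text heuristic avoids misclassifying binary data that happens to contain a brace. A production detector would typically also consult the declared Content-Type or file extension.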
For content documentation, see:
- Content Manager
- Content Type
- Text Processor
- Content Architecture
- Content Pipeline
- Content Processors