This study introduces ORATIS-T, a multimodal machine learning framework designed to estimate trauma-relevant symptom severity from naturalistic speech. The objective was to evaluate whether linguistic and paralinguistic features extracted from semi-structured clinical interviews contain sufficient signal to support continuous symptom severity prediction, thereby demonstrating the feasibility of speech-based computational markers as adjuncts to traditional psychiatric assessment. The work is positioned as a methodological and architectural proof-of-concept, rather than a diagnostic system, with emphasis on reproducibility, interpretability, and scalability.
To enable processing under limited storage constraints and to reflect real-world system deployment conditions, the dataset was constructed incrementally. Participants were processed in small batches, with raw audio and transcript files deleted immediately following successful feature extraction. Only derived numerical features and embeddings were retained. This approach ensured that the final dataset was compact, non-identifiable, and reproducible without requiring redistribution of raw clinical speech data.
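The batch-and-discard strategy above can be sketched as follows. This is a minimal illustration, not the actual ORATIS-T code: `extract_features` is a hypothetical stand-in for the real linguistic/acoustic extraction step, and temporary files stand in for raw recordings.

```python
import os
import tempfile

def extract_features(path):
    # Hypothetical placeholder for the real extraction step; here the
    # file size stands in for a derived numerical feature.
    return {"n_bytes": os.path.getsize(path)}

def process_batch(paths):
    """Extract derived features, then delete raw files immediately."""
    features = []
    for path in paths:
        feats = extract_features(path)   # retain only derived numbers
        os.remove(path)                  # raw audio/transcript discarded
        features.append(feats)
    return features

# Demo with throwaway temp files standing in for raw recordings
paths = []
for _ in range(3):
    fd, p = tempfile.mkstemp(suffix=".wav")
    with os.fdopen(fd, "wb") as f:
        f.write(b"\x00" * 16)
    paths.append(p)

rows = process_batch(paths)
```

Because only the returned dictionaries are retained, the final dataset contains no raw clinical speech.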
Interview transcripts were lowercased and minimally normalized to preserve linguistic structure while reducing orthographic variability. No aggressive stop-word removal or lemmatization was applied, as such operations can obscure clinically relevant language patterns, including pronoun usage and disfluencies. Sentence tokenization was performed using rule-based tokenizers to ensure consistent segmentation across transcripts.
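A minimal sketch of this normalization and rule-based sentence segmentation, using only the standard library (the exact rules used by ORATIS-T are not specified, so this regex splitter is an assumption):

```python
import re

def normalize(text):
    # Lowercase and collapse whitespace; deliberately no stop-word
    # removal or lemmatization, so pronouns and disfluencies survive.
    return re.sub(r"\s+", " ", text.lower()).strip()

def split_sentences(text):
    # Simple rule-based splitter on sentence-final punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

raw = "I... I don't know.  It was hard!   Um, really hard."
sentences = split_sentences(normalize(raw))
# Disfluencies such as "i..." and "um," are preserved as tokens.
```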
Audio files were processed in their native sampling rate where possible. Feature extraction operated on full interview recordings rather than segmented utterances, yielding participant-level summary statistics. No speaker diarization or noise suppression was applied, reflecting the ecological conditions under which clinical interviews are typically conducted.
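The participant-level acoustic summarization can be illustrated as below. This is a simplified sketch: a real pipeline would first compute frame-level descriptors (e.g. MFCCs or pitch) before summarizing, and the 10 ms framing and energy feature here are illustrative assumptions.

```python
import numpy as np

def summarize_audio(y, sr):
    """Participant-level summary statistics over a full recording.

    Illustrative only: frame energy stands in for richer acoustic
    descriptors computed in a real pipeline.
    """
    frame = sr // 100  # ~10 ms frames
    frames = y[: len(y) // frame * frame].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1)          # per-frame energy
    return {
        "energy_mean": float(energy.mean()),
        "energy_var": float(energy.var()),
        "n_frames": int(len(energy)),
    }

sr = 16000                              # native sampling rate kept as-is
t = np.arange(sr) / sr                  # one second of audio
y = 0.1 * np.sin(2 * np.pi * 220 * t)   # synthetic 220 Hz tone
stats = summarize_audio(y, sr)
```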
Linguistic and acoustic features were aggregated at the participant level using summary statistics (mean, variance, counts). Dense semantic embeddings were pooled using arithmetic mean across sentences, yielding a fixed-dimensional representation per participant. Missing values were imputed with zeros, corresponding to absence of measurable signal rather than inferred estimates.
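The mean-pooling and zero-imputation steps can be sketched as follows; the feature names and dimensions are illustrative, not those of the actual pipeline.

```python
import numpy as np

def pool_participant(sentence_embeddings, scalar_features):
    """Mean-pool sentence embeddings; zero-impute missing scalars."""
    pooled = np.mean(np.stack(sentence_embeddings), axis=0)
    # Zeros denote absence of measurable signal, not inferred values.
    scalars = np.array([0.0 if v is None or np.isnan(v) else v
                        for v in scalar_features])
    return np.concatenate([pooled, scalars])

embs = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]  # two "sentences"
vec = pool_participant(embs, [0.5, float("nan"), None])
```

The result is a fixed-dimensional vector per participant regardless of interview length.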
A Random Forest regressor was selected as the primary estimator due to its ability to model nonlinear relationships without strong parametric assumptions, its robustness to small datasets, and its interpretability via feature importance measures. No extensive hyperparameter optimization was performed to reduce the risk of overfitting and to maintain methodological transparency.
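In scikit-learn terms, this amounts to fitting the estimator with default hyperparameters and a fixed seed, then reading off feature importances. The synthetic data below is a stand-in for the real participant features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 5))                        # stand-in features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=40)  # severity-like target

# Default hyperparameters and a fixed seed: no tuning, per the study design
model = RandomForestRegressor(random_state=42)
model.fit(X, y)
importances = model.feature_importances_  # interpretability hook
```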
The dataset was partitioned into training and testing subsets according to predefined split files. All experiments were conducted with fixed random seeds to ensure reproducibility.
Model performance was evaluated using:
• Mean Absolute Error (MAE)
• Root Mean Squared Error (RMSE)
• Coefficient of determination (R²)
Given the modest dataset size, R² values were interpreted cautiously and primarily used to assess relative performance trends rather than absolute predictive power.
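For concreteness, the three metrics reduce to the following definitions (written out in plain Python rather than via a metrics library; the toy values are illustrative):

```python
import math

def mae(y_true, y_pred):
    # Mean Absolute Error: average magnitude of errors
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root Mean Squared Error: penalizes larger deviations more
    return math.sqrt(sum((t - p) ** 2
                         for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - residual SS / total SS
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]   # illustrative severity scores
y_pred = [2.5, 5.5, 6.5, 9.5]
```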
- All statistical analyses were performed using Python-based scientific computing libraries.
- MAE was selected as the primary error metric due to its robustness to outliers and its direct interpretability in the units of the target severity score. RMSE was additionally reported to emphasize larger deviations and capture error variance, while R² was reported for completeness but interpreted conservatively given the known instability of variance-explained metrics in small, heterogeneous clinical datasets.
- No formal null-hypothesis significance testing was conducted, as the study objective was methodological feasibility rather than hypothesis confirmation. Instead, performance metrics were used descriptively to evaluate whether speech-derived features contained sufficient signal to support continuous symptom severity estimation beyond chance-level baselines.
- Where appropriate, performance was compared against a naïve baseline predictor (mean severity prediction) to contextualize model error magnitude. Statistical comparisons across feature subsets were conducted through ablation analyses rather than inferential tests.
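The naïve baseline is simply the training-set mean predicted for every test participant; a sketch with hypothetical severity values:

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_baseline(y_train, n_test):
    """Predict the training-set mean severity for every test case."""
    mean = sum(y_train) / len(y_train)
    return [mean] * n_test

y_train = [4.0, 6.0, 8.0]   # hypothetical training severity scores
y_test = [5.0, 7.0]
baseline_pred = mean_baseline(y_train, len(y_test))
baseline_mae = mae(y_test, baseline_pred)
```

A model whose MAE does not beat `baseline_mae` carries no usable signal beyond the marginal distribution of severity scores.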
All components of ORATIS-T were implemented in Python. The pipeline was modularized to allow independent modification of feature extraction, fusion, and modeling stages. All hyperparameters and paths were defined via external configuration files to support reproducibility. Derived datasets and trained models were stored in compact, portable formats (parquet, joblib) to facilitate replication without redistribution of raw audio data.
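Externalized configuration might look like the following sketch; the key names and paths are assumptions, not the actual ORATIS-T configuration schema.

```python
import json
import os
import tempfile

# Hypothetical external configuration file; keys and paths are
# illustrative, not the real ORATIS-T schema.
config_text = json.dumps({
    "features_path": "derived/features.parquet",
    "model_path": "models/oratis_t.joblib",
    "random_seed": 42,
})

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(config_text)
    cfg_path = f.name

# The pipeline reads all paths and hyperparameters from this file,
# so no values are hard-coded in the extraction/fusion/modeling stages.
with open(cfg_path) as f:
    config = json.load(f)

os.remove(cfg_path)
```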
This study used a publicly available dataset collected under institutional review. No new data were collected. All processing was conducted on de-identified data, and raw audio files were discarded after feature extraction to minimize privacy risk. ORATIS-T is intended strictly as a research and decision-support prototype and does not constitute a diagnostic system. Predictions generated by the model should not be interpreted as clinical diagnoses or used for individual-level decision-making without appropriate clinical oversight.
This study has several important limitations that constrain interpretation and generalization.
- First, the dataset size is modest and drawn from a single publicly available corpus. While sufficient to demonstrate methodological feasibility, the sample does not support strong claims of clinical validity or population-level generalizability. Larger, more diverse datasets will be required to establish robustness across demographics, dialects, and recording conditions.
- Second, the symptom severity labels used in this study were not originally curated for PTSD-specific diagnosis. Although the modeling framework is label-agnostic and readily extendable to PTSD-specific instruments, the present results should be interpreted as demonstrating speech-based symptom signal extraction rather than PTSD diagnosis per se.
- Third, the study relies on participant-level aggregation of features across entire interviews. This design sacrifices temporal resolution and may obscure dynamic symptom expression within sessions. Future work may benefit from utterance-level modeling and temporal architectures.
- Fourth, no clinician-in-the-loop validation was performed. As such, model outputs should not be interpreted as clinically actionable assessments. ORATIS-T is positioned strictly as a research and decision-support prototype.
- Finally, speech-based models are vulnerable to biases related to language, culture, accent, and socioeconomic context. These factors were not explicitly controlled for in the present study and must be addressed before any real-world deployment.
Taken together, these limitations emphasize that the present work is architectural and methodological, not diagnostic. The contribution lies in demonstrating that multimodal speech features can be systematically extracted, fused, and modeled in a reproducible manner, providing a foundation for future, clinically grounded investigations.
In summary, this work demonstrates methodological feasibility rather than clinical validity. Dataset size, label specificity, and demographic diversity limit generalizability. Future work should therefore focus on building PTSD-specific datasets, conducting longitudinal validation, and performing clinician-in-the-loop evaluation to broaden the applicability of the architecture.