
ORATIS-T

Study design and objective

This study introduces ORATIS-T, a multimodal machine learning framework designed to estimate trauma-relevant symptom severity from naturalistic speech. The objective was to evaluate whether linguistic and paralinguistic features extracted from semi-structured clinical interviews contain sufficient signal to support continuous symptom severity prediction, thereby demonstrating the feasibility of speech-based computational markers as adjuncts to traditional psychiatric assessment. The work is positioned as a methodological and architectural proof-of-concept, rather than a diagnostic system, with emphasis on reproducibility, interpretability, and scalability.

1. Data & Methods

Incremental dataset construction

To enable processing under limited storage constraints and to reflect real-world system deployment conditions, the dataset was constructed incrementally. Participants were processed in small batches, with raw audio and transcript files deleted immediately following successful feature extraction. Only derived numerical features and embeddings were retained. This approach ensured that the final dataset was compact, non-identifiable, and reproducible without requiring redistribution of raw clinical speech data.
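
The batch-then-delete pattern described above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: `dummy_extractor` is a hypothetical stand-in for the real feature extractors, and the temp-file demonstration exists only to show that raw files are gone while derived features survive.

```python
import os
import tempfile
from pathlib import Path

def process_batch(audio_paths, extract_features):
    """Extract features from each raw file, then delete the file.

    Only derived numerical features are retained, so the final dataset
    stays compact and contains no raw clinical speech data.
    """
    retained = {}
    for path in audio_paths:
        retained[Path(path).stem] = extract_features(path)  # derive first
        os.remove(path)  # discard raw audio only after extraction succeeds
    return retained

# Hypothetical stand-in extractor: file size as the sole "feature".
def dummy_extractor(path):
    return {"n_bytes": os.path.getsize(path)}

# Demonstration on a throwaway file standing in for a raw recording.
tmp_dir = tempfile.mkdtemp()
wav_path = os.path.join(tmp_dir, "p001.wav")
with open(wav_path, "wb") as fh:
    fh.write(b"\x00" * 16)
features = process_batch([wav_path], dummy_extractor)
raw_still_exists = os.path.exists(wav_path)
```

In the real pipeline, deletion would follow only a successful extraction, so a failed batch can be retried against the original audio.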

Transcript preprocessing

Interview transcripts were lowercased and minimally normalized to preserve linguistic structure while reducing orthographic variability. No aggressive stop-word removal or lemmatization was applied, as such operations can obscure clinically relevant language patterns, including pronoun usage and disfluencies. Sentence tokenization was performed using rule-based tokenizers to ensure consistent segmentation across transcripts.
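
A minimal sketch of this preprocessing, assuming a simple regex-based splitter as the "rule-based tokenizer" (the README does not name the actual tokenizer used). Note that fillers and disfluencies are deliberately preserved:

```python
import re

def preprocess_transcript(text):
    """Lowercase and minimally normalize a transcript.

    Stop words, pronouns, and disfluencies (e.g. "um", "uh") are kept,
    since they can carry clinically relevant signal.
    """
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace only
    return text

def split_sentences(text):
    """Rule-based sentence tokenization on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

normalized = preprocess_transcript("I  was... um, I was there.  It was LOUD!")
sentences = split_sentences(normalized)
```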

Acoustic preprocessing

Audio files were processed in their native sampling rate where possible. Feature extraction operated on full interview recordings rather than segmented utterances, yielding participant-level summary statistics. No speaker diarization or noise suppression was applied, reflecting the ecological conditions under which clinical interviews are typically conducted.
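
Participant-level acoustic summaries of the kind described can be illustrated with standard-library code on a synthetic tone. The specific features below (frame energy statistics, zero-crossing rate) are illustrative assumptions, not the exact feature set used:

```python
import math
from statistics import mean, pvariance

def acoustic_summary(samples):
    """Participant-level summary statistics over a full recording.

    Operates on the raw sample sequence at its native rate; no
    diarization or noise suppression is applied.
    """
    energy = [s * s for s in samples]
    # Zero-crossing count as a coarse proxy for spectral content.
    zero_crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return {
        "energy_mean": mean(energy),
        "energy_var": pvariance(energy),
        "zcr": zero_crossings / max(len(samples) - 1, 1),
    }

# Synthetic 100 Hz tone sampled at 8 kHz, purely for illustration.
tone = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(8000)]
summary = acoustic_summary(tone)
```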

Feature aggregation

Linguistic and acoustic features were aggregated at the participant level using summary statistics (mean, variance, counts). Dense semantic embeddings were pooled using arithmetic mean across sentences, yielding a fixed-dimensional representation per participant. Missing values were imputed with zeros, corresponding to absence of measurable signal rather than inferred estimates.
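
The mean pooling and zero imputation described above reduce to a short routine. This sketch assumes embeddings are plain lists of floats; the real pipeline presumably uses array types:

```python
def mean_pool(sentence_embeddings, dim):
    """Pool sentence-level embeddings into one participant vector.

    An empty input (no measurable signal) yields an all-zero vector,
    i.e. zero imputation rather than an inferred estimate.
    """
    if not sentence_embeddings:
        return [0.0] * dim  # zero imputation for missing signal
    n = len(sentence_embeddings)
    return [sum(vec[i] for vec in sentence_embeddings) / n for i in range(dim)]
```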

Model selection rationale

A Random Forest regressor was selected as the primary estimator due to its ability to model nonlinear relationships without strong parametric assumptions, its robustness to small datasets, and its interpretability via feature importance measures. No extensive hyperparameter optimization was performed to reduce the risk of overfitting and to maintain methodological transparency.
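
A near-default scikit-learn configuration mirrors the stated decision to forgo extensive tuning. The seed value, estimator count, and synthetic data below are illustrative assumptions, not the study's actual settings:

```python
from sklearn.ensemble import RandomForestRegressor

# Near-default configuration: only a fixed seed is set, reflecting the
# decision to avoid extensive hyperparameter optimization.
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Tiny synthetic example: one feature, noiseless linear target.
X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
model.fit(X, y)

# Per-feature importances are available after fitting, supporting the
# interpretability rationale.
importances = model.feature_importances_
```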

2. Training & Evaluation

Training protocol

The dataset was partitioned into training and testing subsets according to predefined split files. All experiments were conducted with fixed random seeds to ensure reproducibility.
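
A seeded partition can be sketched as below. In ORATIS-T the split is read from predefined split files; this fallback, and the specific seed and fraction, are assumptions for illustration only:

```python
import random

def deterministic_split(ids, test_fraction=0.2, seed=42):
    """Reproducible train/test partition of participant IDs.

    An isolated, seeded generator guarantees the same partition on
    every run, matching the fixed-seed protocol.
    """
    rng = random.Random(seed)  # does not disturb global random state
    shuffled = ids[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```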

Evaluation metrics

Model performance was evaluated using:
• Mean Absolute Error (MAE)
• Root Mean Squared Error (RMSE)
• Coefficient of determination (R²)
Given the modest dataset size, R² values were interpreted cautiously and primarily used to assess relative performance trends rather than absolute predictive power.
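
The three metrics have standard definitions, shown here from first principles (in practice a library implementation would likely be used):

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large deviations more heavily."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```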

3. Statistical Analysis

  • All statistical analyses were performed using Python-based scientific computing libraries. Model performance was assessed using mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²).
  • MAE was selected as the primary error metric due to its robustness to outliers and its direct interpretability in the units of the target severity score. RMSE was additionally reported to emphasize larger deviations and capture error variance. R² values were reported for completeness but interpreted conservatively given the modest sample size and the known instability of variance-explained metrics in small, heterogeneous clinical datasets.
  • Train–test splits followed the predefined dataset partitions to avoid information leakage. All random processes, including data partitioning and model initialization, were controlled using fixed random seeds to ensure reproducibility.
  • No formal null-hypothesis significance testing was conducted, as the study objective was methodological feasibility rather than hypothesis confirmation. Instead, performance metrics were used descriptively to evaluate whether speech-derived features contained sufficient signal to support continuous symptom severity estimation beyond chance-level baselines.
  • Where appropriate, performance was compared against a naïve baseline predictor (mean severity prediction) to contextualize model error magnitude. Statistical comparisons across feature subsets were conducted through ablation analyses rather than inferential tests.
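
The naïve baseline comparison mentioned above amounts to scoring a constant mean-severity predictor. A minimal sketch, assuming MAE as the comparison metric:

```python
def baseline_mae(y_train, y_test):
    """MAE of a mean-severity predictor fit on the training labels.

    Used descriptively to contextualize model error magnitude, not
    for inferential testing.
    """
    mean_pred = sum(y_train) / len(y_train)
    return sum(abs(y - mean_pred) for y in y_test) / len(y_test)
```

A model whose MAE does not fall below this baseline has extracted no usable signal beyond the label distribution itself.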

4. Implementation & Reproducibility

All components of ORATIS-T were implemented in Python. The pipeline was modularized to allow independent modification of feature extraction, fusion, and modeling stages. All hyperparameters and paths were defined via external configuration files to support reproducibility. Derived datasets and trained models were stored in compact, portable formats (parquet, joblib) to facilitate replication without redistribution of raw audio data.
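
The external-configuration pattern can be illustrated with a JSON file; the key names and values below are hypothetical, not the actual ORATIS-T configuration schema:

```python
import json
import tempfile

# Hypothetical configuration: all hyperparameters and paths live
# outside the code, so runs can be reproduced from the file alone.
config_text = """
{
  "features_path": "data/features.parquet",
  "model_path": "models/rf.joblib",
  "random_seed": 42,
  "test_fraction": 0.2
}
"""

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(config_text)
    cfg_path = f.name

with open(cfg_path) as f:
    cfg = json.load(f)
```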

5. Ethical Considerations

This study used a publicly available dataset collected under institutional review. No new data were collected. All processing was conducted on de-identified data, and raw audio files were discarded after feature extraction to minimize privacy risk. ORATIS-T is intended strictly as a research and decision-support prototype and does not constitute a diagnostic system. Predictions generated by the model should not be interpreted as clinical diagnoses or used for individual-level decision-making without appropriate clinical oversight.

6. Limitations & Scoping

This study has several important limitations that constrain interpretation and generalization.

  • First, the dataset size is modest and drawn from a single publicly available corpus. While sufficient to demonstrate methodological feasibility, the sample does not support strong claims of clinical validity or population-level generalizability. Larger, more diverse datasets will be required to establish robustness across demographics, dialects, and recording conditions.
  • Second, the symptom severity labels used in this study were not originally curated for PTSD-specific diagnosis. Although the modeling framework is label-agnostic and readily extendable to PTSD-specific instruments, the present results should be interpreted as demonstrating speech-based symptom signal extraction rather than PTSD diagnosis per se.
  • Third, the study relies on participant-level aggregation of features across entire interviews. This design sacrifices temporal resolution and may obscure dynamic symptom expression within sessions. Future work may benefit from utterance-level modeling and temporal architectures.
  • Fourth, no clinician-in-the-loop validation was performed. As such, model outputs should not be interpreted as clinically actionable assessments. ORATIS-T is positioned strictly as a research and decision-support prototype.
  • Finally, speech-based models are vulnerable to biases related to language, culture, accent, and socioeconomic context. These factors were not explicitly controlled for in the present study and must be addressed before any real-world deployment.

Taken together, these limitations emphasize that the present work is architectural and methodological, not diagnostic. The contribution lies in demonstrating that multimodal speech features can be systematically extracted, fused, and modeled in a reproducible manner, providing a foundation for future, clinically grounded investigations.

To summarize, this work demonstrates methodological feasibility rather than clinical validity. Dataset size, label specificity, and demographic diversity limit generalizability. Future work should therefore focus on curating foundational PTSD-specific datasets, longitudinal validation, and clinician-in-the-loop evaluation to broaden the applicability of the architecture.

About

A multimodal ML pipeline for estimating and quantifying PTSD severity using psycholinguistic and acoustic markers, combining transformer-based embeddings with paralinguistic speech analysis in a deployable research framework.
