feat: auto-extract multi-language DVB subtitles into per-language files (#447) #2243

Open
ujjwalr27 wants to merge 6 commits into CCExtractor:master from ujjwalr27:feature/multi-dvb-subtitle-extraction

@ujjwalr27
Contributor

[FEATURE] Auto-extract multi-language DVB subtitles into per-language files

Closes #447

In raising this pull request, I confirm the following (please check boxes):

Reason for this PR:

  • This PR adds new functionality.
  • This PR fixes a bug that I have personally experienced or that a real user has reported and for which a sample exists.
  • This PR is porting code from C to Rust.

Sanity check:

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • If the PR adds new functionality, I've added it to the changelog. If it's just a bug fix, I have NOT added it to the changelog.
  • I am NOT adding new C code unless it's to fix an existing, reproducible bug.

⚠️ This PR adds new C code for a feature requested in #447 by a real user, with a provided sample file.


Description

Implements #447 — when a DVB/TS recording contains multiple DVB subtitle streams, CCExtractor now automatically detects each stream and writes subtitles to separate files named by ISO-639 language code. No manual configuration or pre-inspection of the file is required.

Before:

ccextractor arte_multiaudio.ts
# → arte_multiaudio.srt   (only first/primary stream extracted)
# French DVB subtitle stream silently ignored

After:

ccextractor arte_multiaudio.ts
# → arte_multiaudio.srt        (teletext / primary stream)
# → arte_multiaudio_fra.srt    (French DVB subtitles, auto-detected)

No new CLI flags. Fully automatic. Single-stream recordings are unaffected.


Repro Instructions

Test 1 — arte_multiaudio.ts (from issue #447)

Download: https://www.dropbox.com/s/5oaqnjgqq1cqzky/arte_multiaudio.ts?dl=0

The file contains:

| PID | Type | Language |
| --- | --- | --- |
| 0x103 (259) | DVB Teletext | deu |
| 0x104 (260) | DVB Subtitle | deu (no bitmap packets in this recording) |
| 0x106 (262) | DVB Subtitle | fra |

Before this PR (on master):

./ccextractor arte_multiaudio.ts
# Only produces arte_multiaudio.srt (teletext)
# French DVB subtitle stream is silently ignored

After this PR:

./ccextractor arte_multiaudio.ts
DVB subtitle PID 260 language: deu
DVB subtitle PID 262 language: fra
...
-rw-r--r-- 4106 arte_multiaudio.srt       <- Teletext subtitles
-rw-r--r-- 3924 arte_multiaudio_fra.srt   <- DVB bitmap subtitles (fra, newly extracted)
Exit code: 0

Also verified with --codec dvbsub:

./ccextractor arte_multiaudio.ts --codec dvbsub
# → arte_multiaudio_fra.srt  (3924 bytes)
# Exit code: 0

Test 2 — DVB-only file with two subtitle streams (deu + fra)

A recording with no teletext, only two DVB subtitle PIDs:

| PID | Type | Language |
| --- | --- | --- |
| index 2 | DVB Subtitle | deu |
| index 3 | DVB Subtitle | fra |

./ccextractor test_two_dvb.ts
DVB subtitle PID ... language: deu
DVB subtitle PID ... language: fra
...
-rw-r--r-- 3924 test_two_dvb_deu.srt   <- German-tagged DVB subtitles
-rw-r--r-- 3924 test_two_dvb_fra.srt   <- French-tagged DVB subtitles
Exit code: 0

Both files are produced automatically in a single pass, with no flags or prior knowledge of how many subtitle streams exist.


Implementation

Files changed

| File | Change |
| --- | --- |
| src/lib_ccx/ccx_demuxer.h | Add char lang[4] to cap_info struct |
| src/lib_ccx/ts_tables.c | Parse ISO-639 code from DVB subtitle descriptor in PMT |
| src/lib_ccx/ts_info.c | Propagate lang in update_capinfo(); protect DVB streams from ignore_other_stream() |
| src/lib_ccx/lib_ccx.c | Per-PID encoder/decoder routing; fix two segfaults in cleanup |
| src/lib_ccx/general_loop.c | Secondary loop to process all non-primary DVB subtitle PIDs |
| src/rust/src/demuxer/common_types.rs | Add lang: [i8; 4] to CapInfo |
| src/rust/src/ctorust.rs | Propagate lang in FromCType<cap_info> |
| src/rust/src/common.rs | Propagate lang in CType<cap_info> |
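
As a rough illustration of the ts_tables.c change, the sketch below reads the language code from a DVB subtitling descriptor (tag 0x59), whose entries are laid out per ETSI EN 300 468 as a 3-byte ISO-639 code, a 1-byte subtitling type, and two 2-byte page IDs. The function name `parse_dvb_sub_lang` and its signature are hypothetical and are not taken from the PR; it only handles the first entry of the descriptor.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DVB_SUBTITLE_DESCRIPTOR_TAG 0x59 /* subtitling_descriptor tag */

/* Hypothetical helper, not CCExtractor's actual parser.
 * desc points at the descriptor tag byte; len is the number of bytes available.
 * Copies the first entry's ISO-639 code into lang as a NUL-terminated string.
 * Returns 1 on success, 0 if the buffer is not a usable subtitling descriptor. */
int parse_dvb_sub_lang(const uint8_t *desc, size_t len, char lang[4])
{
    if (desc == NULL || len < 2 || desc[0] != DVB_SUBTITLE_DESCRIPTOR_TAG)
        return 0;

    size_t body_len = desc[1];

    /* One entry is 8 bytes: 3 (language) + 1 (type) + 2 (composition page)
     * + 2 (ancillary page). Reject short or truncated descriptors. */
    if (body_len < 8 || len < 2 + body_len)
        return 0;

    memcpy(lang, desc + 2, 3);
    lang[3] = '\0';
    return 1;
}
```

A 4-byte buffer matches the `char lang[4]` field added to `cap_info`: three characters plus the terminator.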

Key design decisions

Per-PID decoders in single-program mode
Each DVB subtitle PID has its own DVBSubContext with different composition_id/ancillary_id from the PMT. The existing single-decoder model was extended to always create a fresh decoder per DVB PID.

Language-tagged output filenames
update_encoder_list_cinfo() uses cinfo->lang to suffix the output filename, matching existing behaviour for multi-program mode.

Separate encoder/decoder cleanup
dinit_libraries() previously matched encoders by program number inside the decoder loop — with multiple DVB encoders sharing the same program number this caused double-free on exit. Fixed by splitting into two independent passes.
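
The two-pass idea can be sketched as follows. The list type and function names here are simplified, hypothetical stand-ins and not CCExtractor's real encoder or decoder structures. The point is that each list is walked on its own, so a node is freed exactly once even when several encoders share a program number.

```c
#include <stdlib.h>

/* Hypothetical, simplified list node used for both encoders and decoders. */
struct ctx_node {
    int program_number;
    struct ctx_node *next;
};

/* Prepend a node; returns the new head (or the old head if malloc fails). */
struct ctx_node *push_ctx(struct ctx_node *head, int program_number)
{
    struct ctx_node *n = malloc(sizeof *n);
    if (n == NULL)
        return head;
    n->program_number = program_number;
    n->next = head;
    return n;
}

/* Free every node of one list and reset the head; returns the node count. */
int free_ctx_list(struct ctx_node **head)
{
    int freed = 0;
    while (*head != NULL) {
        struct ctx_node *next = (*head)->next;
        free(*head);
        *head = next;
        freed++;
    }
    return freed;
}

/* Two independent passes: no lookup by program number, so no node can be
 * reached (and freed) twice. */
int cleanup_all(struct ctx_node **encoders, struct ctx_node **decoders)
{
    int freed = free_ctx_list(encoders);
    freed += free_ctx_list(decoders);
    return freed;
}
```

Resetting each head to NULL also makes a second cleanup call harmless.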

dec_ctx->prev zero-initialization
dec_ctx->prev was allocated with malloc but never zeroed, so free_decoder_context() freed uninitialized pointers during cleanup. Fixed with memset(prev, 0, sizeof(...)).
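
A minimal sketch of why the zeroing matters, using a hypothetical, cut-down context struct (the field names are illustrative and do not mirror the real decoder context). Once the block is zeroed, its pointer members are NULL on mainstream platforms, and `free(NULL)` is defined to do nothing.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified decoder context for illustration only. */
struct demo_dec_ctx {
    void *private_data;
    struct demo_dec_ctx *prev;
};

/* Allocate a context and zero it so every pointer inside starts as NULL. */
struct demo_dec_ctx *alloc_demo_ctx(void)
{
    struct demo_dec_ctx *ctx = malloc(sizeof *ctx);
    if (ctx == NULL)
        return NULL;
    memset(ctx, 0, sizeof *ctx); /* without this, the members hold garbage */
    return ctx;
}

/* Safe on a freshly zeroed context: free(NULL) is a no-op. */
void free_demo_ctx(struct demo_dec_ctx *ctx)
{
    if (ctx == NULL)
        return;
    free(ctx->private_data);
    free(ctx->prev);
    free(ctx);
}
```

Using `calloc` instead of `malloc` plus `memset` would have the same effect in one call.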

Contributor

@cfsmp3 cfsmp3 left a comment

Deep Review Results

First off — this is a really well-done PR. The description is excellent, the repro instructions are clear, and the code is clean. This is the quality level we want from all contributors.

What works well

  • Feature works correctly: The arte_multiaudio.ts sample now produces both arte.srt (teletext) and arte_fra.srt (French DVB) in a single pass with no flags needed.
  • No repeating subtitles: The old attempt at this feature (PRs #1912/#2048/#2051/#2058) had bugs where subtitles repeated or timestamps started at zero. None of those bugs are present here.
  • Content is byte-identical to master on all existing single-stream samples — the decoding logic is correct.
  • Cleanup fixes are good: The split encoder/decoder cleanup, the memset for dec_ctx->prev, and the transcript_settings deep-copy all fix real issues.
  • Output with -o flag works correctly.
  • Tested across 12+ samples (CEA-608, DVB, DVR-MS, ASF, MP4, TS, MPG) — zero content regressions.

Issue found: filename regression on single-DVB-stream files

We ran all 25 CI test cases locally on both master and this PR. On 3 tests, the PR changes the output filename by adding a language suffix (_eng) even when there's only a single DVB stream:

| Test | Master filename | PR filename | Content |
| --- | --- | --- | --- |
| 1020459a86 --autoprogram --out=ttxt | output.out | output_eng.txt | Byte-identical |
| 85271be4d2 --autoprogram --out=srt --quant 0 | output.out | output_eng.srt | Byte-identical |
| 85271be4d2 --codec dvbsub --out=spupng | output.out + output.d/ | output_eng.xml + output_eng.d/ | Byte-identical (all 28 PNGs) |

The content is correct — only the filename changes. But this breaks backward compatibility for existing users/scripts that expect the original filename.

Fix: Only add the language suffix when the program has 2 or more DVB subtitle PIDs. Single-DVB-stream recordings should keep the original filename.
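
One way to express that rule is sketched below. `build_output_name` and its parameters are hypothetical names chosen for illustration, not functions from the PR. The helper appends `_<lang>` only when the program carries at least two DVB subtitle PIDs and a language code is known.

```c
#include <stdio.h>

/* Hypothetical helper, for illustration only.
 * Writes "<base>_<lang>.<ext>" when dvb_pid_count >= 2 and lang is non-empty,
 * otherwise "<base>.<ext>". Returns snprintf's result: the length the full
 * name needs, excluding the terminating NUL. */
int build_output_name(char *out, size_t out_size,
                      const char *base, const char *ext,
                      const char *lang, int dvb_pid_count)
{
    if (dvb_pid_count >= 2 && lang != NULL && lang[0] != '\0')
        return snprintf(out, out_size, "%s_%s.%s", base, lang, ext);

    /* Single-stream recordings keep the original, unsuffixed filename. */
    return snprintf(out, out_size, "%s.%s", base, ext);
}
```

With base `output`, extension `srt` and language `eng`, a count of 1 gives `output.srt`, while a count of 2 gives `output_eng.srt`.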

Also needed

  • Add a CHANGES.TXT entry — this is a user-facing feature.

Everything else looks good. Once the filename issue is fixed, this is ready to merge.

@ujjwalr27 force-pushed the feature/multi-dvb-subtitle-extraction branch from 63638f0 to a4d10b9 (April 7, 2026 05:42)
@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 5fdd9b8...:

| Report Name | Tests Passed |
| --- | --- |
| Broken | 10/13 |
| CEA-708 | 1/14 |
| DVB | 3/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 25/27 |
| Hardsubx | 1/1 |
| Hauppage | 3/3 |
| MP4 | 3/3 |
| NoCC | 10/10 |
| Options | 79/86 |
| Teletext | 20/21 |
| WTV | 13/13 |
| XDS | 34/34 |

Your PR breaks these cases:

  • ccextractor --autoprogram --out=srt --latin1 f1422b8bfe...
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b..., Last passed: Never
  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never
  • ccextractor --out=srt --latin1 --autoprogram 56c9f34548..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla 5d3a29f9f8..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 01509e4d27..., Last passed: Never
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
  • ccextractor --out=spupng c83f765c66..., Last passed: Never
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --autoprogram --out=srt --latin1 4e56e88ba4..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 c0d2fba8c0..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 27d7a43dd6..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 e2e2b501e0..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e..., Last passed: Never
  • ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 5fdd9b8...:

| Report Name | Tests Passed |
| --- | --- |
| Broken | 10/13 |
| CEA-708 | 1/14 |
| DVB | 2/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 25/27 |
| Hardsubx | 1/1 |
| Hauppage | 3/3 |
| MP4 | 3/3 |
| NoCC | 10/10 |
| Options | 75/86 |
| Teletext | 20/21 |
| WTV | 13/13 |
| XDS | 34/34 |

Your PR breaks these cases:

  • ccextractor --autoprogram --out=srt --latin1 f1422b8bfe...
  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
  • ccextractor --out=spupng c83f765c66...
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc..., Last passed: Never
  • ccextractor --autoprogram --out=srt --latin1 b22260d065..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 01509e4d27..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e..., Last passed: Never
  • ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@cfsmp3
Contributor

cfsmp3 commented Apr 8, 2026

Re-test Results (Apr 7)

The filename regression is fixed — all 25 CI tests produce byte-identical output to master. Nice work on the dvb_pid_count >= 2 guard.

We also tested on all multi-DVB samples we have locally. Results are mixed:

Multi-DVB feature testing

| Sample | DVB streams | Master | PR | Result |
| --- | --- | --- | --- | --- |
| arte (26MB) | deu + fra | 1 file (teletext only) | 2 files (teletext + _fra.srt) | WORKS |
| 36d5eca53c56 (14MB) | dan + dan (hearing impaired) | No output | _dan.srt (802 bytes) | WORKS (extracts DVB that master couldn't) |
| 04e47919de59 (11GB) | CHI + ENG + CHS (PIDs 0x50, 0x51, 0x52) | 1 file (62KB) | 1 file (62KB) | FAILS (no separate language files) |

The feature works on arte and 36d5 but fails on the 3-language Chinese/English sample. This sample has 3 DVB subtitle streams clearly visible to ffprobe:

Stream #0:3[0x50](CHI): Subtitle: dvb_subtitle
Stream #0:4[0x51](ENG): Subtitle: dvb_subtitle
Stream #0:5[0x52](CHS): Subtitle: dvb_subtitle

But ccextractor doesn't produce _chi.srt, _eng.srt, _chs.srt — it only produces a single file with the OCR output from one stream (same as master).

The most likely cause is that ccextractor's PMT parser isn't detecting all 3 DVB subtitle PIDs in this sample. The mprint("DVB subtitle PID %u language: %s\n") message doesn't appear in the output for this file (we checked), which means the CCX_MPEG_DSC_DVB_SUBTITLE (0x59) descriptor parsing path in ts_tables.c isn't being reached for all 3 streams.
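
To check that hypothesis independently of ccextractor, a small diagnostic along these lines could walk the elementary-stream loop of a PMT section and report which PIDs carry a 0x59 descriptor. This is a hypothetical sketch, not code from the PR: `find_dvb_subtitle_pids` is an invented name, and it assumes the caller has already located the ES loop (after the program_info descriptors, excluding the CRC32). The entry layout follows ISO/IEC 13818-1: stream_type, 13-bit elementary_PID, 12-bit ES_info_length, then descriptors.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical diagnostic, for illustration only.
 * Stores up to max_pids matching PIDs in pids[] and returns how many
 * elementary streams carry a subtitling descriptor (tag 0x59). The return
 * value can exceed max_pids; only the first max_pids PIDs are stored. */
int find_dvb_subtitle_pids(const uint8_t *es_loop, size_t len,
                           uint16_t *pids, int max_pids)
{
    int found = 0;
    size_t i = 0;

    while (i + 5 <= len) {
        uint16_t pid = (uint16_t)(((es_loop[i + 1] & 0x1F) << 8) | es_loop[i + 2]);
        size_t es_info_len = (size_t)(((es_loop[i + 3] & 0x0F) << 8) | es_loop[i + 4]);
        size_t d = i + 5;
        size_t end = d + es_info_len;

        if (end > len)
            break; /* truncated section: stop rather than read past the buffer */

        /* Walk this stream's descriptors: tag (1 byte), length (1 byte), body. */
        while (d + 2 <= end) {
            uint8_t tag = es_loop[d];
            size_t dlen = es_loop[d + 1];
            if (d + 2 + dlen > end)
                break; /* malformed descriptor */
            if (tag == 0x59) {
                if (found < max_pids)
                    pids[found] = pid;
                found++;
                break; /* one match per stream is enough */
            }
            d += 2 + dlen;
        }
        i = end;
    }
    return found;
}
```

If such a check reported PIDs 0x50, 0x51 and 0x52 for the failing sample, the PMT itself would be fine and the problem would lie later in the pipeline. If it reported fewer, the descriptors are likely missing or placed somewhere this loop does not look.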

What we need

Please investigate why 04e47919de59 doesn't work. You can get it from our failed samples directory, or I can provide the hash:

04e47919de5908edfa1fddc522a811d56bc67a1d4020f8b3972709e25b15966c.ts

If the issue is fundamental (e.g., the sample uses a non-standard PMT structure), document it as a known limitation. If it's fixable, please fix it.

Everything else looks good

  • 25 CI tests: all identical to master
  • 13 single-DVB samples: all identical to master
  • CHANGES.TXT: present
  • Filename regression: fixed

Development

Successfully merging this pull request may close these issues.

multi lang, each cc into a new file

3 participants