feat: auto-extract multi-language DVB subtitles into per-language files (#447) by ujjwalr27 · Pull Request #2243 · CCExtractor/ccextractor

ujjwalr27 · 2026-03-30T09:37:45Z

[FEATURE] Auto-extract multi-language DVB subtitles into per-language files

Closes #447

In raising this pull request, I confirm the following (please check boxes):

Reason for this PR:

This PR adds new functionality.
This PR fixes a bug that I have personally experienced or that a real user has reported and for which a sample exists.
This PR is porting code from C to Rust.

Sanity check:

I have read and understood the contributors guide.
I have checked that another pull request for this purpose does not exist.
If the PR adds new functionality, I've added it to the changelog. If it's just a bug fix, I have NOT added it to the changelog.
I am NOT adding new C code unless it's to fix an existing, reproducible bug.

⚠️ This PR adds new C code for a feature requested in #447 by a real user, with a provided sample file.

Description

Implements #447 — when a DVB/TS recording contains multiple DVB subtitle streams, CCExtractor now automatically detects each stream and writes subtitles to separate files named by ISO-639 language code. No manual configuration or pre-inspection of the file is required.

Before:

ccextractor arte_multiaudio.ts
# → arte_multiaudio.srt   (only first/primary stream extracted)
# French DVB subtitle stream silently ignored

After:

ccextractor arte_multiaudio.ts
# → arte_multiaudio.srt        (teletext / primary stream)
# → arte_multiaudio_fra.srt    (French DVB subtitles, auto-detected)

No new CLI flags. Fully automatic. Single-stream recordings are unaffected.

Repro Instructions

Test 1 — `arte_multiaudio.ts` (from issue #447)

Download: https://www.dropbox.com/s/5oaqnjgqq1cqzky/arte_multiaudio.ts?dl=0

The file contains:

| PID | Type | Language |
|---|---|---|
| 0x103 (259) | DVB Teletext | `deu` |
| 0x104 (260) | DVB Subtitle | `deu` (no bitmap packets in this recording) |
| 0x106 (262) | DVB Subtitle | `fra` |

Before this PR (on master):

./ccextractor arte_multiaudio.ts
# Only produces arte_multiaudio.srt (teletext)
# French DVB subtitle stream is silently ignored

After this PR:

./ccextractor arte_multiaudio.ts

DVB subtitle PID 260 language: deu
DVB subtitle PID 262 language: fra
...
-rw-r--r-- 4106 arte_multiaudio.srt       <- Teletext subtitles
-rw-r--r-- 3924 arte_multiaudio_fra.srt   <- DVB bitmap subtitles (fra, newly extracted)
Exit code: 0

Also verified with --codec dvbsub:

./ccextractor arte_multiaudio.ts --codec dvbsub
# → arte_multiaudio_fra.srt  (3924 bytes)
# Exit code: 0

Test 2 — DVB-only file with two subtitle streams (`deu` + `fra`)

A recording with no teletext, only two DVB subtitle PIDs:

| PID | Type | Language |
|---|---|---|
| index 2 | DVB Subtitle | `deu` |
| index 3 | DVB Subtitle | `fra` |

./ccextractor test_two_dvb.ts

DVB subtitle PID ... language: deu
DVB subtitle PID ... language: fra
...
-rw-r--r-- 3924 test_two_dvb_deu.srt   <- German-tagged DVB subtitles
-rw-r--r-- 3924 test_two_dvb_fra.srt   <- French-tagged DVB subtitles
Exit code: 0

Both files are produced automatically in a single pass, with no flags or prior knowledge of how many subtitle streams exist.

Implementation

Files changed

| File | Change |
|---|---|
| `src/lib_ccx/ccx_demuxer.h` | Add `char lang[4]` to `cap_info` struct |
| `src/lib_ccx/ts_tables.c` | Parse ISO-639 code from DVB subtitle descriptor in PMT |
| `src/lib_ccx/ts_info.c` | Propagate `lang` in `update_capinfo()`; protect DVB streams from `ignore_other_stream()` |
| `src/lib_ccx/lib_ccx.c` | Per-PID encoder/decoder routing; fix two segfaults in cleanup |
| `src/lib_ccx/general_loop.c` | Secondary loop to process all non-primary DVB subtitle PIDs |
| `src/rust/src/demuxer/common_types.rs` | Add `lang: [i8; 4]` to `CapInfo` |
| `src/rust/src/ctorust.rs` | Propagate `lang` in `FromCType<cap_info>` |
| `src/rust/src/common.rs` | Propagate `lang` in `CType<cap_info>` |
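To make the `ts_tables.c` change concrete, here is a minimal sketch of reading the language code out of a DVB subtitling descriptor (tag 0x59, laid out as in ETSI EN 300 468). The function name `parse_dvb_sub_lang` and its signature are assumptions made for illustration; this is not the code from the PR.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DVB_SUBTITLING_DESCRIPTOR_TAG 0x59

/* Hypothetical helper (not the PR's code): copy the ISO-639 code of the
 * first entry of a subtitling descriptor into lang, NUL-terminated.
 * desc points at the descriptor_tag byte; len is the number of bytes available.
 * Returns 1 on success, 0 if the descriptor is absent or truncated. */
static int parse_dvb_sub_lang(const uint8_t *desc, size_t len, char lang[4])
{
    lang[0] = '\0';
    if (desc == NULL || len < 2 || desc[0] != DVB_SUBTITLING_DESCRIPTOR_TAG)
        return 0;

    size_t body_len = desc[1];
    /* Each entry is 8 bytes: 3 bytes language, 1 byte subtitling_type,
     * 2 bytes composition_page_id, 2 bytes ancillary_page_id. */
    if (body_len < 8 || len < 2 + body_len)
        return 0;

    memcpy(lang, desc + 2, 3);
    lang[3] = '\0';
    return 1;
}
```

The four-byte buffer matches the `char lang[4]` field added to `cap_info`: three characters plus the terminator.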

Key design decisions

Per-PID decoders in single-program mode
Each DVB subtitle PID has its own DVBSubContext with different composition_id/ancillary_id from the PMT. The existing single-decoder model was extended to always create a fresh decoder per DVB PID.

Language-tagged output filenames
update_encoder_list_cinfo() uses cinfo->lang to suffix the output filename, matching existing behaviour for multi-program mode.
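As an illustration of the naming scheme only, the suffixing could look roughly like the sketch below. `build_lang_filename` is a hypothetical helper, not the body of `update_encoder_list_cinfo()`, and the base name, language, and extension are passed in as plain strings for simplicity.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write "<base>_<lang><ext>" into out, or "<base><ext>"
 * when lang is NULL or empty. Returns 0 on success, -1 if out is too small. */
static int build_lang_filename(char *out, size_t out_size, const char *base,
                               const char *lang, const char *ext)
{
    int n;

    if (lang != NULL && lang[0] != '\0')
        n = snprintf(out, out_size, "%s_%s%s", base, lang, ext);
    else
        n = snprintf(out, out_size, "%s%s", base, ext);

    /* snprintf returns the length it wanted to write; treat truncation as failure. */
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Called with `"arte_multiaudio"`, `"fra"`, and `".srt"`, this yields `arte_multiaudio_fra.srt`, the filename shown in Test 1.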

Separate encoder/decoder cleanup
dinit_libraries() previously matched encoders by program number inside the decoder loop. With multiple DVB encoders sharing the same program number, this caused a double free on exit. Fixed by splitting the cleanup into two independent passes, one over decoders and one over encoders.
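The two-pass idea can be sketched with simplified linked lists. The `encoder` and `decoder` structs and every function name below are stand-ins invented for this example; the real contexts in `lib_ccx.c` are more involved.

```c
#include <stdlib.h>

/* Simplified, hypothetical stand-ins for the real encoder/decoder contexts. */
struct encoder { int program_number; struct encoder *next; };
struct decoder { int program_number; struct decoder *next; };

/* Prepend a node to the list; returns 0 on success, -1 on allocation failure. */
static int push_encoder(struct encoder **head, int program_number)
{
    struct encoder *e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->program_number = program_number;
    e->next = *head;
    *head = e;
    return 0;
}

static int push_decoder(struct decoder **head, int program_number)
{
    struct decoder *d = malloc(sizeof(*d));
    if (d == NULL)
        return -1;
    d->program_number = program_number;
    d->next = *head;
    *head = d;
    return 0;
}

/* Pass 1: free every decoder. Never looks at encoders. Returns the count freed. */
static int free_all_decoders(struct decoder **head)
{
    int freed = 0;
    while (*head != NULL) {
        struct decoder *next = (*head)->next;
        free(*head);
        *head = next;
        freed++;
    }
    return freed;
}

/* Pass 2: free every encoder exactly once, regardless of program number. */
static int free_all_encoders(struct encoder **head)
{
    int freed = 0;
    while (*head != NULL) {
        struct encoder *next = (*head)->next;
        free(*head);
        *head = next;
        freed++;
    }
    return freed;
}
```

Because neither pass looks anything up by program number, two encoders that share a program number are each visited and freed once.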

dec_ctx->prev zero-initialization
dec_ctx->prev was allocated with malloc but never zeroed, so free_decoder_context() freed garbage pointers during cleanup. Fixed with memset(prev, 0, sizeof(...)).
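The failure mode and the fix can be shown on a reduced example. `struct prev_state` and both functions are hypothetical; they only mimic a context whose cleanup frees pointer members.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the structure behind dec_ctx->prev. */
struct prev_state {
    char *buffer;
    void *private_data;
};

/* Allocate and zero the structure so every pointer member starts out NULL. */
static struct prev_state *alloc_prev_state(void)
{
    struct prev_state *prev = malloc(sizeof(*prev));
    if (prev == NULL)
        return NULL;
    memset(prev, 0, sizeof(*prev)); /* without this, members hold garbage */
    return prev;
}

/* Cleanup is safe on a freshly zeroed structure because free(NULL) is a no-op. */
static void free_prev_state(struct prev_state *prev)
{
    if (prev == NULL)
        return;
    free(prev->buffer);
    free(prev->private_data);
    free(prev);
}
```

Using `calloc(1, sizeof(*prev))` would have the same effect in a single call; the sketch keeps `malloc` plus `memset` because that is the form the PR description names.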

cfsmp3

Deep Review Results

First off — this is a really well-done PR. The description is excellent, the repro instructions are clear, and the code is clean. This is the quality level we want from all contributors.

What works well

Feature works correctly: The arte_multiaudio.ts sample now produces both arte.srt (teletext) and arte_fra.srt (French DVB) in a single pass with no flags needed.
No repeating subtitles: The old attempt at this feature (PRs #1912/#2048/#2051/#2058) had bugs where subtitles repeated or timestamps started at zero. None of those bugs are present here.
Content is byte-identical to master on all existing single-stream samples — the decoding logic is correct.
Cleanup fixes are good: The split encoder/decoder cleanup, the memset for dec_ctx->prev, and the transcript_settings deep-copy all fix real issues.
Output with -o flag works correctly.
Tested across 12+ samples (CEA-608, DVB, DVR-MS, ASF, MP4, TS, MPG) — zero content regressions.

Issue found: filename regression on single-DVB-stream files

We ran all 25 CI test cases locally on both master and this PR. On 3 tests, the PR changes the output filename by adding a language suffix (_eng) even when there's only a single DVB stream:

| Test | Master filename | PR filename | Content |
|---|---|---|---|
| 1020459a86 `--autoprogram --out=ttxt` | `output.out` | `output_eng.txt` | Byte-identical |
| 85271be4d2 `--autoprogram --out=srt --quant 0` | `output.out` | `output_eng.srt` | Byte-identical |
| 85271be4d2 `--codec dvbsub --out=spupng` | `output.out` + `output.d/` | `output_eng.xml` + `output_eng.d/` | Byte-identical (all 28 PNGs) |

The content is correct — only the filename changes. But this breaks backward compatibility for existing users/scripts that expect the original filename.

Fix: Only add the language suffix when the program has 2 or more DVB subtitle PIDs. Single-DVB-stream recordings should keep the original filename.
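A minimal sketch of such a guard follows, assuming a simplified per-stream record. `struct stream_info`, `count_dvb_pids`, and `needs_lang_suffix` are hypothetical names for illustration, not identifiers from the codebase.

```c
#include <stddef.h>

enum stream_kind { STREAM_TELETEXT, STREAM_DVB_SUBTITLE, STREAM_OTHER };

/* Hypothetical, reduced stand-in for cap_info. */
struct stream_info {
    int pid;
    int program_number;
    enum stream_kind kind;
    char lang[4];
};

/* Count the DVB subtitle PIDs that belong to one program. */
static size_t count_dvb_pids(const struct stream_info *streams, size_t n,
                             int program_number)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (streams[i].kind == STREAM_DVB_SUBTITLE &&
            streams[i].program_number == program_number)
            count++;
    }
    return count;
}

/* Suffix only when the target is a language-tagged DVB subtitle stream and
 * its program carries at least two DVB subtitle PIDs. */
static int needs_lang_suffix(const struct stream_info *streams, size_t n,
                             const struct stream_info *target)
{
    return target->kind == STREAM_DVB_SUBTITLE &&
           target->lang[0] != '\0' &&
           count_dvb_pids(streams, n, target->program_number) >= 2;
}
```

With the three streams listed for `arte_multiaudio.ts` the DVB count is 2, so the `fra` stream gets a suffix, while a recording with a single DVB subtitle PID keeps its original filename.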

Also needed

Add a CHANGES.TXT entry — this is a user-facing feature.

Everything else looks good. Once the filename issue is fixed, this is ready to merge.

ccextractor-bot · 2026-04-07T08:27:25Z

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 5fdd9b8...:

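| Report Name | Tests Passed |
|---|---|
| Broken | 10/13 |
| CEA-708 | 1/14 |
| DVB | 3/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 25/27 |
| Hardsubx | 1/1 |
| Hauppage | 3/3 |
| MP4 | 3/3 |
| NoCC | 10/10 |
| Options | 79/86 |
| Teletext | 20/21 |
| WTV | 13/13 |
| XDS | 34/34 |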

Your PR breaks these cases:

ccextractor --autoprogram --out=srt --latin1 f1422b8bfe...
ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...

NOTE: The following tests have been failing on the master branch as well as the PR:

ccextractor --out=srt --latin1 --autoprogram 73d9313d64..., Last passed: Test 8738
ccextractor --out=ttxt --latin1 001dd8cdf7..., Last passed: Test 8738
ccextractor --out=srt --latin1 4d4e938ef6..., Last passed: Test 8738
ccextractor --service 1 --out=txt --no-bom --no-rollup ea83ff7bcb..., Last passed: Test 8738
ccextractor --service 1 --out=txt f17524b53f..., Last passed: Test 8738
ccextractor --service 1 --out=txt 80848c45f8..., Last passed: Test 8738
ccextractor --service 1 --out=txt --no-bom --no-rollup b5d6aad89f..., Last passed: Test 8738
ccextractor --service 1[EUC-KR] --out=txt --no-rollup b5d6aad89f..., Last passed: Test 8738
ccextractor --service 1 --out=srt da904de35d..., Last passed: Test 8738
ccextractor --service 1 --out=sami da904de35d..., Last passed: Test 8738
ccextractor --service 1 --out=ttxt da904de35d..., Last passed: Test 8926
ccextractor --service 1[EUC-KR] b5d6aad89f..., Last passed: Test 8738
ccextractor --service 1[EUC-KR] --no-rollup b5d6aad89f..., Last passed: Test 8738
ccextractor --service all da904de35d..., Last passed: Test 8738
ccextractor --service all[EUC-KR] b5d6aad89f..., Last passed: Test 8738
ccextractor --service 1,2[UTF-8],3[EUC-KR],54 --out=txt da904de35d..., Last passed: Test 8738
ccextractor --autoprogram --out=srt --latin1 d41b53b504..., Last passed: Test 8738
ccextractor --stdout --quiet --no-fontcolor 79a51f3500..., Last passed: Test 8738
ccextractor --stdout --quiet --no-fontcolor 767b546f96..., Last passed: Test 8738
ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9..., Last passed: Test 9176
ccextractor --autoprogram --out=srt --latin1 b22260d065..., Last passed: Test 9176
ccextractor --service 1 c83f765c66..., Last passed: Test 8738
ccextractor --myth c83f765c66..., Last passed: Test 8738
ccextractor --in=raw fb79021542..., Last passed: Test 8738
ccextractor --mp4vidtrack 5df914ce77..., Last passed: Test 8738
ccextractor --xmltv=3 --out=null 96efd279cf..., Last passed: Test 8738
ccextractor --datapid 2310 --autoprogram --out=srt --latin1 e639e54550..., Last passed: Test 8738

Congratulations: Merging this PR would fix the following tests:

ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b..., Last passed: Never
ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never
ccextractor --out=srt --latin1 --autoprogram 56c9f34548..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 --ucla 5d3a29f9f8..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 01509e4d27..., Last passed: Never
ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
ccextractor --out=spupng c83f765c66..., Last passed: Never
ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --autoprogram --out=srt --latin1 4e56e88ba4..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 c0d2fba8c0..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 27d7a43dd6..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 e2e2b501e0..., Last passed: Never
ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e..., Last passed: Never
ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

ccextractor-bot · 2026-04-07T08:52:31Z

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 5fdd9b8...:

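| Report Name | Tests Passed |
|---|---|
| Broken | 10/13 |
| CEA-708 | 1/14 |
| DVB | 2/7 |
| DVD | 3/3 |
| DVR-MS | 2/2 |
| General | 25/27 |
| Hardsubx | 1/1 |
| Hauppage | 3/3 |
| MP4 | 3/3 |
| NoCC | 10/10 |
| Options | 75/86 |
| Teletext | 20/21 |
| WTV | 13/13 |
| XDS | 34/34 |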

Your PR breaks these cases:

ccextractor --autoprogram --out=srt --latin1 f1422b8bfe...
ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
ccextractor --out=spupng c83f765c66...
ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...

NOTE: The following tests have been failing on the master branch as well as the PR:

ccextractor --out=srt --latin1 --autoprogram 73d9313d64..., Last passed: Test 8611
ccextractor --out=ttxt --latin1 001dd8cdf7..., Last passed: Test 8611
ccextractor --out=srt --latin1 4d4e938ef6..., Last passed: Test 8611
ccextractor --service 1 --out=txt --no-bom --no-rollup ea83ff7bcb..., Last passed: Test 8611
ccextractor --service 1 --out=txt f17524b53f..., Last passed: Test 8611
ccextractor --service 1 --out=txt 80848c45f8..., Last passed: Test 8611
ccextractor --service 1 --out=txt --no-bom --no-rollup b5d6aad89f..., Last passed: Test 8611
ccextractor --service 1[EUC-KR] --out=txt --no-rollup b5d6aad89f..., Last passed: Test 8611
ccextractor --service 1 --out=srt da904de35d..., Last passed: Test 8611
ccextractor --service 1 --out=sami da904de35d..., Last passed: Test 8611
ccextractor --service 1 --out=ttxt da904de35d..., Last passed: Test 8943
ccextractor --service 1[EUC-KR] b5d6aad89f..., Last passed: Test 8611
ccextractor --service 1[EUC-KR] --no-rollup b5d6aad89f..., Last passed: Test 8611
ccextractor --service all da904de35d..., Last passed: Test 8611
ccextractor --service all[EUC-KR] b5d6aad89f..., Last passed: Test 8611
ccextractor --service 1,2[UTF-8],3[EUC-KR],54 --out=txt da904de35d..., Last passed: Test 8611
ccextractor --autoprogram --out=srt --latin1 d41b53b504..., Last passed: Test 8611
ccextractor --stdout --quiet --no-fontcolor 79a51f3500..., Last passed: Test 8611
ccextractor --stdout --quiet --no-fontcolor 767b546f96..., Last passed: Test 8611
ccextractor --service 1 c83f765c66..., Last passed: Test 8611
ccextractor --myth c83f765c66..., Last passed: Test 8611
ccextractor --in=raw fb79021542..., Last passed: Test 8611
ccextractor --mp4vidtrack 5df914ce77..., Last passed: Test 8611
ccextractor --xmltv=3 --out=null 96efd279cf..., Last passed: Test 8611
ccextractor --datapid 2310 --autoprogram --out=srt --latin1 e639e54550..., Last passed: Test 8611

Congratulations: Merging this PR would fix the following tests:

ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc..., Last passed: Never
ccextractor --autoprogram --out=srt --latin1 b22260d065..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 01509e4d27..., Last passed: Never
ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e..., Last passed: Never
ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

cfsmp3 · 2026-04-08T01:19:42Z

Re-test Results (Apr 7)

The filename regression is fixed — all 25 CI tests produce byte-identical output to master. Nice work on the dvb_pid_count >= 2 guard.

We also tested on all multi-DVB samples we have locally. Results are mixed:

Multi-DVB feature testing

| Sample | DVB streams | Master | PR | Result |
|---|---|---|---|---|
| arte (26MB) | deu + fra | 1 file (teletext only) | 2 files (teletext + `_fra.srt`) | WORKS |
| 36d5eca53c56 (14MB) | dan + dan (hearing impaired) | No output | `_dan.srt` (802 bytes) | WORKS (extracts DVB that master couldn't) |
| 04e47919de59 (11GB) | CHI + ENG + CHS (PIDs 0x50, 0x51, 0x52) | 1 file (62KB) | 1 file (62KB) | FAILS (no separate language files) |

The feature works on arte and 36d5 but fails on the 3-language Chinese/English sample. This sample has 3 DVB subtitle streams clearly visible to ffprobe:

Stream #0:3[0x50](CHI): Subtitle: dvb_subtitle
Stream #0:4[0x51](ENG): Subtitle: dvb_subtitle
Stream #0:5[0x52](CHS): Subtitle: dvb_subtitle

But ccextractor doesn't produce _chi.srt, _eng.srt, _chs.srt — it only produces a single file with the OCR output from one stream (same as master).

The most likely cause is that ccextractor's PMT parser isn't detecting all 3 DVB subtitle PIDs in this sample. The mprint("DVB subtitle PID %u language: %s\n") message doesn't appear in the output for this file (we checked), which means the CCX_MPEG_DSC_DVB_SUBTITLE (0x59) descriptor parsing path in ts_tables.c isn't being reached for all 3 streams.
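For anyone investigating, the sketch below shows one way to walk the elementary-stream loop of a PMT section and list the PIDs that carry a subtitling descriptor, following the MPEG-2 systems layout (stream_type, 13-bit PID, 12-bit ES_info_length, then descriptors). `find_dvb_sub_pids` is a hypothetical standalone helper, not the parser in `ts_tables.c`, and it assumes DVB subtitles are signalled as stream_type 0x06 (private PES data) with a 0x59 descriptor.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical diagnostic helper: scan the ES loop of a PMT (es points at the
 * first stream_type byte, len is the loop length in bytes) and store up to
 * max_pids PIDs that are private PES data (0x06) with a subtitling
 * descriptor (0x59). Returns the number of PIDs stored. */
static size_t find_dvb_sub_pids(const uint8_t *es, size_t len,
                                uint16_t *pids, size_t max_pids)
{
    size_t found = 0;
    size_t pos = 0;

    while (pos + 5 <= len) {
        uint8_t stream_type = es[pos];
        uint16_t pid = (uint16_t)(((es[pos + 1] & 0x1F) << 8) | es[pos + 2]);
        size_t info_len = (size_t)(((es[pos + 3] & 0x0F) << 8) | es[pos + 4]);

        pos += 5;
        if (pos + info_len > len)
            break; /* truncated ES_info: stop rather than read out of bounds */

        size_t end = pos + info_len;
        size_t d = pos;
        while (d + 2 <= end) {
            uint8_t tag = es[d];
            size_t dlen = es[d + 1];
            if (d + 2 + dlen > end)
                break; /* truncated descriptor */
            if (stream_type == 0x06 && tag == 0x59 && found < max_pids) {
                pids[found++] = pid;
                break; /* one hit per elementary stream is enough */
            }
            d += 2 + dlen;
        }
        pos = end;
    }
    return found;
}
```

Comparing the PIDs reported by a walk like this against ffprobe's 0x50, 0x51, and 0x52 would show whether the descriptors are present in the PMT at all or are being skipped somewhere later.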

What we need

Please investigate why 04e47919de59 doesn't work. You can get it from our failed samples directory, or I can provide the hash:

04e47919de5908edfa1fddc522a811d56bc67a1d4020f8b3972709e25b15966c.ts

If the issue is fundamental (e.g., the sample uses a non-standard PMT structure), document it as a known limitation. If it's fixable, please fix it.

Everything else looks good

25 CI tests: all identical to master
13 single-DVB samples: all identical to master
CHANGES.TXT: present
Filename regression: fixed

cfsmp3 requested changes Apr 4, 2026

ujjwalr27 added 4 commits April 7, 2026 14:41

feat: auto-extract multi-language DVB subtitles into per-language files

0260612

style: apply clang-format to multi-DVB subtitle extraction changes

a26cce4

fix: only add lang suffix when 2+ DVB PID

510381b

fix segfault from uninitialized dvb_lang

a4d10b9

ujjwalr27 force-pushed the feature/multi-dvb-subtitle-extraction branch from 63638f0 to a4d10b9 on April 7, 2026 05:42

ujjwalr27 added 2 commits April 7, 2026 14:54

docs: add CHANGES.TXT entry for multi-language DVB subtitle extraction

4dd2c83

fix: skip non-DVB encoders in lookup

00a717a

ujjwalr27 wants to merge 6 commits into CCExtractor:master from ujjwalr27:feature/multi-dvb-subtitle-extraction
