Streaming is the low-level tokenizer interface. It yields events as HTML is tokenized, without building a DOM tree. This is event streaming, not chunked file reading. You still pass the full HTML string (or byte buffer) in memory.
Pros:
- Lower memory use (no DOM)
- Can stop early once you find what you need
- Good for scans, counters, and simple extraction
Cons:
- No CSS selectors
- Manual state tracking
- No tree-builder fixes (implicit end tags are not inserted)
If you need selectors, DOM traversal, or spec-correct tree fixes, use the full parser (new JustHTML(...)).
use JustHTML\Stream;
$html = '<p class="lead">Hello <em>world</em>!</p>';
foreach (Stream::stream($html) as [$event, $data]) {
    echo $event . ' -> ' . json_encode($data) . "\n";
}

Each yielded value is a two-item array: [$event, $data].
- start: [$tagName, $attrs], where $tagName is a lowercase tag name (string) and $attrs is an associative array of attribute name to value (string), or null for boolean attrs
- end: $tagName
- text: the text content (string); adjacent text chunks are merged
- comment: the comment body (string, without <!-- -->)
- doctype: [$name, $publicId, $systemId] (strings or null)
Notes:
- Void elements like img, br, and meta emit only a start event (no end event).
- Character references are decoded in text and attribute values.
- The stream does not expose parse errors; it is raw tokenizer output, not the tree builder.
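To make the event shapes concrete, here is a small dispatcher over a hand-written event list that mirrors the worked example below. The events are hard-coded stand-ins for what Stream::stream would yield, so this sketch runs without the library:

```php
<?php
// Hypothetical event list matching the documented shapes; in real code
// these tuples would come from Stream::stream($html).
$events = [
    ['doctype', ['html', null, null]],
    ['start',   ['p', ['class' => 'lead']]],
    ['text',    'Hello '],
    ['start',   ['em', []]],
    ['text',    'world'],
    ['end',     'em'],
    ['comment', ' note '],
    ['text',    '!'],
    ['end',     'p'],
];

$openTags = [];   // count of start events per tag name
$textLength = 0;  // total characters of text content

foreach ($events as [$event, $data]) {
    match ($event) {
        // start carries [$tagName, $attrs]
        'start' => $openTags[$data[0]] = ($openTags[$data[0]] ?? 0) + 1,
        // text carries a plain string
        'text'  => $textLength += strlen($data),
        // end, comment, and doctype are ignored in this sketch
        default => null,
    };
}

// $openTags is ['p' => 1, 'em' => 1]; $textLength is 12.
```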
Input:
<!doctype html>
<p class="lead">Hello <em>world</em><!-- note -->!</p>

Output (events shown as event -> data):
doctype -> ["html",null,null]
start -> ["p",{"class":"lead"}]
text -> "Hello "
start -> ["em",[]]
text -> "world"
end -> "em"
comment -> " note "
text -> "!"
end -> "p"
As a worked comparison, here is lead paragraph extraction (take the first paragraph with non-whitespace text) using the full parser:
use JustHTML\JustHTML;
$doc = new JustHTML($html);
$leadText = '';
foreach ($doc->query('#mw-content-text p') as $p) {
    $text = $p->toText();
    if ($text !== '') {
        $leadText = $text;
        break;
    }
}

A streaming version that matches the toText() defaults (trim text nodes and join with a space); here, "non-empty" means at least one non-whitespace text segment:
use JustHTML\Stream;
function append_text_segment(array &$segments, string $data, bool &$hasText): void
{
    $text = trim($data);
    if ($text === '') {
        return;
    }
    $segments[] = $text;
    $hasText = true;
}
// Void elements emit only a start event, so exclude them from depth tracking.
$voidTags = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'];
$inContainer = false;
$containerDepth = 0;
$capturing = false;
$hasText = false;
$segments = [];
$separator = ' ';
foreach (Stream::stream($html) as [$event, $data]) {
    if ($event === 'start') {
        [$tag, $attrs] = $data;
        $tag = (string)$tag;
        $attrs = is_array($attrs) ? $attrs : [];
        if (!$inContainer) {
            if ($tag === 'div' && (($attrs['id'] ?? '') === 'mw-content-text')) {
                $inContainer = true;
                $containerDepth = 1;
            }
            continue;
        }
        if (!in_array($tag, $voidTags, true)) {
            $containerDepth += 1;
        }
        if ($tag === 'p') {
            if ($capturing && $hasText) {
                break;
            }
            $capturing = true;
            $hasText = false;
            $segments = [];
        }
        continue;
    }
    if ($event === 'end') {
        $tag = (string)$data;
        if ($inContainer) {
            $containerDepth -= 1;
            if ($containerDepth <= 0) {
                $inContainer = false;
            }
        }
        if ($capturing && $tag === 'p') {
            if ($hasText) {
                break;
            }
            $capturing = false;
            $hasText = false;
            $segments = [];
        }
        continue;
    }
    if ($event === 'text' && $capturing) {
        append_text_segment($segments, $data, $hasText);
    }
}
$text = $segments ? implode($separator, $segments) : '';

This includes a simple heuristic that treats a new <p> start tag as the end of the previous paragraph,
since streaming does not insert implicit end tags. If your HTML relies heavily on implicit tag rules,
you may need extra logic.
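For instance, <p>one<p>two</p> tokenizes with no end event for the first p. This sketch applies the same heuristic to a hand-written event list (hypothetical stand-ins for Stream::stream output), so it runs without the library:

```php
<?php
// Tokenizer events for '<p>one<p>two</p>': note there is no
// ['end', 'p'] for the first paragraph.
$events = [
    ['start', ['p', []]],
    ['text',  'one'],
    ['start', ['p', []]],
    ['text',  'two'],
    ['end',   'p'],
];

$paragraphs = [];
$current = null;

foreach ($events as [$event, $data]) {
    if ($event === 'start' && $data[0] === 'p') {
        // A new <p> implicitly closes the previous one.
        if ($current !== null) {
            $paragraphs[] = $current;
        }
        $current = '';
    } elseif ($event === 'text' && $current !== null) {
        $current .= $data;
    } elseif ($event === 'end' && $data === 'p' && $current !== null) {
        $paragraphs[] = $current;
        $current = null;
    }
}

// $paragraphs is ['one', 'two'].
```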
Run this snippet on the Wikipedia fixture to compare streaming vs full parser:
use JustHTML\JustHTML;
use JustHTML\Stream;
$html = file_get_contents('examples/fixtures/wikipedia-earth.html');
function full_parse(string $html): string
{
    $doc = new JustHTML($html);
    foreach ($doc->query('#mw-content-text p') as $p) {
        $text = $p->toText();
        if ($text !== '') {
            return $text;
        }
    }
    return '';
}
function stream_parse(string $html): string
{
    // Void elements emit only a start event, so exclude them from depth tracking.
    $voidTags = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'];
    $inContainer = false;
    $containerDepth = 0;
    $capturing = false;
    $hasText = false;
    $segments = [];
    foreach (Stream::stream($html) as [$event, $data]) {
        if ($event === 'start') {
            [$tag, $attrs] = $data;
            $tag = (string)$tag;
            $attrs = is_array($attrs) ? $attrs : [];
            if (!$inContainer) {
                if ($tag === 'div' && (($attrs['id'] ?? '') === 'mw-content-text')) {
                    $inContainer = true;
                    $containerDepth = 1;
                }
                continue;
            }
            if (!in_array($tag, $voidTags, true)) {
                $containerDepth += 1;
            }
            if ($tag === 'p') {
                if ($capturing && $hasText) {
                    break;
                }
                $capturing = true;
                $hasText = false;
                $segments = [];
            }
            continue;
        }
        if ($event === 'end') {
            $tag = (string)$data;
            if ($inContainer) {
                $containerDepth -= 1;
                if ($containerDepth <= 0) {
                    $inContainer = false;
                }
            }
            if ($capturing && $tag === 'p') {
                if ($hasText) {
                    break;
                }
                $capturing = false;
                $hasText = false;
                $segments = [];
            }
            continue;
        }
        if ($event === 'text' && $capturing) {
            $text = trim($data);
            if ($text === '') {
                continue;
            }
            $segments[] = $text;
            $hasText = true;
        }
    }
    return $segments ? implode(' ', $segments) : '';
}
function bench(callable $fn, int $iterations): float
{
    $total = 0.0;
    for ($i = 0; $i < $iterations; $i++) {
        $t0 = hrtime(true);
        $fn();
        $t1 = hrtime(true);
        $total += ($t1 - $t0) / 1e6; // nanoseconds to milliseconds
    }
    return $total / $iterations;
}
$iterations = 5;
// Warm-up runs so the first timed iteration is not an outlier.
full_parse($html);
stream_parse($html);
$avgFull = bench(fn() => full_parse($html), $iterations);
$avgStream = bench(fn() => stream_parse($html), $iterations);
printf("Full parser avg: %.2f ms\n", $avgFull);
printf("Stream avg: %.2f ms\n", $avgStream);

Example results on one machine for lead paragraph extraction (PHP 8.5.1, 5-run average, Wikipedia fixture):
Full parser avg: 120.06 ms
Stream avg: 13.09 ms
Your results will vary by machine and PHP version.
Collect every link target from start-tag attributes:

$links = [];
foreach (Stream::stream($html) as [$event, $data]) {
    if ($event !== 'start') {
        continue;
    }
    [$tag, $attrs] = $data;
    if ($tag === 'a' && isset($attrs['href'])) {
        $links[] = $attrs['href'];
    }
}

Skip text inside script and style elements:

$skipDepth = 0;
foreach (Stream::stream($html) as [$event, $data]) {
    if ($event === 'start') {
        [$tag, $_] = $data;
        if ($tag === 'script' || $tag === 'style') {
            $skipDepth += 1;
        }
        continue;
    }
    if ($event === 'end') {
        $tag = $data;
        if ($skipDepth > 0 && ($tag === 'script' || $tag === 'style')) {
            $skipDepth -= 1;
        }
        continue;
    }
    if ($event === 'text' && $skipDepth === 0) {
        // Process visible text here.
    }
}

Extract the text inside the first <article> element:

$inArticle = false;
$depth = 0;
$text = [];
// Void elements emit only a start event, so exclude them from depth tracking.
$voidTags = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'];
foreach (Stream::stream($html) as [$event, $data]) {
    if ($event === 'start') {
        [$tag, $attrs] = $data;
        if (!$inArticle && $tag === 'article') {
            $inArticle = true;
            $depth = 1;
            continue;
        }
        if ($inArticle && !in_array($tag, $voidTags, true)) {
            $depth += 1;
        }
        continue;
    }
    if ($event === 'end') {
        if ($inArticle) {
            $depth -= 1;
            if ($depth <= 0) {
                break;
            }
        }
        continue;
    }
    if ($event === 'text' && $inArticle) {
        $text[] = $data;
    }
}
$articleText = trim(implode('', $text));

Collect all text as a single plain string:

$parts = [];
foreach (Stream::stream($html) as [$event, $data]) {
    if ($event === 'text') {
        $parts[] = $data;
    }
}
$plain = preg_replace('/\s+/', ' ', trim(implode('', $parts)));

Limitations:

- Streaming does not support CSS selectors. You must implement matching logic using tags and attributes.
- The stream reflects tokenizer output. It does not fix malformed HTML or insert implicit end tags.
- For very large files, streaming avoids DOM memory use, but the input still has to fit in memory.
Stream::stream($html, $encoding = null, $bytes = false) accepts either text or raw bytes.
If $bytes is true, the input is decoded to UTF-8 using the provided $encoding, or via HTML encoding sniffing when no encoding is given.
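To illustrate what the $bytes mode does (a sketch of the documented decoding step, not the library's internal code), here is the equivalent conversion of Latin-1 bytes to UTF-8 using PHP's mbstring:

```php
<?php
// Raw bytes: "café" encoded as ISO-8859-1, where é is the single byte 0xE9.
$bytes = "caf\xE9";

// Stream::stream($bytes, 'ISO-8859-1', true) is documented to decode the
// input to UTF-8 first; mb_convert_encoding shows that step on its own.
$utf8 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

// é is now the two-byte UTF-8 sequence 0xC3 0xA9,
// so $utf8 === "caf\u{E9}" and strlen($utf8) === 5.
```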