This tap provides the justhtml CLI via Homebrew.
justhtml is an HTML5 parser CLI with CSS selectors and full html5lib compliance.
brew install diffen/justhtml/justhtml
justhtml --version
The section below is synced from diffen/justhtml-php/CLI.md.
Commands are rewritten to use justhtml for Homebrew.
The justhtml CLI parses HTML, optionally selects nodes with a CSS selector, and outputs HTML, text, or Markdown.
It accepts either a file path or - for stdin.
Run it:
- From this repo:
justhtml
- From a Composer install:
justhtml
Create a small input file:
cat > sample.html <<'HTML'
<!doctype html>
<html>
<body>
<article id="post">
<h1>Title</h1>
<p class="lead">Hello <em>world</em>!</p>
<p>Second <span>para</span>.</p>
</article>
</body>
</html>
HTML
Create a whitespace-focused file:
cat > whitespace.html <<'HTML'
<!doctype html>
<html><body>
<p class="sep">Alpha<span>Beta</span>Gamma</p>
<p class="ws"> Hello <span> world </span> ! </p>
</body></html>
HTML
Select matching nodes (single selector):
justhtml sample.html --selector "p.lead" --format text
Output:
Hello world!
Select multiple selectors with a comma-separated list:
justhtml sample.html --selector "h1, p.lead" --format text
Output:
Title
Hello world!
Choose output format: html, text, or markdown.
HTML output:
justhtml sample.html --selector "p.lead" --format html
Output:
<p class="lead">
Hello
<em>world</em>
!
</p>
Text output:
justhtml sample.html --selector "p.lead" --format text
Output:
Hello world!
Markdown output:
justhtml sample.html --selector "p.lead" --format markdown
Output:
Hello *world*!
HTML output uses outer HTML by default. Use --inner to print only the
matched node's children (inner HTML). --outer is a no-op that makes the
default explicit. These flags only affect --format html.
justhtml sample.html --selector "p.lead" --format html --inner
Output:
Hello
<em>world</em>
!
Extract attribute values from matched nodes. Repeat --attr to output multiple
attributes per node (tab-separated by default). Missing attributes are replaced
with __MISSING__ by default; override with --missing.
justhtml sample.html --selector "p" --attr class --attr id
Output (tab-separated):
lead __MISSING__
__MISSING__ __MISSING__
Use --separator to change the field separator:
justhtml sample.html --selector "p" --attr class --attr id --separator ","
--attr cannot be combined with --format, --inner, --outer, or --count.
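Because missing attributes are printed as a non-empty placeholder, every row keeps the same number of fields, which makes the output straightforward to consume from a shell loop. The sketch below is one possible way to do that and is not part of justhtml itself. The attr_rows function is a hypothetical stand-in that replays the two rows from the documented output so the loop runs without the CLI installed; in practice you would replace it with the real justhtml command.

```shell
# Hypothetical stand-in for:
#   justhtml sample.html --selector "p" --attr class --attr id
# It replays the two tab-separated rows from the documented output.
attr_rows() {
  printf 'lead\t__MISSING__\n'
  printf '__MISSING__\t__MISSING__\n'
}

tab=$(printf '\t')   # literal tab, the default field separator in --attr mode
rows=$(attr_rows)
missing_id=0

# Split each row on tabs; the placeholder keeps the columns aligned.
while IFS="$tab" read -r class id; do
  if [ "$id" = "__MISSING__" ]; then
    missing_id=$((missing_id + 1))
  fi
done <<EOF
$rows
EOF

echo "paragraphs without an id: $missing_id"
```

The here-document keeps the loop in the current shell, so the counter is still set after the loop ends; piping into while would run the loop in a subshell in many shells and lose it.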
Limit to the first match with --first. Without it, every match is printed:
justhtml sample.html --selector "p" --format text
Output:
Hello world!
Second para.
justhtml sample.html --selector "p" --format text --first
Output:
Hello world!
--first is equivalent to --limit 1 and cannot be combined with --limit.
Use --limit N to keep only the first N matches. --limit 1 is equivalent to --first.
justhtml sample.html --selector "p" --format text --limit 2
Output:
Hello world!
Second para.
Print the number of matching nodes:
justhtml sample.html --selector "p" --count
Output:
2
--count cannot be combined with --first, --limit, --format, or --attr.
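In scripts, the count is a convenient way to test whether a selector matches anything. The following is a minimal sketch, assuming --count prints a bare integer (as in the output above) and that justhtml exits non-zero when it fails; has_matches is a hypothetical helper name, not something the CLI provides.

```shell
# has_matches FILE SELECTOR
# Succeeds (exit 0) when at least one node matches, using the documented --count flag.
# Assumption: justhtml exits non-zero on errors such as a missing file.
has_matches() {
  count=$(justhtml "$1" --selector "$2" --count) || return 2
  [ "$count" -gt 0 ]
}

# Usage, guarded so the snippet does nothing where justhtml or sample.html is absent.
if command -v justhtml >/dev/null 2>&1 && [ -f sample.html ]; then
  if has_matches sample.html "p"; then
    echo "sample.html contains paragraphs"
  fi
fi
```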
Use --separator to join text nodes with a custom string (text output only). In --attr mode,
the same flag sets the field separator (default: tab).
justhtml whitespace.html --selector ".sep" --format text
Output:
Alpha Beta Gamma
justhtml whitespace.html --selector ".sep" --format text --separator ""
Output:
AlphaBetaGamma
By default, each text node is trimmed and empty nodes are dropped (--strip).
Use --no-strip to preserve the original whitespace within text nodes.
Default (strip on):
justhtml whitespace.html --selector ".ws" --format text
Output:
Hello world !
Preserve whitespace:
justhtml whitespace.html --selector ".ws" --format text --no-strip
Output (spaces shown between | markers):
| Hello world ! |
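Leading and trailing spaces are hard to see in a terminal, so it can help to wrap each output line in visible markers when comparing --strip and --no-strip. The sketch below shows one way to do that with sed; show_edges is a hypothetical helper name, and the markers are added by the shell, not by justhtml.

```shell
# show_edges: wraps every input line in | markers so edge whitespace is visible.
show_edges() {
  sed -e 's/^/|/' -e 's/$/|/'
}

# Usage, guarded so the snippet does nothing where justhtml or whitespace.html is absent.
if command -v justhtml >/dev/null 2>&1 && [ -f whitespace.html ]; then
  justhtml whitespace.html --selector ".ws" --format text --no-strip | show_edges
fi
```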
Read from stdin by passing - as the path:
cat sample.html | justhtml - --selector "p.lead" --format text
Output:
Hello world!
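Since - makes justhtml read standard input, any command that produces HTML can feed it, including a here-document for quick experiments. The wrapper below is a small sketch of that idea; text_of is a hypothetical function name, and it only uses the flags documented above.

```shell
# text_of SELECTOR: reads HTML on stdin and prints the matching nodes as text.
text_of() {
  justhtml - --selector "$1" --format text
}

# Usage with inline HTML, guarded so the snippet does nothing without justhtml.
if command -v justhtml >/dev/null 2>&1; then
  text_of "p.lead" <<'HTML'
<p class="lead">Hello <em>world</em>!</p>
HTML
fi
```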
These examples use a live page and pipe HTML into justhtml.
# Extract the first non-empty paragraph as text
curl -s https://en.wikipedia.org/wiki/Earth | \
justhtml - --selector "#mw-content-text p:not(:empty)" --format text --first
# Extract links from the lead section (first 10 hrefs)
curl -s https://en.wikipedia.org/wiki/Earth | \
justhtml - --selector "#mw-content-text p a" --attr href --limit 10 --separator "\n"
# Get the lead section as Markdown
curl -s https://en.wikipedia.org/wiki/Earth | \
justhtml - --selector "#mw-content-text" --format markdown --first
# Count images on the page
curl -s https://en.wikipedia.org/wiki/Earth | \
justhtml - --selector "img" --count
# Output the infobox as HTML (outer HTML)
curl -s https://en.wikipedia.org/wiki/Earth | \
justhtml - --selector "table.infobox" --format html --outer --first
# Preserve whitespace and separate paragraphs
curl -s https://en.wikipedia.org/wiki/Earth | \
justhtml - --selector "#mw-content-text p" --format text --no-strip --separator "\n\n" --limit 3
# Build a quick table of contents from headings
curl -s https://en.wikipedia.org/wiki/Earth | \
justhtml - --selector "#mw-content-text h2, #mw-content-text h3" --format text --separator "\n"
justhtml --version
Output:
justhtml dev
justhtml --help
Output: prints the full usage/help text.
brew upgrade justhtml
brew uninstall justhtml
If you installed via the tap and want to remove it:
brew untap diffen/justhtml
Make sure your Homebrew prefix is on PATH:
brew --prefix
Then ensure $(brew --prefix)/bin is on your PATH.
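The PATH check can be scripted. The following is a minimal sketch, assuming a POSIX-style shell; on_path is a hypothetical helper name, and the suggested export line is a common convention, not something this tap configures for you.

```shell
# on_path DIR: succeeds when DIR is one of the colon-separated PATH entries.
on_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage, guarded so the snippet does nothing where Homebrew is not installed.
if command -v brew >/dev/null 2>&1; then
  brew_bin="$(brew --prefix)/bin"
  if on_path "$brew_bin"; then
    echo "$brew_bin is on PATH"
  else
    echo "add this to your shell profile: export PATH=\"$brew_bin:\$PATH\""
  fi
fi
```

Wrapping PATH in colons before matching avoids false positives from entries that merely contain the directory as a prefix.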
If you see an Xdebug warning from your PHP configuration, you can disable it for a single run:
XDEBUG_MODE=off justhtml --version
The formula lives at:
Formula/justhtml.rb
MIT