diffen/homebrew-justhtml

Homebrew Tap for justhtml

This tap provides the justhtml CLI via Homebrew.

justhtml is an HTML5 parser CLI with CSS selectors and full html5lib compliance.

Install

brew install diffen/justhtml/justhtml

Verify

justhtml --version

CLI Documentation

The section below is synced from diffen/justhtml-php/CLI.md. Commands are rewritten to use justhtml for Homebrew.

CLI

The justhtml CLI parses HTML, optionally selects nodes with a CSS selector, and outputs HTML, text, or Markdown. It accepts either a file path or - for stdin.

After installing from this tap, run it as justhtml.

Sample input used below

Create a small input file:

cat > sample.html <<'HTML'
<!doctype html>
<html>
  <body>
    <article id="post">
      <h1>Title</h1>
      <p class="lead">Hello <em>world</em>!</p>
      <p>Second <span>para</span>.</p>
    </article>
  </body>
</html>
HTML

Create a whitespace-focused file:

cat > whitespace.html <<'HTML'
<!doctype html>
<html><body>
  <p class="sep">Alpha<span>Beta</span>Gamma</p>
  <p class="ws">  Hello <span> world </span> ! </p>
</body></html>
HTML

--selector

Select matching nodes (single selector):

justhtml sample.html --selector "p.lead" --format text

Output:

Hello world!

Select multiple selectors with a comma-separated list:

justhtml sample.html --selector "h1, p.lead" --format text

Output:

Title
Hello world!

--format

Choose output format: html, text, or markdown.

HTML output:

justhtml sample.html --selector "p.lead" --format html

Output:

<p class="lead">
  Hello
  <em>world</em>
  !
</p>

Text output:

justhtml sample.html --selector "p.lead" --format text

Output:

Hello world!

Markdown output:

justhtml sample.html --selector "p.lead" --format markdown

Output:

Hello *world*!

--outer / --inner

HTML output uses outer HTML by default. Use --inner to print only the matched node's children (inner HTML). --outer is a no-op that makes the default explicit. These flags only affect --format html.

justhtml sample.html --selector "p.lead" --format html --inner

Output:

Hello
<em>world</em>
!

--attr / --missing

Extract attribute values from matched nodes. Repeat --attr to output multiple attributes per node (tab-separated by default). Missing attributes are replaced with __MISSING__ by default; override with --missing.

justhtml sample.html --selector "p" --attr class --attr id

Output (tab-separated):

lead	__MISSING__
__MISSING__	__MISSING__

Use --separator to change the field separator:

justhtml sample.html --selector "p" --attr class --attr id --separator ","

--attr cannot be combined with --format, --inner, --outer, or --count.
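
Because the output is line-oriented and tab-separated, it can be post-processed with standard tools. The sketch below filters the sample rows shown above and keeps only nodes whose first attribute is present. The function names sample_rows and with_first_attr are hypothetical helpers for illustration, not part of justhtml, and the filter assumes the default __MISSING__ placeholder and tab separator.

```shell
# Sample rows copied from the --attr example above: class<TAB>id, one line per matched node.
sample_rows() {
  printf 'lead\t__MISSING__\n__MISSING__\t__MISSING__\n'
}

# Hypothetical filter: keep rows whose first field is not the default placeholder.
with_first_attr() {
  awk -F '\t' '$1 != "__MISSING__"'
}

sample_rows | with_first_attr
```

In a real pipeline, replace sample_rows with justhtml sample.html --selector "p" --attr class --attr id.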

--first

Limit output to the first match. Without --first, every match is printed:

justhtml sample.html --selector "p" --format text

Output:

Hello world!
Second para.

With --first, only the first match is printed:

justhtml sample.html --selector "p" --format text --first

Output:

Hello world!

--first is equivalent to --limit 1 and cannot be combined with --limit.

--limit

Limit to the first N matches. This is equivalent to --first when N is 1.

justhtml sample.html --selector "p" --format text --limit 2

Output:

Hello world!
Second para.

--count

Print the number of matching nodes:

justhtml sample.html --selector "p" --count

Output:

2

--count cannot be combined with --first, --limit, --format, or --attr.
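
In scripts, the count can drive a conditional. The following is a minimal sketch, assuming --count prints a bare integer as in the example above and that justhtml exits non-zero on failure (the exit-code behavior is an assumption, not documented here). The helper name has_match is hypothetical.

```shell
# Hypothetical helper: succeed only if the selector matches at least one node.
# Assumes --count prints a bare integer and that justhtml exits non-zero on error.
has_match() {
  count=$(justhtml "$1" --selector "$2" --count) || return 2
  [ "$count" -gt 0 ]
}

# Usage:
#   has_match sample.html "p.lead" && echo "found"
```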

--separator

Join text nodes with a custom separator (text output only). In --attr mode, this controls the field separator (default: tab).

justhtml whitespace.html --selector ".sep" --format text

Output:

Alpha Beta Gamma

With an empty separator:

justhtml whitespace.html --selector ".sep" --format text --separator ""

Output:

AlphaBetaGamma

--strip / --no-strip

By default, each text node is trimmed and empty nodes are dropped (--strip). Use --no-strip to preserve the original whitespace within text nodes.

Default (strip on):

justhtml whitespace.html --selector ".ws" --format text

Output:

Hello world !

Preserve whitespace:

justhtml whitespace.html --selector ".ws" --format text --no-strip

Output (spaces shown between | markers):

|  Hello   world   ! |
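
To make leading and trailing whitespace visible in a terminal, each output line can be wrapped in | markers, as in the listing above. This is a small sketch using sed; the helper name show_ws is hypothetical and not part of justhtml.

```shell
# Hypothetical helper: wrap each input line in | markers so edge whitespace is visible.
show_ws() {
  sed 's/^/|/; s/$/|/'
}

# Usage:
#   justhtml whitespace.html --selector ".ws" --format text --no-strip | show_ws
```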

Stdin

Read from stdin by passing - as the path:

cat sample.html | justhtml - --selector "p.lead" --format text

Output:

Hello world!

Piping examples (real pages)

These examples use a live page and pipe HTML into justhtml.

# Extract the first non-empty paragraph as text
curl -s https://en.wikipedia.org/wiki/Earth | \
  justhtml - --selector "#mw-content-text p:not(:empty)" --format text --first

# Extract links from the lead section (first 10 hrefs)
curl -s https://en.wikipedia.org/wiki/Earth | \
  justhtml - --selector "#mw-content-text p a" --attr href --limit 10 --separator "\n"

# Get the lead section as Markdown
curl -s https://en.wikipedia.org/wiki/Earth | \
  justhtml - --selector "#mw-content-text" --format markdown --first

# Count images on the page
curl -s https://en.wikipedia.org/wiki/Earth | \
  justhtml - --selector "img" --count

# Output the infobox as HTML (outer HTML)
curl -s https://en.wikipedia.org/wiki/Earth | \
  justhtml - --selector "table.infobox" --format html --outer --first

# Preserve whitespace and separate paragraphs
curl -s https://en.wikipedia.org/wiki/Earth | \
  justhtml - --selector "#mw-content-text p" --format text --no-strip --separator "\n\n" --limit 3

# Build a quick table of contents from headings
curl -s https://en.wikipedia.org/wiki/Earth | \
  justhtml - --selector "#mw-content-text h2, #mw-content-text h3" --format text --separator "\n"
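
Each example above downloads the page again. When running several queries against the same page, it may be preferable to fetch it once and query a local copy. The sketch below does that with a temporary file; the function name page_summary is hypothetical, and the selectors are reused from the examples above.

```shell
# Hypothetical helper: download a page once, then run several queries on the local copy.
page_summary() {
  url=$1
  tmp=$(mktemp) || return 1
  # -f: fail on HTTP errors, -sS: quiet but still report errors, -L: follow redirects
  if curl -fsSL "$url" -o "$tmp"; then
    echo "images: $(justhtml "$tmp" --selector "img" --count)"
    echo "first paragraph: $(justhtml "$tmp" --selector "#mw-content-text p:not(:empty)" --format text --first)"
    status=0
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}

# Usage:
#   page_summary https://en.wikipedia.org/wiki/Earth
```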

--version and --help

justhtml --version

Output:

justhtml dev

justhtml --help

Output: prints the full usage/help text.

Upgrading

brew upgrade justhtml

Uninstall

brew uninstall justhtml

If you installed via the tap and want to remove it:

brew untap diffen/justhtml

Troubleshooting

“justhtml: command not found”

Homebrew links the command under its prefix. Print the prefix with:

brew --prefix

Then ensure $(brew --prefix)/bin is on your PATH.
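
One way to run that check from a shell is sketched below. The helper name path_has is hypothetical; it does an exact match on PATH entries, so a trailing slash on either side would count as a different directory.

```shell
# Hypothetical helper: succeed if the directory "$1" is an entry in PATH.
path_has() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage, assuming Homebrew is installed:
#   path_has "$(brew --prefix)/bin" || echo "add $(brew --prefix)/bin to PATH"
```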

Xdebug warning on justhtml --version

If you see an Xdebug warning from your PHP configuration, you can disable it for a single run:

XDEBUG_MODE=off justhtml --version
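
If the warning appears on every run, a small wrapper function avoids retyping the variable. This is a sketch only; the name justhtml_quiet is hypothetical, and the variable is set for that single invocation rather than for the whole shell session.

```shell
# Hypothetical wrapper: run justhtml with Xdebug disabled for that invocation only.
justhtml_quiet() {
  XDEBUG_MODE=off justhtml "$@"
}

# Usage:
#   justhtml_quiet --version
```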

Formula

The formula lives at:

  • Formula/justhtml.rb

License

MIT
