CSS support for formatting styles #20

@sgtatham

There are situations in which it would be useful for html2text to understand at least a small amount of CSS.

An occasional annoyance I find with some web pages is that they use different classes of <span> (or <div>, depending on preference) for all their formatting, including both paragraph separation and inline style changes such as emphasis. Then they rely on CSS to make some of those span classes behave like <p>, some like <em>, some like <code> and so on.

html2text can't render a document of that kind sensibly unless it speaks enough CSS to at least know which classes of <span> it should treat like which normal tags. Otherwise you end up with one huge megaparagraph, or alternatively no end of spurious newlines (depending on whether the author went all-spans or all-divs).

I don't have a real-world example handy, but here's one I mocked up manually:

<html>
<head>
<title>Demo of the 'spans-everywhere' school of HTML</title>
<style type="text/css">
.p { display: block; margin-bottom: 1em; }
.em { font-style: italic; }
.code { font-family: monospace; }
</style>
</head>
<body>
<span class="p">Paragraph one, containing <span class="em">emphasis</span>.</span><span class="p">Paragraph two, containing <span class="code">code</span>.</span>
</body>
</html>
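
To make the request concrete, here is a rough sketch in Python (standard library only) of the kind of mapping described above: read the simple class rules from the stylesheet, decide which classes behave like <p>, <em> or <code>, and render accordingly. This is not html2text's code or API. The names class_roles and SpanRenderer are hypothetical, the CSS handling covers only plain `.class { prop: value; }` rules, and the `*...*` and backtick markers are an arbitrary choice of output style.

```python
import re
from html.parser import HTMLParser


def class_roles(css):
    """Map class names to 'p', 'em' or 'code' (hypothetical helper).

    Assumption: only simple `.class { prop: value; }` rules are handled.
    """
    roles = {}
    for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css):
        decls = {}
        for decl in body.split(";"):
            if ":" in decl:
                prop, value = decl.split(":", 1)
                decls[prop.strip().lower()] = value.strip().lower()
        if decls.get("display") == "block":
            roles[name] = "p"
        elif decls.get("font-style") == "italic":
            roles[name] = "em"
        elif "monospace" in decls.get("font-family", ""):
            roles[name] = "code"
    return roles


class SpanRenderer(HTMLParser):
    """Render <span>/<div> elements according to the role of their class."""

    MARKS = {"em": "*", "code": "`"}

    def __init__(self, roles):
        super().__init__()
        self.roles = roles
        self.stack = []  # role (or None) of each open <span>/<div>
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("span", "div"):
            return
        classes = (dict(attrs).get("class") or "").split()
        role = next((self.roles[c] for c in classes if c in self.roles), None)
        self.stack.append(role)
        self.out.append(self.MARKS.get(role, ""))

    def handle_endtag(self, tag):
        if tag not in ("span", "div") or not self.stack:
            return
        role = self.stack.pop()
        # A paragraph-like element ends with a blank line; inline roles close their marker.
        self.out.append("\n\n" if role == "p" else self.MARKS.get(role, ""))

    def handle_data(self, data):
        if self.stack:  # ignore text outside any span/div, e.g. <title> or <style>
            self.out.append(data)

    def text(self):
        return "".join(self.out).strip()


CSS = """
.p { display: block; margin-bottom: 1em; }
.em { font-style: italic; }
.code { font-family: monospace; }
"""
HTML = (
    '<span class="p">Paragraph one, containing <span class="em">emphasis</span>.</span>'
    '<span class="p">Paragraph two, containing <span class="code">code</span>.</span>'
)

renderer = SpanRenderer(class_roles(CSS))
renderer.feed(HTML)
print(renderer.text())
```

With the mocked-up document above, this produces two separate paragraphs with the emphasis and code spans marked, rather than a single run-on line. A real implementation would need proper selector matching and cascading; the sketch only shows the class-to-role idea.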

@jugglerchris mentioned that another use case is pages that use display: none.
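
The display: none case could be sketched the same way, again in standard-library Python and not as html2text's actual behaviour. The helper names hidden_classes and VisibleText and the class name "hidden" are made up for illustration, and only simple `.class { ... }` rules are recognised.

```python
import re
from html.parser import HTMLParser

# Elements with no closing tag; they must not affect the nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


def hidden_classes(css):
    """Return class names whose simple `.class { ... }` rule sets display: none."""
    return {
        name
        for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css)
        if re.search(r"display\s*:\s*none", body, re.IGNORECASE)
    }


class VisibleText(HTMLParser):
    """Collect text, skipping everything inside a hidden element."""

    def __init__(self, hidden):
        super().__init__()
        self.hidden = hidden
        self.depth = 0  # > 0 while inside a hidden element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or any(c in self.hidden for c in classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)


parser = VisibleText(hidden_classes(".hidden { display: none; }"))
parser.feed('Visible <span class="hidden">secret <b>nested</b></span>text.')
print("".join(parser.out))
```

The depth counter is what keeps nested elements inside a hidden span from leaking through; it assumes reasonably well-nested markup.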
