Text Processing Rules

mdka applies a small set of deterministic rules to produce consistent, readable Markdown from any HTML text content.

Whitespace Normalisation

HTML text nodes are normalised according to the HTML whitespace collapsing rules:

Leading and trailing whitespace is trimmed in block-level contexts.
Consecutive whitespace characters (spaces, tabs, newlines) within a text node are collapsed to a single space.
A single space is preserved between adjacent inline elements.
<br> produces a hard line break (two trailing spaces followed by \n).
<pre> blocks are exempt — whitespace inside <pre> is reproduced exactly.

This is done in a single pass without regular expressions, which keeps allocation overhead low.
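The collapsing and trimming rules above can be sketched as a single loop over the characters of a text node. This is a minimal illustration, not mdka's actual code: the function name `collapse_whitespace` is hypothetical, and it assumes the caller has already decided that the text is in a block-level context and is not inside <pre>.

```rust
/// Collapses each run of ASCII whitespace into a single space and trims
/// both ends, in one pass and without regular expressions.
/// Hypothetical helper for illustration only; not part of mdka's API.
fn collapse_whitespace(input: &str) -> String {
    // One allocation up front: the result is never longer than the input.
    let mut out = String::with_capacity(input.len());
    let mut pending_space = false;

    for ch in input.chars() {
        if ch.is_ascii_whitespace() {
            // Remember the run, but only once some text has been emitted,
            // so leading whitespace is dropped.
            pending_space = !out.is_empty();
        } else {
            if pending_space {
                out.push(' ');
                pending_space = false;
            }
            out.push(ch);
        }
    }

    // A pending space at the end is never flushed, which trims
    // trailing whitespace.
    out
}
```

With this sketch, `collapse_whitespace("  Hello \t\n  world  ")` yields `"Hello world"`. A non-breaking space (U+00A0) is not ASCII whitespace, so it passes through unchanged.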

Markdown Character Escaping

To prevent accidental Markdown formatting, the following characters are escaped with a backslash when they appear in text content that is not inside a code span or code block:

Character	Escaped as	Context
`*`	`\*`	Would create emphasis
`_`	`\_`	Would create emphasis
`` ` ``	`` \` ``	Would start a code span
`#`	`\#`	At the start of a line, would create a heading
`[`	`\[`	Would start a link
`!`	`\!`	Before `[`, would start an image
`\`	`\\`	The escape character itself

Escaping is context-aware: a # is escaped only at the start of a line, where it would be interpreted as an ATX heading. A # in the middle of a line is left as-is.
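The table above can be read as a small state machine that tracks whether the current character is at the start of a line. The following is a rough sketch under stated assumptions, not mdka's implementation: `escape_markdown` is a hypothetical name, "start of a line" is taken to mean the very first character or the character directly after \n, and the input is assumed to be text that is already known to lie outside code spans and code blocks.

```rust
/// Backslash-escapes characters that would otherwise trigger Markdown
/// formatting. Hypothetical helper for illustration; not mdka's API.
fn escape_markdown(text: &str) -> String {
    let mut out = String::with_capacity(text.len() + 8);
    let mut at_line_start = true;
    let mut chars = text.chars().peekable();

    while let Some(ch) = chars.next() {
        let needs_escape = match ch {
            // Always escaped: the escape character itself, emphasis
            // markers, code span delimiter, link opener.
            '\\' | '*' | '_' | '`' | '[' => true,
            // Only a heading marker at the start of a line.
            '#' => at_line_start,
            // Only an image marker when directly followed by '['.
            '!' => chars.peek() == Some(&'['),
            _ => false,
        };
        if needs_escape {
            out.push('\\');
        }
        out.push(ch);
        // Assumption: a line starts immediately after '\n'.
        at_line_start = ch == '\n';
    }
    out
}
```

Under these assumptions, `"# not a heading"` becomes `"\# not a heading"`, while the # in `"C# and *stars*"` is left alone and only the asterisks are escaped.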

HTML Entity Decoding

HTML entities in text nodes are decoded by the HTML parser (scraper / html5ever) before mdka processes them. The result is already Unicode text:

HTML entity	After parsing	In Markdown
`&amp;`	`&`	`&`
`&lt;`	`<`	`<`
`&gt;`	`>`	`>`
`&nbsp;`	non-breaking space	preserved as space

Output Boundaries

Output always ends with exactly one newline (\n) when the input produces any content; the output is empty for empty input.
Leading blank lines that scraper adds when wrapping content in <html><body> are trimmed before the final string is returned.
Block elements (paragraphs, headings, lists, etc.) are separated by blank lines.
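The first two boundary rules can be illustrated with a small post-processing step applied to the converted string. This is a sketch only: `finalize_output` is a hypothetical name, and it assumes that whitespace-only output counts as "no content".

```rust
/// Normalises the start and end of converted Markdown.
/// Hypothetical helper for illustration; not part of mdka's API.
fn finalize_output(raw: &str) -> String {
    // Drop leading blank lines and any trailing whitespace.
    let trimmed = raw
        .trim_start_matches(|c: char| c == '\n' || c == '\r')
        .trim_end();

    // Input with no content produces empty output.
    if trimmed.is_empty() {
        return String::new();
    }

    // Otherwise the output ends with exactly one newline.
    let mut out = String::with_capacity(trimmed.len() + 1);
    out.push_str(trimmed);
    out.push('\n');
    out
}
```

For example, `finalize_output("\n\n# Title\n\nBody\n\n\n")` returns `"# Title\n\nBody\n"`: the blank line separating the heading from the paragraph is kept, and only the edges are normalised.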
