Text Processing Rules

mdka applies a small set of deterministic rules to produce consistent, readable Markdown from any HTML text content.

Whitespace Normalisation

HTML text nodes are normalised according to the HTML whitespace collapsing rules:

  • Leading and trailing whitespace is trimmed in block-level contexts.
  • Consecutive whitespace characters (spaces, tabs, newlines) within a text node are collapsed to a single space.
  • A single space is preserved between adjacent inline elements.
  • <br> produces a hard line break (trailing spaces followed by \n).
  • <pre> blocks are exempt — whitespace inside <pre> is reproduced exactly.

This is done in a single pass without regular expressions, which keeps allocation overhead low.
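
The collapsing and trimming rules can be sketched as a single loop over the characters of a text node. This is a minimal illustration and not mdka's actual code: the function name `collapse_whitespace` is hypothetical, it always trims as if in a block-level context, and it leaves out the handling of inline elements, <br>, and <pre>.

```rust
/// Hypothetical helper (not part of mdka's API): collapses runs of ASCII
/// whitespace into one space and trims both ends, in a single pass and
/// without regular expressions.
fn collapse_whitespace(input: &str) -> String {
    // One up-front allocation; the result is never longer than the input.
    let mut out = String::with_capacity(input.len());
    let mut pending_space = false;

    for ch in input.chars() {
        if matches!(ch, ' ' | '\t' | '\n' | '\r' | '\x0C') {
            // Remember the whitespace but emit nothing yet. While `out` is
            // still empty this is leading whitespace, so it is dropped.
            pending_space = !out.is_empty();
        } else {
            if pending_space {
                // Flush the whole run as a single space.
                out.push(' ');
                pending_space = false;
            }
            out.push(ch);
        }
    }
    // A pending space left over here was trailing whitespace: never emitted.
    out
}
```

Calling `collapse_whitespace("  Hello \t\n  world  ")` returns `"Hello world"`. Because the space is only written once the next non-whitespace character arrives, trailing whitespace is trimmed without a second pass. A non-breaking space (U+00A0) is not in the matched set, so this sketch passes it through unchanged.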

Markdown Character Escaping

To prevent accidental Markdown formatting, the following characters are escaped with a backslash when they appear in text content that is not inside a code span or code block:

Character   Escaped as   Context
*           \*           Would create emphasis
_           \_           Would create emphasis
`           \`           Would start a code span
#           \#           At the start of a line, would create a heading
[           \[           Would start a link
!           \!           Before [, would start an image
\           \\           The escape character itself

Escaping is context-aware: a # in the middle of a line is not escaped, only at the start of a line where it would be interpreted as an ATX heading.
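
The escaping rules can be expressed as a small character-level function. The sketch below is an assumption-laden illustration, not mdka's implementation: `escape_markdown` is a hypothetical name, "start of a line" is taken to mean the first character of the text or the character right after \n, and the caller is assumed to have excluded code spans and code blocks already.

```rust
/// Hypothetical helper (not part of mdka's API): backslash-escapes the
/// characters that would otherwise be read as Markdown syntax.
fn escape_markdown(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut at_line_start = true;
    let mut chars = text.chars().peekable();

    while let Some(ch) = chars.next() {
        let needs_escape = match ch {
            // Always escaped in plain text content.
            '\\' | '*' | '_' | '`' | '[' => true,
            // Context-aware: only a heading marker at the start of a line.
            '#' => at_line_start,
            // Context-aware: only an image marker when followed by '['.
            '!' => chars.peek() == Some(&'['),
            _ => false,
        };
        if needs_escape {
            out.push('\\');
        }
        out.push(ch);
        // Assumption: a line starts immediately after a newline.
        at_line_start = ch == '\n';
    }
    out
}
```

For example, `escape_markdown("# Title with # and *stars*")` yields `\# Title with # and \*stars\*`: the first # is escaped and the second is left alone. Looking one character ahead for ! keeps ordinary exclamation marks unescaped.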

HTML Entity Decoding

HTML entities in text nodes are decoded by the HTML parser (scraper / html5ever) before mdka processes them. The result is already Unicode text:

HTML entity   After parsing        In Markdown
&amp;         &                    &
&lt;          <                    <
&gt;          >                    >
&nbsp;        non-breaking space   preserved as space

Output Boundaries

  • Output always ends with exactly one newline (\n) when the input produces any content; the output is empty for empty input.
  • Leading blank lines that scraper adds when wrapping content in <html><body> are trimmed before the final string is returned.
  • Block elements (paragraphs, headings, lists, etc.) are separated by blank lines.
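
The boundary rules above amount to a final clean-up step on the assembled Markdown. The following is a rough sketch under stated assumptions: `finalize_output` is a hypothetical name, the blocks are assumed to be already joined with blank lines, and only newline characters are treated as leading blank lines.

```rust
/// Hypothetical helper (not part of mdka's API): trims leading blank lines
/// and guarantees exactly one trailing newline, or an empty string when
/// there is no content.
fn finalize_output(raw: &str) -> String {
    // Drop leading newlines and any trailing whitespace.
    let trimmed = raw
        .trim_start_matches(|c: char| c == '\n' || c == '\r')
        .trim_end();

    // Empty or whitespace-only input produces empty output.
    if trimmed.is_empty() {
        return String::new();
    }

    // Content is present: end with exactly one newline.
    let mut out = String::with_capacity(trimmed.len() + 1);
    out.push_str(trimmed);
    out.push('\n');
    out
}
```

With this sketch, `finalize_output("\n\n# Title\n\nBody\n\n\n")` returns `"# Title\n\nBody\n"`. The blank line between the heading and the paragraph is kept, and only the edges of the string change.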