Design Philosophy

The Goal: Balance, not Dominance

There are excellent HTML-to-Markdown libraries in the Rust ecosystem — some prioritise raw speed, others maximise conversion fidelity. mdka is not trying to beat them on every axis.

Its aim is a practical balance:

Produce stable, readable Markdown from real-world HTML, with an easy API, without surprising the caller at runtime.

Speed and memory efficiency matter, and mdka is designed with both in mind. But they are means to an end, not the end itself.

Real-World HTML is Messy

Web content rarely arrives as clean, well-formed documents. In practice you encounter:

HTML that a CMS generated and no human ever wrote
SPA-rendered DOM fragments extracted from DevTools
Scraped pages with ad slots, cookie banners, and navigation wrapped around the content
Documents with 5,000 levels of nested <div> elements
Missing closing tags, duplicate attributes, and unknown elements

mdka uses scraper, which is built on html5ever — the same parser used by the Servo browser engine. It applies the HTML5 parsing algorithm, meaning: unknown elements are handled gracefully, missing tags are inferred, and the result is always a well-formed DOM tree, regardless of the input quality.

No Stack Overflows

A common failure mode in tree-processing code is stack overflow on deeply nested input. mdka uses an explicit Vec-based stack (non-recursive DFS) for every tree traversal — both in the pre-processing pipeline and in the Markdown conversion step. This means it handles any nesting depth that fits in heap memory.

Configurable Pre-Processing

HTML from different sources needs different treatment. A page scraped from a news site has navigation, advertising, and footer content that a content extraction pipeline wants to remove. A document being archived for audit purposes should retain as much as possible.

The five conversion modes encode these intent differences as named, opinionated presets. They are applied in a pre-processing pass that filters the DOM before Markdown conversion runs — keeping the conversion logic itself simple and mode-agnostic.

One Allocator, Minimal Copies

The conversion pipeline is designed to minimise heap allocations:

Whitespace normalisation is done in a single pass, writing directly into the output String.
No regular expressions are used at runtime (avoiding compiled regex objects).
The output String is pre-allocated with a capacity estimate.
The #[global_allocator] counter in the CLI and benchmarks measures this directly.