Architecture
Workspace Layout
mdka/
├── src/ mdka library crate (lib only)
│ ├── lib.rs Public API surface
│ ├── options.rs ConversionMode, ConversionOptions
│ ├── traversal.rs Markdown conversion traversal
│ ├── renderer.rs MarkdownRenderer state machine
│ ├── utils.rs Whitespace normalisation + escaping
│ └── alloc_counter.rs Custom allocator (for benchmarks)
├── tests/ integration test modules
│ └── utils/preprocessor.rs DOM pre-processing pipeline
├── cli/ mdka-cli binary crate
│ └── src/main.rs Argument parsing + dispatch
├── node/ Node.js bindings (napi-rs v3)
├── python/ Python bindings (PyO3 v0)
├── benches/ criterion benchmarks
└── examples/ Allocation measurement tool
Conversion Pipeline
Each call to html_to_markdown_with follows these steps:
HTML string
│
▼
[1] Parse scraper::Html::parse_document()
│ → html5ever DOM tree (tolerant HTML5 parsing)
▼
[2] Pre-process preprocessor::preprocess(&doc, opts)
│ → filtered HTML string
│ Non-recursive DFS over ego-tree nodes
│ Drops: script, style, iframe, …
│ Filters attributes per ConversionOptions
│ Removes shell elements (if opted in)
│ Unwraps anonymous wrappers (if opted in)
▼
[3] Re-parse scraper::Html::parse_document(&cleaned)
│ → clean DOM for conversion
▼
[4] Convert traversal::traverse(&doc)
│ → Markdown string
│ Non-recursive DFS with Enter/Leave events
│ Drives MarkdownRenderer via event callbacks
▼
[5] Finalise renderer.finish()
→ trim leading/trailing whitespace
→ ensure single trailing newline
MarkdownRenderer
MarkdownRenderer is a state machine that maintains:
output: the accumulated Markdown stringlist_stack: tracks nested ordered/unordered listsblockquote_depth: counts blockquote nesting levelin_pre: whether inside a<pre>blockat_line_start: deferred prefix flag for blockquote>emissionnewlines_emitted: prevents double-blank-line accumulation
The at_line_start flag is key: rather than emitting > prefixes
immediately when entering a blockquote, the renderer defers them until
actual content is written. This ensures nested blockquotes emit the
correct number of > characters regardless of how many block elements
intervene.
Language Bindings
Both the Node.js and Python bindings are thin wrappers:
- Node.js (napi-rs): exposes sync and async (
tokio::spawn_blocking) variants. The async variants release the Node.js event loop during conversion. - Python (PyO3): exposes
py.detach()on the batch functionhtml_to_markdown_many, releasing the GIL for rayon parallel conversion.
The binding crates (mdka-node, mdka-python) have no conversion logic
of their own — they call the same Rust functions as the library and CLI.