Architecture

Workspace Layout

mdka/
├── src/               mdka library crate (lib only)
│   ├── lib.rs             Public API surface
│   ├── options.rs         ConversionMode, ConversionOptions
│   ├── traversal.rs       Markdown conversion traversal
│   ├── renderer.rs        MarkdownRenderer state machine
│   ├── utils.rs           Whitespace normalisation + escaping
│   └── alloc_counter.rs   Custom allocator (for benchmarks)
├── tests/             integration test modules
│   └── utils/preprocessor.rs    DOM pre-processing pipeline
├── cli/               mdka-cli binary crate
│   └── src/main.rs        Argument parsing + dispatch
├── node/              Node.js bindings (napi-rs v3)
├── python/            Python bindings (PyO3 v0)
├── benches/           criterion benchmarks
└── examples/          Allocation measurement tool

Conversion Pipeline

Each call to html_to_markdown_with follows these steps:

HTML string
    │
    ▼
[1] Parse          scraper::Html::parse_document()
    │               → html5ever DOM tree (tolerant HTML5 parsing)
    ▼
[2] Pre-process    preprocessor::preprocess(&doc, opts)
    │               → filtered HTML string
    │               Non-recursive DFS over ego-tree nodes
    │               Drops: script, style, iframe, …
    │               Filters attributes per ConversionOptions
    │               Removes shell elements (if opted in)
    │               Unwraps anonymous wrappers (if opted in)
    ▼
[3] Re-parse       scraper::Html::parse_document(&cleaned)
    │               → clean DOM for conversion
    ▼
[4] Convert        traversal::traverse(&doc)
    │               → Markdown string
    │               Non-recursive DFS with Enter/Leave events
    │               Drives MarkdownRenderer via event callbacks
    ▼
[5] Finalise       renderer.finish()
                    → trim leading/trailing whitespace
                    → ensure single trailing newline
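Steps [4] and [5] can be illustrated with a small sketch. It is not mdka's actual code: the Node and Event types, the render function, and the tag handling are hypothetical, and a hand-built tree stands in for the scraper DOM. It only shows how an explicit stack yields Enter/Leave events without recursion, and how a finish step trims the result and appends one newline.

```rust
// Hypothetical miniature DOM; the real pipeline walks a scraper / ego-tree document.
enum Node {
    Element { tag: &'static str, children: Vec<Node> },
    Text(&'static str),
}

// Work items kept on the explicit stack instead of the call stack.
enum Event<'a> {
    Enter(&'a Node),
    Leave(&'static str),
}

// Non-recursive DFS: every element produces an Enter and a matching Leave.
fn render(root: &Node) -> String {
    let mut out = String::new();
    let mut stack = vec![Event::Enter(root)];
    while let Some(event) = stack.pop() {
        match event {
            Event::Enter(Node::Text(text)) => out.push_str(text),
            Event::Enter(Node::Element { tag, children }) => {
                match *tag {
                    "em" => out.push('*'),
                    "strong" => out.push_str("**"),
                    _ => {}
                }
                // Schedule Leave first so it fires after all children.
                stack.push(Event::Leave(*tag));
                // Push children in reverse so they pop in document order.
                for child in children.iter().rev() {
                    stack.push(Event::Enter(child));
                }
            }
            Event::Leave(tag) => match tag {
                "em" => out.push('*'),
                "strong" => out.push_str("**"),
                "p" => out.push_str("\n\n"),
                _ => {}
            },
        }
    }
    finish(&out)
}

// Finalise: trim surrounding whitespace, then end with exactly one newline.
fn finish(raw: &str) -> String {
    let mut result = raw.trim().to_string();
    result.push('\n');
    result
}
```

Because the pending Leave events live on a heap-allocated stack, deeply nested input cannot overflow the call stack, which is the usual motivation for a non-recursive traversal.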

MarkdownRenderer

MarkdownRenderer is a state machine that maintains:

  • output: the accumulated Markdown string
  • list_stack: tracks nested ordered/unordered lists
  • blockquote_depth: counts blockquote nesting level
  • in_pre: whether inside a <pre> block
  • at_line_start: flag that defers emission of the blockquote > prefix until content is written
  • newlines_emitted: tracks newlines already written, so blank lines do not accumulate

The at_line_start flag is key: rather than emitting > prefixes immediately when entering a blockquote, the renderer defers them until actual content is written. This ensures nested blockquotes emit the correct number of > characters regardless of how many block elements intervene.
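The deferred-prefix idea can be sketched as follows. MiniRenderer is a hypothetical, reduced stand-in that borrows three field names from the list above; its methods (enter_blockquote, leave_blockquote, write, newline) are assumptions made for illustration and need not match the real MarkdownRenderer.

```rust
// Hypothetical, reduced renderer: only the fields needed for the deferred-prefix idea.
struct MiniRenderer {
    output: String,
    blockquote_depth: usize,
    at_line_start: bool,
}

impl MiniRenderer {
    fn new() -> Self {
        Self { output: String::new(), blockquote_depth: 0, at_line_start: true }
    }

    // Entering a blockquote only changes the depth; nothing is written yet.
    fn enter_blockquote(&mut self) {
        self.blockquote_depth += 1;
    }

    fn leave_blockquote(&mut self) {
        self.blockquote_depth = self.blockquote_depth.saturating_sub(1);
    }

    // Prefixes are emitted here, at the moment real content reaches a fresh line.
    fn write(&mut self, text: &str) {
        if text.is_empty() {
            return;
        }
        if self.at_line_start {
            for _ in 0..self.blockquote_depth {
                self.output.push_str("> ");
            }
            self.at_line_start = false;
        }
        self.output.push_str(text);
    }

    // Ending a line re-arms the flag so the next write emits prefixes again.
    fn newline(&mut self) {
        self.output.push('\n');
        self.at_line_start = true;
    }
}
```

Opening two blockquotes back to back and then writing "inner" produces a line starting with "> > ", because the depth is read only when content arrives. After one leave_blockquote, the next line starts with a single "> ".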

Language Bindings

Both the Node.js and Python bindings are thin wrappers:

  • Node.js (napi-rs): exposes sync and async variants. The async variants run the conversion via tokio::task::spawn_blocking, so the Node.js event loop is not blocked while it runs.
  • Python (PyO3): the batch function html_to_markdown_many calls py.detach() to release the GIL while rayon converts the inputs in parallel.

The binding crates (mdka-node, mdka-python) have no conversion logic of their own; they call the same Rust functions that the library exposes and the CLI uses.
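The fan-out behind a batch function can be sketched with the standard library alone. This is an illustration under stated assumptions, not the bindings' code: std::thread::scope stands in for rayon and spawns one thread per input rather than using a pool, convert is a placeholder for the library's conversion function, and convert_many is a hypothetical name.

```rust
use std::thread;

// Placeholder for the library's conversion function; the real one parses HTML.
fn convert(html: &str) -> String {
    html.replace("<p>", "").replace("</p>", "\n")
}

// Hypothetical batch helper: converts every input on its own scoped thread
// and returns the results in input order.
fn convert_many(inputs: &[&str]) -> Vec<String> {
    thread::scope(|scope| {
        let handles: Vec<_> = inputs
            .iter()
            .map(|html| scope.spawn(move || convert(html)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

The wrapper adds only scheduling. Each worker calls the same conversion function a single-document call would use, which is what keeps a binding thin.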