mdka

mdka is an HTML-to-Markdown converter written in Rust. “ka” stands for “化 (か)”, which denotes conversion.

It aims to strike a practical balance between conversion quality and runtime efficiency — readable output from real-world HTML, without sacrificing speed or memory.

At a Glance

| What you give it | What you get back |
|---|---|
| Any HTML string: a full page, a snippet, CMS output, SPA-rendered DOM | Clean, readable Markdown |
| A list of HTML files | Parallel Markdown output via rayon |
| A conversion mode (minimal, semantic, …) | Pre-processed output tuned for your use case |

Key Properties

  • Parser foundation: scraper, which is built on html5ever — the same battle-tested parser used by the Servo browser engine. It handles malformed, deeply-nested, and real-world HTML gracefully.
  • Crash-resistant: a non-recursive DFS traversal means even 10,000 levels of nesting will not overflow the stack.
  • Configurable: five conversion modes let you tune the pre-processing pipeline — from noise-free LLM input to lossless archiving.
  • Multi-language: available as a Rust library, a Node.js package (napi-rs), and a Python package (PyO3).

When to Choose mdka

mdka is a good fit if you need:

  • Stable, predictable output from diverse HTML sources (CMS, SPA, scraped pages)
  • Mode-based pre-processing to strip navigation, preserve ARIA, or retain attributes
  • Memory efficiency at scale (bulk file conversion, streaming pipelines)
  • Multi-language access from a single underlying Rust implementation

If raw speed on simple, well-formed HTML is your only concern, a streaming rewriter that emits output without building a DOM will likely be faster.

Installation

As a Rust Library

Add mdka to your Cargo.toml:

[dependencies]
mdka = "2"

That is the only step. mdka has no system dependencies.

Minimum Supported Rust Version: 1.85 (2024 Edition)

As a CLI Binary

Build from source using the mdka-cli crate in the workspace:

git clone https://github.com/example/mdka
cd mdka
cargo build --release -p mdka-cli
# Binary: ./target/release/mdka

Or install directly with cargo:

cargo install mdka-cli

As a Node.js Package

npm install mdka
# or
yarn add mdka

Requires Node.js 16 or later.
Pre-built binaries are bundled for major platforms (Linux, macOS, and Windows) on specific architectures.
On other platforms, run npm run build with Rust installed.

As a Python Package

pip install mdka

Requires Python 3.8 or later. Pre-built wheels are provided for CPython on major platforms. To build from source: pip install mdka --no-binary mdka with Rust installed.

Usage & Examples

Choose the section for your environment:

  • Rust — integrate directly into a Rust project
  • Node.js — use from JavaScript or TypeScript
  • Python — use from Python
  • CLI — use from the command line

All four share the same underlying conversion engine, so results are consistent across languages.

Usage — Rust

Basic Conversion

use mdka::html_to_markdown;

fn main() {
    let html = r#"
        <h1>Getting Started</h1>
        <p>mdka converts <strong>HTML</strong> to <em>Markdown</em>.</p>
        <ul>
            <li>Fast</li>
            <li>Configurable</li>
            <li>Crash-resistant</li>
        </ul>
    "#;

    let md = html_to_markdown(html);
    println!("{md}");
}

Output:

# Getting Started

mdka converts **HTML** to *Markdown*.

- Fast
- Configurable
- Crash-resistant

Conversion with Options

Use html_to_markdown_with to control the conversion pipeline via ConversionOptions.

#![allow(unused)]
fn main() {
use mdka::{html_to_markdown_with};
use mdka::options::{ConversionMode, ConversionOptions};

// Strip navigation and extract body text — good for LLM input
let mut opts = ConversionOptions::for_mode(ConversionMode::Minimal);
opts.drop_interactive_shell = true;

let html = r#"
    <header><nav><a href="/">Home</a></nav></header>
    <main>
        <article>
            <h1>Article Title</h1>
            <p>The main content of the page.</p>
        </article>
    </main>
    <footer>Copyright 2025</footer>
"#;

let md = html_to_markdown_with(html, &opts);
assert!(md.contains("# Article Title"));
assert!(!md.contains("Home"));       // nav removed
assert!(!md.contains("Copyright"));  // footer removed
}

Converting a Single File

use mdka::html_file_to_markdown;

// main returns a Result so that `?` can propagate MdkaError
fn main() -> Result<(), mdka::MdkaError> {
    // Output goes to the same directory as the input: page.html → page.md
    let result = html_file_to_markdown("page.html", None::<&str>)?;
    println!("{} → {}", result.src.display(), result.dest.display());

    // Output goes to a specific directory
    let result = html_file_to_markdown("page.html", Some("out/"))?;
    println!("{} → {}", result.src.display(), result.dest.display());
    Ok(())
}

Bulk Parallel Conversion

use mdka::html_files_to_markdown;
use std::path::Path;

// main returns a Result so that `?` can propagate the io::Error from create_dir_all
fn main() -> Result<(), mdka::MdkaError> {
    let files = vec!["a.html", "b.html", "c.html"];
    let out_dir = Path::new("out/");
    // The bulk functions expect the output directory to exist already
    std::fs::create_dir_all(out_dir)?;

    for (src, result) in html_files_to_markdown(&files, out_dir) {
        match result {
            Ok(dest) => println!("{} → {}", src, dest.display()),
            Err(e)   => eprintln!("Error: {src}: {e}"),
        }
    }
    Ok(())
}

Conversion runs in parallel using rayon. The number of threads defaults to the number of logical CPU cores.
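For intuition about what order-preserving parallel conversion involves, here is a minimal std-only sketch. It is not mdka's implementation (mdka uses rayon); `convert_all` and its `convert` callback are hypothetical names, and plain scoped threads stand in for rayon's thread pool.

```rust
use std::thread;

/// Apply `convert` to every input on a small set of scoped threads and
/// return the results in the same order as `inputs`.
/// Hypothetical helper for illustration; not part of mdka.
fn convert_all<F>(inputs: &[&str], convert: F) -> Vec<String>
where
    F: Fn(&str) -> String + Sync,
{
    // One worker per logical core, falling back to a single worker.
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // Chunk size must be at least 1, because `chunks(0)` panics.
    let chunk = inputs.len().div_ceil(workers).max(1);
    let convert = &convert;

    thread::scope(|s| {
        // Spawn every worker first, then join them in spawn order.
        let handles: Vec<_> = inputs
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().map(|&h| convert(h)).collect::<Vec<_>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

Joining the handles in spawn order is what keeps the output aligned with the input slice, which mirrors the ordering the bulk functions are documented to return.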

Bulk Conversion with Options

#![allow(unused)]
fn main() {
use mdka::{html_files_to_markdown_with};
use mdka::options::{ConversionMode, ConversionOptions};
use std::path::Path;

let opts = ConversionOptions::for_mode(ConversionMode::Semantic);
let files = vec!["a.html", "b.html"];
let results = html_files_to_markdown_with(&files, Path::new("out/"), &opts);
}

Conversion Modes at a Glance

| Mode | Best for |
|---|---|
| Balanced | General use; default |
| Strict | Debugging, diff comparison |
| Minimal | LLM pre-processing, compression |
| Semantic | SPA content, accessibility-aware output |
| Preserve | Archiving, audit trails |

See Conversion Modes for full details.

Error Handling

#![allow(unused)]
fn main() {
use mdka::{html_file_to_markdown, MdkaError};

match html_file_to_markdown("missing.html", None::<&str>) {
    Ok(result) => println!("→ {}", result.dest.display()),
    Err(MdkaError::Io(e)) => eprintln!("IO error: {e}"),
}
}

MdkaError currently has one variant: Io, wrapping std::io::Error. html_to_markdown and html_to_markdown_with are infallible — they always return a String and never panic on any input, no matter how malformed.

Usage — Node.js

Installation

npm install mdka

Basic Conversion

const { htmlToMarkdown } = require('mdka')

const html = `
  <h1>Hello</h1>
  <p>mdka converts <strong>HTML</strong> to <em>Markdown</em>.</p>
`
const md = htmlToMarkdown(html)
console.log(md)
// # Hello
//
// mdka converts **HTML** to *Markdown*.

Async Conversion

htmlToMarkdownAsync offloads work to a Rust thread pool, keeping the Node.js event loop free:

const { htmlToMarkdownAsync } = require('mdka')

const md = await htmlToMarkdownAsync(html)

// Concurrent conversion of many pages
const results = await Promise.all(pages.map(p => htmlToMarkdownAsync(p.html)))

Conversion with Options

const { htmlToMarkdownWith, htmlToMarkdownWithAsync } = require('mdka')

// Strip nav/header/footer; useful for content extraction
const md = htmlToMarkdownWith(html, {
  mode: 'minimal',
  dropInteractiveShell: true,
})

// Async version (call from inside an async function)
const mdAsync = await htmlToMarkdownWithAsync(html, { mode: 'semantic' })

Available mode strings: "balanced" (default), "strict", "minimal", "semantic", "preserve".

Single File Conversion

const { htmlFileToMarkdown, htmlFileToMarkdownWith } = require('mdka')

// Output to same directory: page.html → page.md
const alongside = await htmlFileToMarkdown('page.html')
console.log(`${alongside.src} → ${alongside.dest}`)

// Output to specific directory
const inOutDir = await htmlFileToMarkdown('page.html', 'out/')

// With options
const withOptions = await htmlFileToMarkdownWith('page.html', 'out/', {
  mode: 'minimal',
  dropInteractiveShell: true,
})

Bulk Parallel Conversion

const { htmlFilesToMarkdown, htmlFilesToMarkdownWith } = require('mdka')

const files = ['a.html', 'b.html', 'c.html']
const results = await htmlFilesToMarkdown(files, 'out/')

for (const r of results) {
  if (r.error) console.error(`${r.src}: ${r.error}`)
  else         console.log(`${r.src} → ${r.dest}`)
}

// With options
const resultsWithOptions = await htmlFilesToMarkdownWith(files, 'out/', {
  mode: 'semantic',
  preserveAriaAttrs: true,
})

TypeScript

Type definitions are bundled. No @types/ package is needed:

import {
  htmlToMarkdown,
  htmlToMarkdownWith,
  htmlToMarkdownAsync,
  htmlFileToMarkdown,
  htmlFilesToMarkdown,
  ConversionOptions,
  ConvertResult,
} from 'mdka'

const opts: ConversionOptions = {
  mode: 'minimal',
  dropInteractiveShell: true,
}
const md: string = htmlToMarkdownWith(html, opts)

Usage — Python

Installation

pip install mdka

Basic Conversion

import mdka

html = """
<h1>Hello</h1>
<p>mdka converts <strong>HTML</strong> to <em>Markdown</em>.</p>
"""

md = mdka.html_to_markdown(html)
print(md)
# # Hello
#
# mdka converts **HTML** to *Markdown*.

Conversion with Options

import mdka

# Strip nav/header/footer — useful for LLM pre-processing
md = mdka.html_to_markdown_with(
    html,
    mode=mdka.ConversionMode.Minimal,
    drop_interactive_shell=True,
)

# Preserve ARIA attributes for accessibility-aware output
md = mdka.html_to_markdown_with(
    html,
    mode=mdka.ConversionMode.Semantic,
    preserve_aria_attrs=True,
)

Available modes: ConversionMode.Balanced (default), Strict, Minimal, Semantic, Preserve.

Parallel Batch Conversion (GIL released)

html_to_markdown_many releases the GIL and uses rayon for parallel conversion:

import mdka

pages = ["<h1>A</h1>", "<p>B</p>", "<ul><li>C</li></ul>"]
results = mdka.html_to_markdown_many(pages)
# ['# A\n', 'B\n', '- C\n']

This is faster than calling html_to_markdown in a Python loop for large batches.

Single File Conversion

import mdka

# Output to same directory: page.html → page.md
result = mdka.html_file_to_markdown("page.html")
print(f"{result.src} → {result.dest}")

# Output to a specific directory
result = mdka.html_file_to_markdown("page.html", "out/")

# With options
result = mdka.html_file_to_markdown(
    "page.html",
    "out/",
    mode=mdka.ConversionMode.Minimal,
    drop_interactive_shell=True,
)

Bulk File Conversion

import mdka

files = ["a.html", "b.html", "c.html"]
results = mdka.html_files_to_markdown(files, "out/")

for r in results:
    if r.ok:
        print(f"{r.src} → {r.dest}")
    else:
        print(f"Error: {r.src}: {r.error}")

Error Handling

import mdka

try:
    result = mdka.html_file_to_markdown("missing.html")
except mdka.MdkaError as e:
    print(f"Conversion failed: {e}")

MdkaError is raised for IO errors (file not found, permission denied, etc.). html_to_markdown and html_to_markdown_with are always safe to call — they never raise exceptions regardless of input quality.

Type Annotations

mdka ships with a py.typed marker (PEP 561). All public symbols are annotated:

from mdka import (
    html_to_markdown,          # (html: str) -> str
    html_to_markdown_with,     # (html: str, mode=..., **flags) -> str
    html_to_markdown_many,     # (html_list: list[str]) -> list[str]
    html_file_to_markdown,     # (path, out_dir=None, ...) -> ConvertResult
    html_files_to_markdown,    # (paths, out_dir, ...) -> list[BulkConvertResult]
    ConversionMode,            # enum
    ConvertResult,             # dataclass: src, dest (str)
    BulkConvertResult,         # dataclass: src, dest?, error?, ok
    MdkaError,                 # exception
)

Usage — CLI

The mdka command-line tool is provided by the mdka-cli crate.

Quick Reference

mdka [OPTIONS] [FILE...]

Run mdka --help to see the full option list with descriptions.

Common Patterns

Convert from stdin:

echo '<h1>Hello</h1>' | mdka
curl https://example.com | mdka

Convert a single file (output goes to the same directory):

mdka page.html          # → page.md

Convert to a specific directory:

mdka -o out/ page.html  # → out/page.md

Bulk conversion (-o is required for multiple files):

mdka -o out/ docs/*.html

Choose a conversion mode:

mdka --mode minimal --drop-shell page.html   # extract body text
mdka --mode preserve -o archive/ *.html      # maximum fidelity

All Options

| Flag | Description |
|---|---|
| -o, --output <DIR> | Output directory (default: same as input) |
| -m, --mode <MODE> | balanced · strict · minimal · semantic · preserve |
| --preserve-ids | Keep id attributes |
| --preserve-classes | Keep class attributes |
| --preserve-data | Keep data-* attributes |
| --preserve-aria | Keep aria-* attributes |
| --drop-shell | Remove nav, header, footer, aside |
| -h, --help | Show help |

For full mode descriptions see Conversion Modes.

API Reference

mdka exposes a small, focused public API. The table below shows the complete surface — every function and type you need, nothing you don’t.

Functions

| Function | Language | Description |
|---|---|---|
| html_to_markdown | Rust | Convert HTML string → Markdown (default mode) |
| html_to_markdown_with | Rust | Convert with explicit ConversionOptions |
| html_file_to_markdown | Rust | Convert one file; output alongside input or to out_dir |
| html_file_to_markdown_with | Rust | Single file with options |
| html_files_to_markdown | Rust | Parallel bulk conversion (rayon) |
| html_files_to_markdown_with | Rust | Bulk with options |

Types

| Type | Description |
|---|---|
| ConversionMode | Enum: Balanced · Strict · Minimal · Semantic · Preserve |
| ConversionOptions | Controls pre-processing per-call; built via for_mode() |
| ConvertResult | Returned by single-file functions: src + dest paths |
| MdkaError | The only error type: wraps std::io::Error |

Guarantees

  • html_to_markdown and html_to_markdown_with never panic. They accept any &str, including empty strings, binary garbage, or deeply nested HTML.
  • File functions propagate IO errors via Result<_, MdkaError>.
  • Output is always valid UTF-8.
  • Output always ends with a single newline when the input produces any content.

Core Functions

html_to_markdown

pub fn html_to_markdown(html: &str) -> String

Converts an HTML string to Markdown using the default Balanced mode.

Input: Any valid or malformed HTML string. Empty strings are accepted.
Output: A Markdown string. Always ends with \n if the input produced any content.
Errors: None — this function is infallible.

#![allow(unused)]
fn main() {
let md = mdka::html_to_markdown("<h1>Hello</h1>");
assert_eq!(md, "# Hello\n");
}

html_to_markdown_with

pub fn html_to_markdown_with(html: &str, opts: &ConversionOptions) -> String

Same as html_to_markdown, but accepts a ConversionOptions value that controls pre-processing and conversion behaviour.

Input: Any HTML string + a ConversionOptions value.
Output: Markdown string.
Errors: None.

#![allow(unused)]
fn main() {
use mdka::options::{ConversionMode, ConversionOptions};

let mut opts = ConversionOptions::for_mode(ConversionMode::Minimal);
opts.drop_interactive_shell = true;
let md = mdka::html_to_markdown_with(html, &opts);
}

html_file_to_markdown

pub fn html_file_to_markdown(
    path: impl AsRef<Path>,
    out_dir: Option<impl AsRef<Path>>,
) -> Result<ConvertResult, MdkaError>

Reads one HTML file, converts it, and writes a .md file.

path: Path to the input .html file.
out_dir:

  • None → the .md file is written alongside the input (same directory, stem unchanged).
  • Some(dir) → the .md file is written into dir. The directory is created automatically if it does not exist.

Returns: ConvertResult with the resolved src and dest paths.
Errors: MdkaError::Io if the file cannot be read or the output cannot be written.
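The destination rule described above (same stem, .md extension, optionally relocated into out_dir) can be sketched with std::path alone. This is an illustration under that reading of the rule, not mdka's code; `dest_path` is a hypothetical name.

```rust
use std::path::{Path, PathBuf};

/// Compute where the Markdown file for `src` would be written.
/// Hypothetical helper for illustration; not part of mdka.
fn dest_path(src: &Path, out_dir: Option<&Path>) -> PathBuf {
    // page.html -> page.md, still in the input's directory
    let alongside = src.with_extension("md");
    if let Some(dir) = out_dir {
        // Keep only the file name and place it inside `dir`
        if let Some(name) = alongside.file_name() {
            return dir.join(name);
        }
    }
    alongside
}
```

With this sketch, `docs/page.html` maps to `docs/page.md` when no directory is given and to `out/page.md` when `out/` is given.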

// main returns a Result so that `?` can propagate MdkaError
fn main() -> Result<(), mdka::MdkaError> {
    // page.html → page.md in the same folder
    let r = mdka::html_file_to_markdown("page.html", None::<&str>)?;
    println!("{} → {}", r.src.display(), r.dest.display());

    // page.html → out/page.md
    let r = mdka::html_file_to_markdown("page.html", Some("out/"))?;
    println!("{} → {}", r.src.display(), r.dest.display());
    Ok(())
}

html_file_to_markdown_with

pub fn html_file_to_markdown_with(
    path: impl AsRef<Path>,
    out_dir: Option<impl AsRef<Path>>,
    opts: &ConversionOptions,
) -> Result<ConvertResult, MdkaError>

Same as html_file_to_markdown, but applies the given ConversionOptions.


html_files_to_markdown

pub fn html_files_to_markdown<'a, P>(
    paths: &'a [P],
    out_dir: &Path,
) -> Vec<(&'a P, Result<PathBuf, MdkaError>)>
where
    P: AsRef<Path> + Sync,

Converts multiple HTML files in parallel using rayon.

paths: Slice of paths to input HTML files.
out_dir: Directory for all output .md files. Must exist before calling (unlike single-file variants which create it automatically).
Returns: A Vec of (input_path, Result<output_path, error>) pairs in the same order as paths. Each element represents the outcome for one file independently.

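use std::path::Path;

// main returns a Result so that `?` can propagate the io::Error from create_dir_all
fn main() -> Result<(), mdka::MdkaError> {
    let files = vec!["a.html", "b.html", "c.html"];
    // The output directory must exist before the bulk call
    std::fs::create_dir_all("out/")?;

    for (src, result) in mdka::html_files_to_markdown(&files, Path::new("out/")) {
        match result {
            Ok(dest) => println!("{} → {}", src, dest.display()),
            Err(e)   => eprintln!("{src}: {e}"),
        }
    }
    Ok(())
}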

html_files_to_markdown_with

pub fn html_files_to_markdown_with<'a, P>(
    paths: &'a [P],
    out_dir: &Path,
    opts: &ConversionOptions,
) -> Vec<(&'a P, Result<PathBuf, MdkaError>)>
where
    P: AsRef<Path> + Sync,

Same as html_files_to_markdown, but applies the given ConversionOptions to every file.


ConvertResult

pub struct ConvertResult {
    pub src:  PathBuf,
    pub dest: PathBuf,
}

Returned by the single-file functions. Both fields are absolute or relative paths depending on how path was passed in.

Note: The bulk functions (html_files_to_markdown*) return (&P, Result<PathBuf, MdkaError>) tuples rather than ConvertResult, because individual files within a batch may fail independently.

Conversion Modes

A conversion mode determines how mdka pre-processes the parsed DOM before converting to Markdown. Choose the mode that matches the origin and purpose of your HTML.

Overview

| Mode | Best for | Default? |
|---|---|---|
| Balanced | General use, blog posts, documentation pages | ✅ Yes |
| Strict | Debugging, comparing before/after, diff-friendly output | |
| Minimal | LLM pre-processing, text extraction, compression | |
| Semantic | SPA output, accessibility-aware pipelines, screen-reader content | |
| Preserve | Archiving, audit trails, round-trip fidelity | |

Balanced (default)

Goal: Produce clean, readable Markdown without losing meaningful content.

  • Removes decorative attributes: class, style, data-*
  • Keeps semantic attributes: href, src, alt, aria-*, lang, dir
  • Keeps id attributes (useful for anchor links)
  • Does not remove navigation or structural elements

Use when: You want good-looking output without extra configuration.

#![allow(unused)]
fn main() {
let md = mdka::html_to_markdown(html); // Balanced is the default
}

Strict

Goal: Preserve as much of the original HTML information as possible. Output may be noisier, but nothing is silently dropped.

  • Keeps class, data-*, id, aria-*, and most other attributes
  • Does not unwrap wrapper elements
  • Suitable for comparing two versions of a page, or for debugging unexpected output from other modes

#![allow(unused)]
fn main() {
use mdka::options::{ConversionMode, ConversionOptions};

let html = "<h1>Hello</h1>"; // any HTML input
let opts = ConversionOptions::for_mode(ConversionMode::Strict);
let md = mdka::html_to_markdown_with(html, &opts);
}

Minimal

Goal: Extract the body text and essential structure; discard everything else.

  • Removes all decorative attributes (class, style, data-*, aria-*)
  • Optionally removes shell elements (nav, header, footer, aside) when drop_interactive_shell is true
  • Unwraps generic wrappers (div, span, section, article) that add no meaning
  • Ideal for piping content into an LLM prompt or a search index

#![allow(unused)]
fn main() {
use mdka::options::{ConversionMode, ConversionOptions};

let html = "<h1>Hello</h1>"; // any HTML input
let mut opts = ConversionOptions::for_mode(ConversionMode::Minimal);
opts.drop_interactive_shell = true;
let md = mdka::html_to_markdown_with(html, &opts);
}

Semantic

Goal: Preserve document meaning and accessibility structure over visual appearance.

  • Strongly retains aria-* attributes
  • Retains lang and dir
  • Retains heading hierarchy, list structure, link targets, and image alt text
  • Removes purely visual attributes (class, style)
  • Unwraps anonymous wrappers
  • Good for SPA-rendered HTML where ARIA attributes carry structural meaning

#![allow(unused)]
fn main() {
use mdka::options::{ConversionMode, ConversionOptions};

let html = "<h1>Hello</h1>"; // any HTML input
let opts = ConversionOptions::for_mode(ConversionMode::Semantic);
let md = mdka::html_to_markdown_with(html, &opts);
}

Preserve

Goal: Maximum fidelity to the original HTML. Lose as little information as possible.

  • Retains all attributes, including class, data-*, aria-*, id, and unknowns
  • Retains HTML comments in the pre-processed output
  • Does not unwrap any elements
  • Intended for archiving or audit scenarios where the original structure matters

#![allow(unused)]
fn main() {
use mdka::options::{ConversionMode, ConversionOptions};

let html = "<h1>Hello</h1>"; // any HTML input
let opts = ConversionOptions::for_mode(ConversionMode::Preserve);
let md = mdka::html_to_markdown_with(html, &opts);
}

Choosing a Mode

Is reproducibility the goal?          → Preserve
Are you feeding content to an LLM?    → Minimal  (+drop_interactive_shell)
Is the source a SPA or ARIA-heavy?    → Semantic
Debugging unexpected output?           → Strict
Everything else                        → Balanced  (default)
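The decision list above can also be written as a small lookup. The sketch below uses local stand-in types (`Mode`, `Goal`) and a hypothetical `choose_mode` function rather than mdka's own API, purely to make the mapping explicit.

```rust
/// Local stand-in for mdka's ConversionMode (assumption: same five names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode { Balanced, Strict, Minimal, Semantic, Preserve }

/// Hypothetical description of what the caller is trying to achieve.
#[derive(Debug, Clone, Copy)]
enum Goal { Reproducibility, LlmInput, SpaOrAriaHeavy, Debugging, Other }

/// Returns the suggested mode plus whether drop_interactive_shell
/// should be switched on, following the decision list above.
fn choose_mode(goal: Goal) -> (Mode, bool) {
    match goal {
        Goal::Reproducibility => (Mode::Preserve, false),
        Goal::LlmInput => (Mode::Minimal, true),
        Goal::SpaOrAriaHeavy => (Mode::Semantic, false),
        Goal::Debugging => (Mode::Strict, false),
        Goal::Other => (Mode::Balanced, false),
    }
}
```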

ConversionOptions

pub struct ConversionOptions {
    pub mode: ConversionMode,

    // Attribute retention
    pub preserve_ids:             bool,
    pub preserve_classes:         bool,
    pub preserve_data_attrs:      bool,
    pub preserve_aria_attrs:      bool,
    pub preserve_unknown_attrs:   bool,

    // Pre-processing behaviour
    pub drop_presentation_attrs:  bool,
    pub drop_interactive_shell:   bool,
    pub unwrap_unknown_wrappers:  bool,
}

ConversionOptions controls every detail of the pre-processing pipeline. You rarely need to set individual fields — start with a mode and override only what differs from the default for that mode.

Creating Options

#![allow(unused)]
fn main() {
use mdka::options::{ConversionMode, ConversionOptions};

let opts = ConversionOptions::for_mode(ConversionMode::Minimal);
}

for_mode returns sensible defaults for the chosen mode. See the table below.

Modify fields after creation

#![allow(unused)]
fn main() {
let mut opts = ConversionOptions::for_mode(ConversionMode::Balanced);
opts.drop_interactive_shell = true; // also strip nav/header/footer/aside
opts.preserve_ids           = false; // don't keep id= attributes
opts.preserve_aria_attrs    = true;  // (already true in Balanced, shown for clarity)
}

Default

#![allow(unused)]
fn main() {
let opts = ConversionOptions::default(); // equivalent to for_mode(Balanced)
}

Field Defaults by Mode

FieldBalancedStrictMinimalSemanticPreserve
preserve_ids
preserve_classes
preserve_data_attrs
preserve_aria_attrs
preserve_unknown_attrs
drop_presentation_attrs
drop_interactive_shell
unwrap_unknown_wrappers

Field Reference

mode

The ConversionMode this options object was built from. Changing mode after construction does not re-apply mode defaults to the other fields — use for_mode() again instead.

preserve_ids

Whether to keep id="…" attributes in the pre-processed DOM. Useful when the output is rendered in a context that relies on anchor links (#section-name).

preserve_classes

Whether to keep class="…" attributes. Rarely useful in Markdown output, but can help when feeding the Markdown back into an HTML renderer that applies CSS.

preserve_data_attrs

Whether to keep data-* custom attributes. Mostly relevant for Strict and Preserve modes.

preserve_aria_attrs

Whether to keep aria-* accessibility attributes. Enabled by default in Balanced, Strict, Semantic, and Preserve. The attributes themselves do not appear in Markdown output, but they are used by the Semantic mode’s conversion logic.

preserve_unknown_attrs

Whether to keep attributes not otherwise handled (everything except href, src, alt, title, aria-*, data-*, id, class, style).

drop_presentation_attrs

Whether to remove style and other purely visual attributes during pre-processing. Enabled in Balanced, Minimal, and Semantic.

drop_interactive_shell

Whether to remove <nav>, <header>, <footer>, and <aside> elements and all their children. Useful for content extraction from full web pages. Disabled by default in all modes; opt in explicitly.

unwrap_unknown_wrappers

Whether to replace generic container elements (<div>, <span>, <section>, <article>, <main>) with their children when they carry no structural meaning. Enabled in Minimal and Semantic.
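To illustrate why changing mode after construction does not re-apply defaults, here is a minimal stand-in. `Mode` and `Options` are local, hypothetical types that model only the four fields whose per-mode defaults are stated in the field descriptions above; this is not mdka's source.

```rust
/// Local stand-in for mdka's ConversionMode.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode { Balanced, Strict, Minimal, Semantic, Preserve }

/// Reduced stand-in for ConversionOptions (four fields only).
#[derive(Debug, Clone, PartialEq)]
struct Options {
    mode: Mode,
    preserve_aria_attrs: bool,
    drop_presentation_attrs: bool,
    drop_interactive_shell: bool,
    unwrap_unknown_wrappers: bool,
}

impl Options {
    /// Defaults are computed once, at construction time.
    fn for_mode(mode: Mode) -> Self {
        use Mode::*;
        Options {
            mode,
            // on in Balanced, Strict, Semantic, Preserve
            preserve_aria_attrs: !matches!(mode, Minimal),
            // on in Balanced, Minimal, Semantic
            drop_presentation_attrs: matches!(mode, Balanced | Minimal | Semantic),
            // off in every mode; callers opt in explicitly
            drop_interactive_shell: false,
            // on in Minimal and Semantic
            unwrap_unknown_wrappers: matches!(mode, Minimal | Semantic),
        }
    }
}
```

Because the flags are plain fields, assigning `opts.mode = Mode::Minimal` on an `Options` built for `Balanced` leaves `unwrap_unknown_wrappers` at `false`; only a fresh `Options::for_mode(Mode::Minimal)` yields the Minimal defaults.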

Error Handling

MdkaError

use thiserror::Error; // provides the Error derive and the #[error]/#[from] attributes

#[derive(Error, Debug)]
pub enum MdkaError {
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
}

MdkaError is the only error type in mdka. It has one variant, Io, which wraps a std::io::Error.

IO errors arise from the file-based functions when:

  • the input file does not exist or is not readable
  • the output directory cannot be created
  • the output file cannot be written

Infallible Functions

html_to_markdown and html_to_markdown_with never fail. They accept any string and return a String. Malformed HTML, empty input, binary-looking content, deeply nested structures — none of these cause a panic or an error.

Pattern Matching

#![allow(unused)]
fn main() {
use mdka::{html_file_to_markdown, MdkaError};

match html_file_to_markdown("page.html", None::<&str>) {
    Ok(result)            => println!("→ {}", result.dest.display()),
    Err(MdkaError::Io(e)) => eprintln!("IO error: {e}"),
}
}

Because there is only one variant today, you can also use ? directly:

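// `?` needs an enclosing function that returns a Result
fn main() -> Result<(), mdka::MdkaError> {
    let result = mdka::html_file_to_markdown("page.html", None::<&str>)?;
    println!("→ {}", result.dest.display());
    Ok(())
}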

Bulk Conversion Errors

In html_files_to_markdown, each file fails independently. A failed file does not abort the rest of the batch:

use std::path::Path;

fn main() {
    let files = ["a.html", "b.html"];
    // Each tuple carries the outcome for one input file
    for (src, result) in mdka::html_files_to_markdown(&files, Path::new("out/")) {
        if let Err(e) = result {
            eprintln!("skipped {}: {e}", src);
        }
    }
}
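When a batch needs a summary rather than per-file logging, the result pairs can be split into successes and failures. The sketch below is std-only and uses std::io::Error as a stand-in for MdkaError so that it compiles without the crate; `split_results` is a hypothetical helper name.

```rust
use std::io;
use std::path::PathBuf;

/// Separate (source, result) pairs into written paths and failures.
/// std::io::Error stands in for MdkaError in this sketch.
fn split_results<'a>(
    results: Vec<(&'a str, Result<PathBuf, io::Error>)>,
) -> (Vec<PathBuf>, Vec<(&'a str, String)>) {
    let mut written = Vec::new();
    let mut failed = Vec::new();
    for (src, result) in results {
        match result {
            Ok(dest) => written.push(dest),
            // Keep the source path next to its error message
            Err(e) => failed.push((src, e.to_string())),
        }
    }
    (written, failed)
}
```

A caller could then report `failed.len()` at the end of the run, or retry only the failed inputs.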

Supported HTML Elements

The table below shows every HTML element that mdka recognises and what Markdown it produces. Elements not listed are either removed silently (script, style, etc.) or dropped while their children are kept as plain text.

Block Elements

| HTML | Markdown output | Notes |
|---|---|---|
| <h1> to <h6> | # to ###### | ATX-style headings |
| <p> | Paragraph (blank lines around) | |
| <blockquote> | > prefix | Nesting produces > > , > > > , … |
| <pre><code> | Fenced code block (```) | Preserves whitespace and newlines |
| <ul> | - list | Nested lists indented by 2 spaces |
| <ol> | 1. list | Respects start attribute |
| <li> | List item | |
| <hr> | --- | |
| <div>, <article>, <section>, <main>, <figure>, <figcaption> | Block separator | Act as paragraph breaks; unwrapped in Minimal/Semantic |
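The list and blockquote rows above can be made concrete with a few formatting helpers. This is a sketch of the output shapes only, assuming that an ordered list numbers its items upward from the start value; the function names are hypothetical and this is not mdka's renderer.

```rust
/// One unordered item at nesting `depth` (0 = top level), 2 spaces per level.
fn bullet_line(depth: usize, text: &str) -> String {
    format!("{}- {}", "  ".repeat(depth), text)
}

/// Ordered items numbered upward from `start` (assumed reading of the
/// `start` attribute).
fn ordered_lines(start: u32, items: &[&str]) -> Vec<String> {
    items
        .iter()
        .enumerate()
        .map(|(i, text)| format!("{}. {}", start + i as u32, text))
        .collect()
}

/// One blockquote line at nesting `depth` (1 = a single "> " prefix).
fn quote_line(depth: usize, text: &str) -> String {
    format!("{}{}", "> ".repeat(depth), text)
}
```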

Inline Elements

| HTML | Markdown output | Notes |
|---|---|---|
| <strong>, <b> | **text** | |
| <em>, <i> | *text* | |
| <code> (inline) | `text` | Only when not inside <pre> |
| <a href="…"> | [text](url) | title attribute → [text](url "title") |
| <img src="…" alt="…"> | ![alt](src) | title attribute → ![alt](src "title") |
| <br> | "  \n" | Two trailing spaces followed by a newline |
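The link and image rows differ only in the leading `!` and share the optional title handling. A minimal sketch of that shape follows; `link` and `image` are hypothetical helpers, and titles containing double quotes are deliberately not handled here.

```rust
/// Format an inline link, adding the title in quotes when present.
fn link(text: &str, url: &str, title: Option<&str>) -> String {
    match title {
        Some(t) => format!("[{text}]({url} \"{t}\")"),
        None => format!("[{text}]({url})"),
    }
}

/// An image is a link with a leading `!`, using alt text as the label.
fn image(alt: &str, src: &str, title: Option<&str>) -> String {
    format!("!{}", link(alt, src, title))
}
```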

Code Blocks and Language Hints

When a <code> element has a class containing language-<name>, the language name is included in the fenced block:

<pre><code class="language-rust">fn main() {}</code></pre>

Produces:

```rust
fn main() {}
```

The language-* class is preserved in all conversion modes, including Balanced which otherwise strips class attributes.
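One way to picture the language-hint rule is a lookup over the class attribute followed by fence assembly. The sketch below is an assumption about the shape of that logic, not mdka's code; `language_hint` and `fenced_block` are hypothetical names.

```rust
/// Find the first `language-<name>` token in a class attribute value.
fn language_hint(class_attr: &str) -> Option<&str> {
    class_attr
        .split_ascii_whitespace()
        .find_map(|token| token.strip_prefix("language-"))
        .filter(|name| !name.is_empty())
}

/// Wrap `code` in a fenced block, adding the language when one is found.
fn fenced_block(code: &str, class_attr: &str) -> String {
    // Build the fence from single backticks to keep this listing readable.
    let fence = "`".repeat(3);
    let lang = language_hint(class_attr).unwrap_or("");
    let body = code.trim_end_matches('\n');
    format!("{fence}{lang}\n{body}\n{fence}\n")
}
```

A class value such as `hljs language-rust` yields the hint `rust`, while a value with no `language-` token produces a fence without a language tag.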

Always-Removed Elements

These elements and all their descendants are removed unconditionally, regardless of conversion mode:

<script> · <style> · <meta> · <link> · <template> · <iframe> · <object> · <embed> · <noscript>

HTML comments are removed in all modes except Preserve, where they are retained as <!-- … --> in the pre-processed DOM (though they do not appear in Markdown output).

Shell Elements

<nav>, <header>, <footer>, and <aside> are kept by default in every mode. Set drop_interactive_shell = true to remove them; this is commonly combined with ConversionMode::Minimal.
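The two removal lists in this section can be combined into a single predicate. The sketch below restates them as constants; `should_drop` is a hypothetical helper and not part of mdka's API.

```rust
/// Elements removed in every mode, as listed above.
const ALWAYS_REMOVED: [&str; 9] = [
    "script", "style", "meta", "link", "template", "iframe", "object", "embed", "noscript",
];

/// Shell elements, removed only when the caller opts in.
const SHELL: [&str; 4] = ["nav", "header", "footer", "aside"];

/// Case-insensitive membership test for a tag name.
fn is_in(list: &[&str], tag: &str) -> bool {
    list.iter().any(|name| name.eq_ignore_ascii_case(tag))
}

/// Decide whether an element (and its subtree) should be dropped.
fn should_drop(tag: &str, drop_interactive_shell: bool) -> bool {
    is_in(&ALWAYS_REMOVED, tag) || (drop_interactive_shell && is_in(&SHELL, tag))
}
```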

Text Processing Rules

mdka applies a small set of deterministic rules to produce consistent, readable Markdown from any HTML text content.

Whitespace Normalisation

HTML text nodes are normalised according to the HTML whitespace collapsing rules:

  • Leading and trailing whitespace is trimmed from block-level context.
  • Consecutive whitespace characters (spaces, tabs, newlines) within a text node are collapsed to a single space.
  • A single space is preserved between adjacent inline elements.
  • <br> produces a hard line break (two trailing spaces followed by \n).
  • <pre> blocks are exempt — whitespace inside <pre> is reproduced exactly.

This is done in a single pass without regular expressions, which keeps allocation overhead low.
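A single-pass collapse without regular expressions can look like the sketch below. It covers only the trimming and collapsing rules for one text run; <br> and <pre> handling are out of scope, and `collapse_whitespace` is a hypothetical name rather than mdka's internal function.

```rust
/// Collapse runs of ASCII whitespace to one space and trim both ends,
/// in one pass over the input.
fn collapse_whitespace(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut pending_space = false;
    for ch in text.chars() {
        if ch.is_ascii_whitespace() {
            // Remember the gap; emit it only if more text follows.
            pending_space = true;
        } else {
            // Skipping the space while `out` is empty trims the start.
            if pending_space && !out.is_empty() {
                out.push(' ');
            }
            pending_space = false;
            out.push(ch);
        }
    }
    // A trailing gap is never emitted, which trims the end.
    out
}
```

Deferring the space until the next non-whitespace character is what allows trimming and collapsing to happen in the same loop. A non-breaking space (U+00A0) is not ASCII whitespace, so it passes through unchanged.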

Markdown Character Escaping

To prevent accidental Markdown formatting, the following characters are escaped with a backslash when they appear in text content that is not inside a code span or code block:

| Character | Escaped as | Context |
|---|---|---|
| * | \* | Would create emphasis |
| _ | \_ | Would create emphasis |
| ` | \` | Would start a code span |
| # | \# | At the start of a line, would create a heading |
| [ | \[ | Would start a link |
| ! | \! | Before [, would start an image |
| \ | \\ | The escape character itself |

Escaping is context-aware: a # in the middle of a line is not escaped, only at the start of a line where it would be interpreted as an ATX heading.
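The escaping table can be turned into a small character loop. The sketch below is a simplified reading of the rules: it treats only column 0 as the start of a line and does not track code spans. `escape_markdown` is a hypothetical name, not mdka's function.

```rust
/// Backslash-escape Markdown-significant characters in plain text.
fn escape_markdown(text: &str) -> String {
    let mut out = String::with_capacity(text.len() + 8);
    let mut at_line_start = true;
    let mut chars = text.chars().peekable();
    while let Some(ch) = chars.next() {
        let needs_escape = match ch {
            '*' | '_' | '`' | '[' | '\\' => true,
            // Only a leading '#' could be read as an ATX heading.
            '#' => at_line_start,
            // '!' matters only when it would introduce an image.
            '!' => chars.peek() == Some(&'['),
            _ => false,
        };
        if needs_escape {
            out.push('\\');
        }
        out.push(ch);
        at_line_start = ch == '\n';
    }
    out
}
```

Under these rules a `#` inside `C#` is left alone, while a `#` that opens a line gains a backslash.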

HTML Entity Decoding

HTML entities in text nodes are decoded by the HTML parser (scraper / html5ever) before mdka processes them. The result is already Unicode text:

| HTML entity | After parsing | In Markdown |
|---|---|---|
| &amp; | & | & |
| &lt; | < | < |
| &gt; | > | > |
| &nbsp; | non-breaking space | preserved as space |

Output Boundaries

  • Output always ends with exactly one newline (\n) when the input produces any content; the output is empty for empty input.
  • Leading blank lines that scraper adds when wrapping content in <html><body> are trimmed before the final string is returned.
  • Block elements (paragraphs, headings, lists, etc.) are separated by blank lines.
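The first two boundary rules amount to a final clean-up step on the assembled string. A minimal sketch follows, assuming the clean-up runs once on the complete output; `finalize_output` is a hypothetical name.

```rust
/// Trim leading blank lines and trailing whitespace, then end the
/// result with exactly one newline. Empty content stays empty.
fn finalize_output(raw: &str) -> String {
    let trimmed = raw.trim_start_matches('\n').trim_end();
    if trimmed.is_empty() {
        return String::new();
    }
    let mut out = String::with_capacity(trimmed.len() + 1);
    out.push_str(trimmed);
    out.push('\n');
    out
}
```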

Design Philosophy

The Goal: Balance, not Dominance

There are excellent HTML-to-Markdown libraries in the Rust ecosystem — some prioritise raw speed, others maximise conversion fidelity. mdka is not trying to beat them on every axis.

Its aim is a practical balance:

Produce stable, readable Markdown from real-world HTML, with an easy API, without surprising the caller at runtime.

Speed and memory efficiency matter, and mdka is designed with both in mind. But they are means to an end, not the end itself.

Real-World HTML is Messy

Web content rarely arrives as clean, well-formed documents. In practice you encounter:

  • HTML that a CMS generated and no human ever wrote
  • SPA-rendered DOM fragments extracted from DevTools
  • Scraped pages with ad slots, cookie banners, and navigation wrapped around the content
  • Documents with 5,000 levels of nested <div> elements
  • Missing closing tags, duplicate attributes, and unknown elements

mdka uses scraper, which is built on html5ever — the same parser used by the Servo browser engine. It applies the HTML5 parsing algorithm, meaning: unknown elements are handled gracefully, missing tags are inferred, and the result is always a well-formed DOM tree, regardless of the input quality.

No Stack Overflows

A common failure mode in tree-processing code is stack overflow on deeply nested input. mdka uses an explicit Vec-based stack (non-recursive DFS) for every tree traversal — both in the pre-processing pipeline and in the Markdown conversion step. This means it handles any nesting depth that fits in heap memory.
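The general technique is shown below on a small arena-based tree. This is a sketch of non-recursive depth-first traversal, not mdka's traversal code; `Node`, `preorder_tags`, and `deep_chain` are hypothetical names.

```rust
/// A tree stored in a flat arena: children are indices into the slice.
struct Node {
    tag: &'static str,
    children: Vec<usize>,
}

/// Visit every node reachable from `root` in document (pre-order) order
/// using an explicit stack instead of recursion.
fn preorder_tags(nodes: &[Node], root: usize) -> Vec<&'static str> {
    let mut visited = Vec::with_capacity(nodes.len());
    let mut stack = vec![root];
    while let Some(index) = stack.pop() {
        let node = &nodes[index];
        visited.push(node.tag);
        // Push children in reverse so the first child is popped first.
        for &child in node.children.iter().rev() {
            stack.push(child);
        }
    }
    visited
}

/// Build a chain of `depth` nested elements: node i contains node i + 1.
fn deep_chain(depth: usize) -> Vec<Node> {
    (0..depth)
        .map(|i| Node {
            tag: "div",
            children: if i + 1 < depth { vec![i + 1] } else { vec![] },
        })
        .collect()
}
```

Because the pending work lives in a heap-allocated Vec, a chain of 100,000 nested nodes is traversed with constant call-stack depth.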

Configurable Pre-Processing

HTML from different sources needs different treatment. A page scraped from a news site has navigation, advertising, and footer content that a content extraction pipeline wants to remove. A document being archived for audit purposes should retain as much as possible.

The five conversion modes encode these intent differences as named, opinionated presets. They are applied in a pre-processing pass that filters the DOM before Markdown conversion runs — keeping the conversion logic itself simple and mode-agnostic.

One Allocator, Minimal Copies

The conversion pipeline is designed to minimise heap allocations:

  • Whitespace normalisation is done in a single pass, writing directly into the output String.
  • No regular expressions are used at runtime (avoiding compiled regex objects).
  • The output String is pre-allocated with a capacity estimate.
  • The #[global_allocator] counter in the CLI and benchmarks measures this directly.
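The first and third points can be illustrated together. The sketch below is an assumption about the general technique, not mdka's implementation; the function names `push_normalised` and `normalise` are hypothetical.

```rust
/// Collapses every run of ASCII whitespace into a single space, writing
/// straight into `out`. One pass over the input, no intermediate Strings,
/// no regular expressions.
fn push_normalised(out: &mut String, text: &str) {
    let mut pending_space = false;
    for ch in text.chars() {
        if ch.is_ascii_whitespace() {
            // Remember the run; decide later whether it needs a space.
            pending_space = true;
        } else {
            // Emit the deferred space only between words, never at the
            // very start of the output.
            if pending_space && !out.is_empty() {
                out.push(' ');
            }
            pending_space = false;
            out.push(ch);
        }
    }
    // Trailing whitespace is dropped: the pending space is never written.
}

/// Convenience wrapper that pre-allocates the output.
fn normalise(text: &str) -> String {
    // Normalisation never lengthens the text, so the input length is a
    // safe capacity estimate and the String does not need to grow.
    let mut out = String::with_capacity(text.len());
    push_normalised(&mut out, text);
    out
}
```

Writing into a caller-supplied `String` is the design choice that matters here: a converter can keep appending to one output buffer instead of allocating a new string per text node.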

Performance Characteristics

The Focus of mdka

The Rust ecosystem offers a variety of excellent HTML-to-Markdown converters. Many of these projects prioritize feature-richness, complex edge-case handling, or high extensibility.

mdka takes a different approach. Our mission is to provide a “minimalist, lightweight, and memory-efficient” converter, specifically optimized for resource-constrained environments or high-concurrency tasks where overhead must be kept to an absolute minimum.

The benchmarks presented here are not intended to rank libraries or declare a “winner.” Instead, they serve as internal metrics to verify whether mdka is successfully meeting its own design goals. We believe in choosing the right tool for the specific job, and we encourage developers to explore the diverse range of libraries available in the ecosystem to find the one that best fits their needs.

The Evolution: v1 to v2

With the release of v2, mdka underwent a complete architectural overhaul. We moved away from the original implementation to a ground-up rewrite focused on:

  • Stack-Safe Traversal: Implementing a non-recursive Depth-First Search (DFS) to prevent stack overflow even with deeply nested HTML.
  • Optimized Memory Allocation: Reducing unnecessary clones and leveraging Rust’s ownership model to minimize peak memory usage.
  • Streamlined Processing: Simplifying the conversion logic to achieve a predictable and lightweight execution path.

This rewrite resulted in a dramatic performance leap and a significantly reduced memory footprint compared to our previous version.

Benchmark Results (2026-04-15)

The following data demonstrates how the v2 architecture has improved our efficiency and how it aligns with our goal of “reasonable speed with minimal resource consumption.”

The figures below are wall-clock medians from Criterion. The log also records outliers for each run, so small differences should be read with some caution.

Conditions

All libraries were benchmarked under the same conditions:
Linux x86_64 6.19, Rust 1.94.1, Criterion 0.8, 28 logical cores, 3 s warm-up, and 3 s measurement.

Libraries Under Test

| Library | Version | HTML parser | Approach |
|---|---|---|---|
| mdka | 2.0.0 | scraper (html5ever) | Full DOM tree; non-recursive DFS |
| mdka_v1 | 1.6.9 | html5ever | Full DOM tree; older implementation |
| html2md | 0.2.15 | html5ever | DOM-based converter |
| fast_html2md | 0.0.61 | lol_html | Streaming rewriter |
| htmd | 0.5.4 | html5ever | DOM-based converter |
| html_to_markdown_rs | 3.1.0 | html5ever | DOM-based converter |
| html2text | 0.16.7 | html5ever | Text-oriented converter |
| dom_smoothie | 0.17.0 | dom_query (html5ever) | DOM-oriented converter |

These libraries differ in design, approach, and goals, so the numbers are not a like-for-like comparison.

Conversion Speed

| Dataset | mdka v2 | mdka v1 | html2md | fast_html2md | htmd | html_to_markdown_rs | html2text | dom_smoothie |
|---|---|---|---|---|---|---|---|---|
| small | 131.52 µs | 131.66 µs | 132.21 µs | 79.50 µs | 90.47 µs | 107.82 µs | 350.92 µs | 317.37 µs |
| medium | 1.3040 ms | 2.2866 ms | 1.5266 ms | 887.59 µs | 1.0562 ms | 1.1660 ms | 3.3999 ms | 2.7643 ms |
| large | 12.336 ms | 75.751 ms | 12.455 ms | 7.0399 ms | 7.7896 ms | 9.6825 ms | 29.854 ms | 26.062 ms |
| deep_nest | 32.620 ms | 373.10 ms | 36.834 ms | 5.9868 ms | 72.481 ms | 96.744 ms | 30.903 ms | 29.408 ms |
| flat | 5.6253 ms | 24.817 ms | 6.7911 ms | 4.2114 ms | 5.5321 ms | 4.6975 ms | 14.023 ms | 29.408 ms |
| malformed | 31.712 µs | 40.178 µs | 71.778 µs | 52.948 µs | 62.302 µs | 41.109 µs | 96.822 µs | 5.6401 ms |

mdka v2 is ahead of mdka v1 on every dataset in this run. On the smallest input the two are effectively tied (131.52 µs vs 131.66 µs), but the gap widens as the input gets larger or structurally harder: about 1.75× faster on medium, 6.1× on large, 11.4× on deep_nest, and 4.4× on flat. On malformed input, v2 is about 1.27× faster than v1 and has the lowest median of all libraries tested.
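The ratios quoted above are simply the v1 median divided by the v2 median. A short sketch of that arithmetic, using the figures from the Conversion Speed table (the constant and function names are illustrative):

```rust
/// (dataset, mdka v1 median, mdka v2 median), both in milliseconds,
/// copied from the Conversion Speed table.
const MEDIANS_MS: [(&str, f64, f64); 5] = [
    ("medium", 2.2866, 1.3040),
    ("large", 75.751, 12.336),
    ("deep_nest", 373.10, 32.620),
    ("flat", 24.817, 5.6253),
    ("malformed", 0.040178, 0.031712),
];

/// How many times longer v1 takes than v2 on the same input.
fn speedup(v1_ms: f64, v2_ms: f64) -> f64 {
    v1_ms / v2_ms
}

/// One formatted line per dataset, ready for printing.
fn report() -> Vec<String> {
    MEDIANS_MS
        .iter()
        .map(|&(name, v1, v2)| format!("{name}: {:.2}x", speedup(v1, v2)))
        .collect()
}
```

Because these are medians from a single run with recorded outliers, the second decimal place should not be over-interpreted.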

Memory Allocation

| Dataset | mdka v2 | mdka v1 | html2md | fast_html2md | htmd | html_to_markdown_rs | html2text | dom_smoothie |
|---|---|---|---|---|---|---|---|---|
| small | 113.5 KB | 240 KB | 231 KB | 154 KB | 93.6 KB | 232.5 KB | 764.5 KB | 325.4 KB |
| medium | 984.6 KB | 2.03 MB | 1.95 MB | 1.52 MB | 1.01 MB | 1.95 MB | 8.50 MB | 2.85 MB |
| large | 8.00 MB | 17.0 MB | 16.76 MB | 11.98 MB | 7.85 MB | 16.76 MB | 74.89 MB | 23.08 MB |
| deep_nest | 3.00 MB | 4.71 MB | 2.55 MB | 6.85 MB | 1.96 MB | 2.55 MB | 18.48 MB | |
| flat | 3.93 MB | 7.90 MB | 7.87 MB | 7.46 MB | 4.84 MB | 7.87 MB | 40.28 MB | 35.47 MB |
| malformed | 44.7 KB | 91.6 KB | 71.4 KB | 145 KB | 62.3 KB | 71.4 KB | 464.4 KB | 1.63 MB |

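In this run, mdka v2 allocates less heap than v1 on every dataset: roughly half on five of the six, and about 64% of v1's figure on deep_nest (3.00 MB vs 4.71 MB).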
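Heap figures of this kind can be collected with a counting allocator installed via #[global_allocator]. The sketch below is an assumption about the general technique, not the contents of mdka's alloc_counter.rs: it wraps the system allocator and counts total bytes requested, whereas the real counter may wrap a different allocator or track a different metric. The names `Counting`, `ALLOCATED`, and `measure` are hypothetical.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Total bytes ever requested from the allocator.
/// Monotonic: frees are deliberately not subtracted.
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

/// Wraps the system allocator and records the size of every allocation.
struct Counting;

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        // Delegate the actual allocation to the system allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: Counting = Counting;

/// Runs `f` and returns its result plus the bytes allocated while it ran.
/// Allocations made concurrently by other threads are counted too.
fn measure<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let before = ALLOCATED.load(Ordering::Relaxed);
    let value = f();
    let after = ALLOCATED.load(Ordering::Relaxed);
    (value, after - before)
}
```

Wrapping a conversion call in `measure` yields the bytes requested during that call, which is one reasonable way to compare allocation behaviour across libraries on the same input.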

Summary

As shown in the results, the transition to v2 has allowed us to achieve our objectives of being lightweight and memory-efficient while maintaining competitive speed.

We recognize that other libraries may offer more features or different trade-offs that make them better suited for certain applications. mdka aims to be the best choice for those who prioritize a simple, “Unix-style” tool that does one thing—conversion—with the smallest possible footprint.

Architecture

Workspace Layout

mdka/
├── src/               mdka library crate (lib only)
│   ├── lib.rs             Public API surface
│   ├── options.rs         ConversionMode, ConversionOptions
│   ├── traversal.rs       Markdown conversion traversal
│   ├── renderer.rs        MarkdownRenderer state machine
│   ├── utils.rs           Whitespace normalisation + escaping
│   └── alloc_counter.rs   Custom allocator (for benchmarks)
├── tests/             integration test modules
│   └── utils/preprocessor.rs    DOM pre-processing pipeline
├── cli/               mdka-cli binary crate
│   └── src/main.rs        Argument parsing + dispatch
├── node/              Node.js bindings (napi-rs v3)
├── python/            Python bindings (PyO3 v0)
├── benches/           criterion benchmarks
└── examples/          Allocation measurement tool

Conversion Pipeline

Each call to html_to_markdown_with follows these steps:

HTML string
    │
    ▼
[1] Parse          scraper::Html::parse_document()
    │               → html5ever DOM tree (tolerant HTML5 parsing)
    ▼
[2] Pre-process    preprocessor::preprocess(&doc, opts)
    │               → filtered HTML string
    │               Non-recursive DFS over ego-tree nodes
    │               Drops: script, style, iframe, …
    │               Filters attributes per ConversionOptions
    │               Removes shell elements (if opted in)
    │               Unwraps anonymous wrappers (if opted in)
    ▼
[3] Re-parse       scraper::Html::parse_document(&cleaned)
    │               → clean DOM for conversion
    ▼
[4] Convert        traversal::traverse(&doc)
    │               → Markdown string
    │               Non-recursive DFS with Enter/Leave events
    │               Drives MarkdownRenderer via event callbacks
    ▼
[5] Finalise       renderer.finish()
                    → trim leading/trailing whitespace
                    → ensure single trailing newline
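Steps [4] and [5] can be sketched with a toy document model. This is a hedged illustration of a non-recursive DFS with Enter/Leave events, not mdka's traversal module: the `Node` and `Event` types, the `delimiter` helper, and the three supported tags are all assumptions made for the example.

```rust
/// Minimal DOM stand-in stored in an arena (hypothetical; mdka walks
/// scraper/ego-tree nodes instead).
enum Node {
    Element { tag: &'static str, children: Vec<usize> },
    Text(&'static str),
}

/// Work items kept on the explicit stack.
enum Event {
    Enter(usize),
    Leave(usize),
}

/// Inline delimiter for the few tags this sketch knows about.
fn delimiter(tag: &str) -> &'static str {
    match tag {
        "em" => "*",
        "strong" => "**",
        _ => "",
    }
}

/// Walks the arena depth-first without recursion and renders Markdown.
fn render(nodes: &[Node], root: usize) -> String {
    let mut out = String::new();
    let mut stack = vec![Event::Enter(root)];
    while let Some(event) = stack.pop() {
        match event {
            Event::Enter(id) => match &nodes[id] {
                Node::Text(text) => out.push_str(text),
                Node::Element { tag, children } => {
                    out.push_str(delimiter(tag));
                    // Schedule Leave first so it pops after all children.
                    stack.push(Event::Leave(id));
                    // Push in reverse so the first child is popped first.
                    for &child in children.iter().rev() {
                        stack.push(Event::Enter(child));
                    }
                }
            },
            Event::Leave(id) => {
                if let Node::Element { tag, .. } = &nodes[id] {
                    out.push_str(delimiter(tag));
                    if *tag == "p" {
                        out.push_str("\n\n"); // paragraph break
                    }
                }
            }
        }
    }
    // Finalise: trim surrounding whitespace, end with exactly one newline.
    let mut finished = out.trim().to_string();
    finished.push('\n');
    finished
}
```

Pushing the Leave event before the children is what lets a single stack produce properly paired open and close callbacks without any recursion.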

MarkdownRenderer

MarkdownRenderer is a state machine that maintains:

  • output: the accumulated Markdown string
  • list_stack: tracks nested ordered/unordered lists
  • blockquote_depth: counts blockquote nesting level
  • in_pre: whether inside a <pre> block
  • at_line_start: deferred prefix flag for blockquote > emission
  • newlines_emitted: prevents double-blank-line accumulation

The at_line_start flag is key: rather than emitting > prefixes immediately when entering a blockquote, the renderer defers them until actual content is written. This ensures nested blockquotes emit the correct number of > characters regardless of how many block elements intervene.
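A stripped-down sketch of the deferred prefix follows. The field names come from the list above, but the struct itself, its methods, and the exact prefix format (`"> "` per level) are assumptions for illustration rather than MarkdownRenderer's real interface.

```rust
/// Tiny renderer that shows only the deferred-prefix idea.
struct Renderer {
    output: String,
    blockquote_depth: usize,
    at_line_start: bool,
}

impl Renderer {
    fn new() -> Self {
        Renderer {
            output: String::new(),
            blockquote_depth: 0,
            at_line_start: true,
        }
    }

    /// Entering a blockquote only changes the depth; nothing is written.
    fn enter_blockquote(&mut self) {
        self.blockquote_depth += 1;
    }

    /// Leaving a blockquote likewise writes nothing.
    fn leave_blockquote(&mut self) {
        self.blockquote_depth = self.blockquote_depth.saturating_sub(1);
    }

    /// Writes content. If this is the first content on the line, the
    /// prefixes are emitted now, using the depth in force at this moment.
    fn write(&mut self, text: &str) {
        if self.at_line_start {
            for _ in 0..self.blockquote_depth {
                self.output.push_str("> ");
            }
            self.at_line_start = false;
        }
        self.output.push_str(text);
    }

    /// Ends the current line and re-arms the deferred prefix.
    fn newline(&mut self) {
        self.output.push('\n');
        self.at_line_start = true;
    }
}
```

Because the prefix is computed when content is written rather than when the blockquote is entered, entering two blockquotes back to back and then writing text yields one line with two markers, not stray marker-only lines.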

Language Bindings

Both the Node.js and Python bindings are thin wrappers:

  • Node.js (napi-rs): exposes sync and async variants. The async variants run the conversion on a blocking worker thread (tokio::task::spawn_blocking), so the Node.js event loop stays free during conversion.
  • Python (PyO3): the batch function html_to_markdown_many calls py.detach(), releasing the GIL while rayon converts in parallel.

The binding crates (mdka-node, mdka-python) have no conversion logic of their own — they call the same Rust functions as the library and CLI.

Features

Crash Resistance

mdka uses non-recursive DFS traversal throughout. An explicit Vec stack replaces the call stack, so documents with arbitrarily deep nesting will not cause a stack overflow. This has been tested with 10,000 levels of nested <div> elements.

Some fast converters use recursive tree traversal and can overflow the stack on deeply nested input. If your input source is not fully controlled, crash resistance matters.

Five Conversion Modes

Rather than a single fixed conversion strategy, mdka offers five named modes that tune the pre-processing pipeline:

  • Balanced — readable output for general use
  • Strict — maximum attribute retention for debugging
  • Minimal — body text only; good for LLM input preparation
  • Semantic — preserves ARIA and document structure
  • Preserve — maximum fidelity for archiving

Each mode can be further customised with per-call option flags. See Conversion Modes and ConversionOptions.

Parallel File Conversion

html_files_to_markdown and html_files_to_markdown_with use rayon to convert multiple files in parallel. Each file’s result is independent — one failed file does not stop the batch.
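The "one result per file" behaviour can be sketched with the standard library alone. Note the substitutions: this example uses std::thread::scope with one thread per file instead of rayon's thread pool, and `convert` is a placeholder rather than mdka's converter. The function names are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;

/// Stand-in for the real converter so the sketch has no dependencies.
fn convert(html: &str) -> String {
    html.trim().to_string()
}

/// Converts one file; any I/O failure is captured in the Result.
fn convert_file(path: &Path) -> io::Result<String> {
    Ok(convert(&fs::read_to_string(path)?))
}

/// Converts every path on its own scoped thread and returns one Result
/// per input, in input order. A failure for one path leaves the other
/// results untouched.
fn convert_files(paths: &[PathBuf]) -> Vec<(PathBuf, io::Result<String>)> {
    thread::scope(|scope| {
        // Start all workers before joining any of them.
        let handles: Vec<_> = paths
            .iter()
            .map(|path| scope.spawn(move || convert_file(path)))
            .collect();
        paths
            .iter()
            .cloned()
            .zip(handles.into_iter().map(|h| h.join().expect("worker panicked")))
            .collect()
    })
}
```

Returning a `Result` per path, rather than a single `Result` for the whole batch, is what lets callers log or retry the failures while still using every successful conversion.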

The Node.js and Python bindings expose this as an async function (htmlFilesToMarkdown, html_files_to_markdown) so the thread pool work does not block the caller’s event loop or hold the GIL.

Multi-Language API

The same Rust implementation is accessible from three languages:

| Language | Package | Mechanism |
|---|---|---|
| Rust | mdka on crates.io | native library |
| Node.js | mdka on npm | napi-rs native module |
| Python | mdka on PyPI | PyO3 extension module |

All three call the same underlying conversion code and produce identical output for identical input.

html5ever Parser Foundation

The HTML parser is scraper, which is built on html5ever. html5ever implements the HTML5 parsing algorithm, the same one that web browsers use.

This means:

  • Missing closing tags are inferred correctly
  • Unknown elements are preserved (not silently dropped)
  • Malformed attribute syntax is normalised
  • The result is always a valid DOM tree, no matter the input

Predictable, Deterministic Output

For a given HTML input and ConversionOptions, mdka always produces the same Markdown string. There is no randomisation, no date-stamping, and no version-dependent output variation within a semver major version.

Minimal Dependencies

The runtime dependencies of the mdka library crate are:

| Crate | Purpose |
|---|---|
| scraper | HTML parsing (html5ever wrapper) |
| ego-tree | DOM tree traversal |
| rayon | Parallel file conversion |
| tikv-jemallocator, tikv-jemalloc-ctl | jemalloc allocator and its control interface, used to limit fragmentation and scale under concurrent load |
| thiserror | MdkaError derive macro |

Benchmark and comparison dependencies (criterion, competitors) are [dev-dependencies] and do not affect library consumers.