mdka

mdka is an HTML-to-Markdown converter written in Rust. The “ka” comes from “化 (か)”, a Japanese suffix that denotes conversion.

It aims to strike a practical balance between conversion quality and runtime efficiency — readable output from real-world HTML, without sacrificing speed or memory.

At a Glance

What you give it	What you get back
Any HTML string — a full page, a snippet, CMS output, SPA-rendered DOM	Clean, readable Markdown
A list of HTML files	Parallel Markdown output via rayon
A conversion mode (`minimal`, `semantic`, …)	Pre-processed output tuned for your use case

Key Properties

Parser foundation: scraper, which is built on html5ever — the same battle-tested parser used by the Servo browser engine. It handles malformed, deeply-nested, and real-world HTML gracefully.
Crash-resistant: a non-recursive DFS traversal means even 10,000 levels of nesting will not overflow the stack.
Configurable: five conversion modes let you tune the pre-processing pipeline — from noise-free LLM input to lossless archiving.
Multi-language: available as a Rust library, a Node.js package (napi-rs), and a Python package (PyO3).

When to Choose mdka

mdka is a good fit if you need:

Stable, predictable output from diverse HTML sources (CMS, SPA, scraped pages)
Mode-based pre-processing to strip navigation, preserve ARIA, or retain attributes
Memory efficiency at scale (bulk file conversion, streaming pipelines)
Multi-language access from a single underlying Rust implementation

If raw speed on simple, well-formed HTML is the only concern, a streaming rewriter will be faster.

New to mdka? Start with Installation.
Ready to integrate? Jump to Usage & Examples.
Evaluating? Read Design Philosophy and Performance Characteristics.

Installation

As a Rust Library

Add mdka to your Cargo.toml:

[dependencies]
mdka = "2"

That is the only step. mdka has no system dependencies.

Minimum Supported Rust Version: 1.85 (2024 Edition)

As a CLI Binary

Build from source using the mdka-cli crate in the workspace:

git clone https://github.com/example/mdka
cd mdka
cargo build --release -p mdka-cli
# Binary: ./target/release/mdka

Or install directly with cargo:

cargo install mdka-cli

As a Node.js Package

npm install mdka
# or
yarn add mdka

Requires Node.js 16 or later.
Pre-built binaries are bundled for the common architectures on Linux, macOS, and Windows.
On any other platform, run npm run build with Rust installed.

As a Python Package

pip install mdka

Requires Python 3.8 or later. Pre-built wheels are provided for CPython on major platforms. To build from source: pip install mdka --no-binary mdka with Rust installed.

Usage & Examples

Choose the section for your environment:

Rust — integrate directly into a Rust project
Node.js — use from JavaScript or TypeScript
Python — use from Python
CLI — use from the command line

All four share the same underlying conversion engine, so results are consistent across languages.

Usage — Rust

Basic Conversion

use mdka::html_to_markdown;

fn main() {
    let html = r#"
        <h1>Getting Started</h1>
        <p>mdka converts <strong>HTML</strong> to <em>Markdown</em>.</p>
        <ul>
            <li>Fast</li>
            <li>Configurable</li>
            <li>Crash-resistant</li>
        </ul>
    "#;

    let md = html_to_markdown(html);
    println!("{md}");
}

Output:

# Getting Started

mdka converts **HTML** to *Markdown*.

- Fast
- Configurable
- Crash-resistant

Conversion with Options

Use html_to_markdown_with to control the conversion pipeline via ConversionOptions.

#![allow(unused)]
fn main() {
use mdka::{html_to_markdown_with};
use mdka::options::{ConversionMode, ConversionOptions};

// Strip navigation and extract body text — good for LLM input
let mut opts = ConversionOptions::for_mode(ConversionMode::Minimal);
opts.drop_interactive_shell = true;

let html = r#"
    <header><nav><a href="/">Home</a></nav></header>
    <main>
        <article>
            <h1>Article Title</h1>
            <p>The main content of the page.</p>
        </article>
    </main>
    <footer>Copyright 2025</footer>
"#;

let md = html_to_markdown_with(html, &opts);
assert!(md.contains("# Article Title"));
assert!(!md.contains("Home"));       // nav removed
assert!(!md.contains("Copyright"));  // footer removed
}

Converting a Single File

use mdka::{html_file_to_markdown, MdkaError};

// main returns a Result so that `?` can propagate IO errors.
fn main() -> Result<(), MdkaError> {
    // Output goes to the same directory as the input: page.html → page.md
    let result = html_file_to_markdown("page.html", None::<&str>)?;
    println!("{} → {}", result.src.display(), result.dest.display());

    // Output goes to a specific directory: page.html → out/page.md
    let result = html_file_to_markdown("page.html", Some("out/"))?;
    println!("{} → {}", result.src.display(), result.dest.display());
    Ok(())
}

Bulk Parallel Conversion

use mdka::html_files_to_markdown;
use std::path::Path;

// main returns a Result so that `?` can propagate the create_dir_all error.
fn main() -> std::io::Result<()> {
    let files = vec!["a.html", "b.html", "c.html"];
    let out_dir = Path::new("out/");
    // The bulk functions expect the output directory to exist already.
    std::fs::create_dir_all(out_dir)?;

    for (src, result) in html_files_to_markdown(&files, out_dir) {
        match result {
            Ok(dest) => println!("{} → {}", src, dest.display()),
            Err(e)   => eprintln!("Error: {src}: {e}"),
        }
    }
    Ok(())
}

Conversion runs in parallel using rayon. The number of threads defaults to the number of logical CPU cores.

Bulk Conversion with Options

#![allow(unused)]
fn main() {
use mdka::{html_files_to_markdown_with};
use mdka::options::{ConversionMode, ConversionOptions};
use std::path::Path;

let opts = ConversionOptions::for_mode(ConversionMode::Semantic);
let files = vec!["a.html", "b.html"];
let results = html_files_to_markdown_with(&files, Path::new("out/"), &opts);
}

Conversion Modes at a Glance

Mode	Best for
`Balanced`	General use; default
`Strict`	Debugging, diff comparison
`Minimal`	LLM pre-processing, compression
`Semantic`	SPA content, accessibility-aware output
`Preserve`	Archiving, audit trails

See Conversion Modes for full details.

Error Handling

#![allow(unused)]
fn main() {
use mdka::{html_file_to_markdown, MdkaError};

match html_file_to_markdown("missing.html", None::<&str>) {
    Ok(result) => println!("→ {}", result.dest.display()),
    Err(MdkaError::Io(e)) => eprintln!("IO error: {e}"),
}
}

MdkaError currently has one variant: Io, wrapping std::io::Error. html_to_markdown and html_to_markdown_with are infallible — they always return a String and never panic on any input, no matter how malformed.

Usage — Node.js

Installation

npm install mdka

Basic Conversion

const { htmlToMarkdown } = require('mdka')

const html = `
  <h1>Hello</h1>
  <p>mdka converts <strong>HTML</strong> to <em>Markdown</em>.</p>
`
const md = htmlToMarkdown(html)
console.log(md)
// # Hello
//
// mdka converts **HTML** to *Markdown*.

Async Conversion

htmlToMarkdownAsync offloads work to a Rust thread pool, keeping the Node.js event loop free:

const { htmlToMarkdownAsync } = require('mdka')

const md = await htmlToMarkdownAsync(html)

// Concurrent conversion of many pages
const results = await Promise.all(pages.map(p => htmlToMarkdownAsync(p.html)))

Conversion with Options

const { htmlToMarkdownWith, htmlToMarkdownWithAsync } = require('mdka')

// Strip nav/header/footer: useful for content extraction
const md = htmlToMarkdownWith(html, {
  mode: 'minimal',
  dropInteractiveShell: true,
})

// Async version (call it from an async function or an ES module)
const mdSemantic = await htmlToMarkdownWithAsync(html, { mode: 'semantic' })

Available mode strings: "balanced" (default), "strict", "minimal", "semantic", "preserve".

Single File Conversion

const { htmlFileToMarkdown, htmlFileToMarkdownWith } = require('mdka')

// Output to same directory: page.html → page.md
const sameDir = await htmlFileToMarkdown('page.html')
console.log(`${sameDir.src} → ${sameDir.dest}`)

// Output to specific directory
const inOutDir = await htmlFileToMarkdown('page.html', 'out/')

// With options
const withOptions = await htmlFileToMarkdownWith('page.html', 'out/', {
  mode: 'minimal',
  dropInteractiveShell: true,
})

Bulk Parallel Conversion

const { htmlFilesToMarkdown, htmlFilesToMarkdownWith } = require('mdka')

const files = ['a.html', 'b.html', 'c.html']
const results = await htmlFilesToMarkdown(files, 'out/')

// Each entry reports its own outcome; one failure does not abort the batch
for (const r of results) {
  if (r.error) console.error(`${r.src}: ${r.error}`)
  else         console.log(`${r.src} → ${r.dest}`)
}

// With options
const semanticResults = await htmlFilesToMarkdownWith(files, 'out/', {
  mode: 'semantic',
  preserveAriaAttrs: true,
})

TypeScript

Type definitions are bundled. No @types/ package is needed:

import {
  htmlToMarkdown,
  htmlToMarkdownWith,
  htmlToMarkdownAsync,
  htmlFileToMarkdown,
  htmlFilesToMarkdown,
  ConversionOptions,
  ConvertResult,
} from 'mdka'

const opts: ConversionOptions = {
  mode: 'minimal',
  dropInteractiveShell: true,
}
const md: string = htmlToMarkdownWith(html, opts)

Usage — Python

Installation

pip install mdka

Basic Conversion

import mdka

html = """
<h1>Hello</h1>
<p>mdka converts <strong>HTML</strong> to <em>Markdown</em>.</p>
"""

md = mdka.html_to_markdown(html)
print(md)
# # Hello
#
# mdka converts **HTML** to *Markdown*.

Conversion with Options

import mdka

# Strip nav/header/footer — useful for LLM pre-processing
md = mdka.html_to_markdown_with(
    html,
    mode=mdka.ConversionMode.Minimal,
    drop_interactive_shell=True,
)

# Preserve ARIA attributes for accessibility-aware output
md = mdka.html_to_markdown_with(
    html,
    mode=mdka.ConversionMode.Semantic,
    preserve_aria_attrs=True,
)

Available modes: ConversionMode.Balanced (default), Strict, Minimal, Semantic, Preserve.

Parallel Batch Conversion (GIL released)

html_to_markdown_many releases the GIL and uses rayon for parallel conversion:

import mdka

pages = ["<h1>A</h1>", "<p>B</p>", "<ul><li>C</li></ul>"]
results = mdka.html_to_markdown_many(pages)
# ['# A\n', 'B\n', '- C\n']

This is faster than calling html_to_markdown in a Python loop for large batches.

Single File Conversion

import mdka

# Output to same directory: page.html → page.md
result = mdka.html_file_to_markdown("page.html")
print(f"{result.src} → {result.dest}")

# Output to a specific directory
result = mdka.html_file_to_markdown("page.html", "out/")

# With options
result = mdka.html_file_to_markdown(
    "page.html",
    "out/",
    mode=mdka.ConversionMode.Minimal,
    drop_interactive_shell=True,
)

Bulk File Conversion

import mdka

files = ["a.html", "b.html", "c.html"]
results = mdka.html_files_to_markdown(files, "out/")

for r in results:
    if r.ok:
        print(f"{r.src} → {r.dest}")
    else:
        print(f"Error: {r.src}: {r.error}")

Error Handling

import mdka

try:
    result = mdka.html_file_to_markdown("missing.html")
except mdka.MdkaError as e:
    print(f"Conversion failed: {e}")

MdkaError is raised for IO errors (file not found, permission denied, etc.). html_to_markdown and html_to_markdown_with are always safe to call — they never raise exceptions regardless of input quality.

Type Annotations

mdka ships with a py.typed marker (PEP 561). All public symbols are annotated:

from mdka import (
    html_to_markdown,          # (html: str) -> str
    html_to_markdown_with,     # (html: str, mode=..., **flags) -> str
    html_to_markdown_many,     # (html_list: list[str]) -> list[str]
    html_file_to_markdown,     # (path, out_dir=None, ...) -> ConvertResult
    html_files_to_markdown,    # (paths, out_dir, ...) -> list[BulkConvertResult]
    ConversionMode,            # enum
    ConvertResult,             # dataclass: src, dest (str)
    BulkConvertResult,         # dataclass: src, dest?, error?, ok
    MdkaError,                 # exception
)

Usage — CLI

The mdka command-line tool is provided by the mdka-cli crate.

Quick Reference

mdka [OPTIONS] [FILE...]

Run mdka --help to see the full option list with descriptions.

Common Patterns

Convert from stdin:

echo '<h1>Hello</h1>' | mdka
curl https://example.com | mdka

Convert a single file (output goes to the same directory):

mdka page.html          # → page.md

Convert to a specific directory:

mdka -o out/ page.html  # → out/page.md

Bulk conversion (-o is required for multiple files):

mdka -o out/ docs/*.html

Choose a conversion mode:

mdka --mode minimal --drop-shell page.html   # extract body text
mdka --mode preserve -o archive/ *.html      # maximum fidelity

All Options

Flag	Description
`-o, --output <DIR>`	Output directory (default: same as input)
`-m, --mode <MODE>`	`balanced` · `strict` · `minimal` · `semantic` · `preserve`
`--preserve-ids`	Keep `id` attributes
`--preserve-classes`	Keep `class` attributes
`--preserve-data`	Keep `data-*` attributes
`--preserve-aria`	Keep `aria-*` attributes
`--drop-shell`	Remove `nav`, `header`, `footer`, `aside`
`-h, --help`	Show help

For full mode descriptions see Conversion Modes.

API Reference

mdka exposes a small, focused public API. The table below shows the complete surface — every function and type you need, nothing you don’t.

Functions

Function	Language	Description
`html_to_markdown`	Rust	Convert HTML string → Markdown (default mode)
`html_to_markdown_with`	Rust	Convert with explicit `ConversionOptions`
`html_file_to_markdown`	Rust	Convert one file; output alongside input or to `out_dir`
`html_file_to_markdown_with`	Rust	Single file with options
`html_files_to_markdown`	Rust	Parallel bulk conversion (rayon)
`html_files_to_markdown_with`	Rust	Bulk with options

Types

Type	Description
`ConversionMode`	Enum: `Balanced` · `Strict` · `Minimal` · `Semantic` · `Preserve`
`ConversionOptions`	Controls pre-processing per-call; built via `for_mode()`
`ConvertResult`	Returned by single-file functions: `src` + `dest` paths
`MdkaError`	The only error type: wraps `std::io::Error`

Guarantees

html_to_markdown and html_to_markdown_with never panic. They accept any &str (which is always valid UTF-8), including empty strings, text that is not HTML at all, and deeply nested HTML.
File functions propagate IO errors via Result<_, MdkaError>.
Output is always valid UTF-8.
Output always ends with a single newline when the input produces any content.

Core Functions

`html_to_markdown`

pub fn html_to_markdown(html: &str) -> String

Converts an HTML string to Markdown using the default Balanced mode.

Input: Any valid or malformed HTML string. Empty strings are accepted.
Output: A Markdown string. Always ends with \n if the input produced any content.
Errors: None — this function is infallible.

#![allow(unused)]
fn main() {
let md = mdka::html_to_markdown("<h1>Hello</h1>");
assert_eq!(md, "# Hello\n");
}

`html_to_markdown_with`

pub fn html_to_markdown_with(html: &str, opts: &ConversionOptions) -> String

Same as html_to_markdown, but accepts a ConversionOptions value that controls pre-processing and conversion behaviour.

Input: Any HTML string + a ConversionOptions value.
Output: Markdown string.
Errors: None.

#![allow(unused)]
fn main() {
use mdka::options::{ConversionMode, ConversionOptions};

let mut opts = ConversionOptions::for_mode(ConversionMode::Minimal);
opts.drop_interactive_shell = true;
let md = mdka::html_to_markdown_with(html, &opts);
}

`html_file_to_markdown`

pub fn html_file_to_markdown(
    path: impl AsRef<Path>,
    out_dir: Option<impl AsRef<Path>>,
) -> Result<ConvertResult, MdkaError>

Reads one HTML file, converts it, and writes a .md file.

path: Path to the input .html file.
out_dir:

None → the .md file is written alongside the input (same directory, stem unchanged).
Some(dir) → the .md file is written into dir. The directory is created automatically if it does not exist.

Returns: ConvertResult with the resolved src and dest paths.
Errors: MdkaError::Io if the file cannot be read or the output cannot be written.

// main returns a Result so that `?` can propagate IO errors.
fn main() -> Result<(), mdka::MdkaError> {
    // page.html → page.md in the same folder
    let r = mdka::html_file_to_markdown("page.html", None::<&str>)?;
    println!("{} → {}", r.src.display(), r.dest.display());

    // page.html → out/page.md
    let r = mdka::html_file_to_markdown("page.html", Some("out/"))?;
    println!("{} → {}", r.src.display(), r.dest.display());
    Ok(())
}
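
The out_dir rules above can be pictured with a small path helper. This is only a sketch using std::path; dest_path is a hypothetical name for illustration and is not part of mdka's API, and the library's real implementation may differ.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: work out where the .md file for `src` would go.
fn dest_path(src: &Path, out_dir: Option<&Path>) -> PathBuf {
    // Keep the directory and stem, swap the extension: page.html → page.md
    let renamed = src.with_extension("md");
    if let Some(dir) = out_dir {
        // With an output directory, only the file name is carried over.
        if let Some(name) = renamed.file_name() {
            return dir.join(name);
        }
    }
    renamed
}
```

Under these assumptions, dest_path(Path::new("docs/page.html"), None) yields docs/page.md, and passing Some(Path::new("out/")) yields out/page.md.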

`html_file_to_markdown_with`

pub fn html_file_to_markdown_with(
    path: impl AsRef<Path>,
    out_dir: Option<impl AsRef<Path>>,
    opts: &ConversionOptions,
) -> Result<ConvertResult, MdkaError>

Same as html_file_to_markdown, but applies the given ConversionOptions.

`html_files_to_markdown`

pub fn html_files_to_markdown<'a, P>(
    paths: &'a [P],
    out_dir: &Path,
) -> Vec<(&'a P, Result<PathBuf, MdkaError>)>
where
    P: AsRef<Path> + Sync,

Converts multiple HTML files in parallel using rayon.

paths: Slice of paths to input HTML files.
out_dir: Directory for all output .md files. Must exist before calling (unlike single-file variants which create it automatically).
Returns: A Vec of (input_path, Result<output_path, error>) pairs in the same order as paths. Each element represents the outcome for one file independently.

use std::path::Path;

// main returns a Result so that `?` can propagate the create_dir_all error.
fn main() -> std::io::Result<()> {
    let files = vec!["a.html", "b.html", "c.html"];
    // out_dir must exist before the bulk call.
    std::fs::create_dir_all("out/")?;

    for (src, result) in mdka::html_files_to_markdown(&files, Path::new("out/")) {
        match result {
            Ok(dest) => println!("{} → {}", src, dest.display()),
            Err(e)   => eprintln!("{src}: {e}"),
        }
    }
    Ok(())
}
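
To illustrate the return shape (one independent Result per input, in input order), here is a sketch built on std::thread::scope. It deliberately does not use rayon, and convert_one is a hypothetical stand-in for the real per-file work, so treat it as a model of the contract rather than mdka's implementation.

```rust
use std::thread;

/// Hypothetical stand-in for a per-file conversion that may fail.
fn convert_one(name: &str) -> Result<String, String> {
    if name.ends_with(".html") {
        Ok(name.replace(".html", ".md"))
    } else {
        Err(format!("not an HTML file: {name}"))
    }
}

/// Run each job on its own scoped thread; results come back in input order.
fn convert_all<'a>(names: &[&'a str]) -> Vec<(&'a str, Result<String, String>)> {
    thread::scope(|s| {
        // Spawn everything first so the jobs can overlap.
        let handles: Vec<_> = names
            .iter()
            .map(|&name| (name, s.spawn(move || convert_one(name))))
            .collect();
        // Join in spawn order, which keeps output order equal to input order.
        handles
            .into_iter()
            .map(|(name, h)| (name, h.join().expect("worker panicked")))
            .collect()
    })
}
```

A failing entry only affects its own slot: the other files still produce Ok values.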

`html_files_to_markdown_with`

pub fn html_files_to_markdown_with<'a, P>(
    paths: &'a [P],
    out_dir: &Path,
    opts: &ConversionOptions,
) -> Vec<(&'a P, Result<PathBuf, MdkaError>)>
where
    P: AsRef<Path> + Sync,

Same as html_files_to_markdown, but applies the given ConversionOptions to every file.

`ConvertResult`

#![allow(unused)]
fn main() {
pub struct ConvertResult {
    pub src:  PathBuf,
    pub dest: PathBuf,
}
}

Returned by the single-file functions. Both fields are absolute or relative paths depending on how path was passed in.

Note: The bulk functions (html_files_to_markdown*) return (&P, Result<PathBuf, MdkaError>) tuples rather than ConvertResult, because individual files within a batch may fail independently.

Conversion Modes

A conversion mode determines how mdka pre-processes the parsed DOM before converting to Markdown. Choose the mode that matches the origin and purpose of your HTML.

Overview

Mode	Best for	Default?
`Balanced`	General use, blog posts, documentation pages	✅ Yes
`Strict`	Debugging, comparing before/after, diff-friendly output
`Minimal`	LLM pre-processing, text extraction, compression
`Semantic`	SPA output, accessibility-aware pipelines, screen-reader content
`Preserve`	Archiving, audit trails, round-trip fidelity

Balanced (default)

Goal: Produce clean, readable Markdown without losing meaningful content.

Removes decorative attributes: class, style, data-*
Keeps semantic attributes: href, src, alt, aria-*, lang, dir
Keeps id attributes (useful for anchor links)
Does not remove navigation or structural elements

Use when: You want good-looking output without extra configuration.

#![allow(unused)]
fn main() {
let md = mdka::html_to_markdown(html); // Balanced is the default
}

Strict

Goal: Preserve as much of the original HTML information as possible. Output may be noisier, but nothing is silently dropped.

Keeps class, data-*, id, aria-*, and most other attributes
Does not unwrap wrapper elements
Suitable for comparing two versions of a page, or for debugging unexpected output from other modes

#![allow(unused)]
fn main() {
use mdka::options::{ConversionMode, ConversionOptions};

let opts = ConversionOptions::for_mode(ConversionMode::Strict);
let md = mdka::html_to_markdown_with(html, &opts);
}

Minimal

Goal: Extract the body text and essential structure; discard everything else.

Removes all decorative attributes (class, style, data-*, aria-*)
Optionally removes shell elements (nav, header, footer, aside) when drop_interactive_shell is true
Unwraps generic wrappers (div, span, section, article) that add no meaning
Ideal for piping content into an LLM prompt or a search index

#![allow(unused)]
fn main() {
let mut opts = ConversionOptions::for_mode(ConversionMode::Minimal);
opts.drop_interactive_shell = true;
let md = mdka::html_to_markdown_with(html, &opts);
}

Semantic

Goal: Preserve document meaning and accessibility structure over visual appearance.

Strongly retains aria-* attributes
Retains lang and dir
Retains heading hierarchy, list structure, link targets, and image alt text
Removes purely visual attributes (class, style)
Unwraps anonymous wrappers
Good for SPA-rendered HTML where ARIA attributes carry structural meaning

#![allow(unused)]
fn main() {
let opts = ConversionOptions::for_mode(ConversionMode::Semantic);
let md = mdka::html_to_markdown_with(html, &opts);
}

Preserve

Goal: Maximum fidelity to the original HTML. Lose as little information as possible.

Retains all attributes, including class, data-*, aria-*, id, and unknowns
Retains HTML comments in the pre-processed output
Does not unwrap any elements
Intended for archiving or audit scenarios where the original structure matters

#![allow(unused)]
fn main() {
let opts = ConversionOptions::for_mode(ConversionMode::Preserve);
let md = mdka::html_to_markdown_with(html, &opts);
}

Choosing a Mode

Is reproducibility the goal?          → Preserve
Are you feeding content to an LLM?    → Minimal  (+drop_interactive_shell)
Is the source a SPA or ARIA-heavy?    → Semantic
Debugging unexpected output?           → Strict
Everything else                        → Balanced  (default)

ConversionOptions

#![allow(unused)]
fn main() {
pub struct ConversionOptions {
    pub mode: ConversionMode,

    // Attribute retention
    pub preserve_ids:             bool,
    pub preserve_classes:         bool,
    pub preserve_data_attrs:      bool,
    pub preserve_aria_attrs:      bool,
    pub preserve_unknown_attrs:   bool,

    // Pre-processing behaviour
    pub drop_presentation_attrs:  bool,
    pub drop_interactive_shell:   bool,
    pub unwrap_unknown_wrappers:  bool,
}
}

ConversionOptions controls every detail of the pre-processing pipeline. You rarely need to set individual fields — start with a mode and override only what differs from the default for that mode.

Creating Options

From a mode (recommended)

#![allow(unused)]
fn main() {
use mdka::options::{ConversionMode, ConversionOptions};

let opts = ConversionOptions::for_mode(ConversionMode::Minimal);
}

for_mode returns sensible defaults for the chosen mode. See the table below.

Modify fields after creation

#![allow(unused)]
fn main() {
let mut opts = ConversionOptions::for_mode(ConversionMode::Balanced);
opts.drop_interactive_shell = true; // also strip nav/header/footer/aside
opts.preserve_ids           = false; // don't keep id= attributes
opts.preserve_aria_attrs    = true;  // (already true in Balanced, shown for clarity)
}

Default

#![allow(unused)]
fn main() {
let opts = ConversionOptions::default(); // equivalent to for_mode(Balanced)
}

Field Defaults by Mode

Field	Balanced	Strict	Minimal	Semantic	Preserve
`preserve_ids`	✅	✅	❌	✅	✅
`preserve_classes`	❌	✅	❌	❌	✅
`preserve_data_attrs`	❌	✅	❌	❌	✅
`preserve_aria_attrs`	✅	✅	❌	✅	✅
`preserve_unknown_attrs`	❌	✅	❌	❌	✅
`drop_presentation_attrs`	✅	❌	✅	✅	❌
`drop_interactive_shell`	❌	❌	✅	❌	❌
`unwrap_unknown_wrappers`	❌	❌	✅	✅	❌

Field Reference

`mode`

The ConversionMode this options object was built from. Changing mode after construction does not re-apply mode defaults to the other fields — use for_mode() again instead.
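
The following sketch shows why reassigning mode alone is not enough. Mode and Options here are simplified stand-ins written for this example (two modes, two flags, values taken from the defaults table above); they are not mdka's actual types.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Balanced,
    Minimal,
}

#[derive(Debug, Clone, PartialEq)]
struct Options {
    mode: Mode,
    preserve_ids: bool,
    unwrap_unknown_wrappers: bool,
}

impl Options {
    /// Presets are applied once, at construction time.
    fn for_mode(mode: Mode) -> Self {
        match mode {
            Mode::Balanced => Options { mode, preserve_ids: true, unwrap_unknown_wrappers: false },
            Mode::Minimal => Options { mode, preserve_ids: false, unwrap_unknown_wrappers: true },
        }
    }
}

/// Changing only the label leaves the Balanced flag values in place.
fn relabel_only() -> Options {
    let mut opts = Options::for_mode(Mode::Balanced);
    opts.mode = Mode::Minimal;
    opts
}
```

relabel_only() still has preserve_ids set to true, whereas Options::for_mode(Mode::Minimal) has it set to false, which is why rebuilding with for_mode() is the safer route.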

`preserve_ids`

Whether to keep id="…" attributes in the pre-processed DOM. Useful when the output is rendered in a context that relies on anchor links (#section-name).

`preserve_classes`

Whether to keep class="…" attributes. Rarely useful in Markdown output, but can help when feeding the Markdown back into an HTML renderer that applies CSS.

`preserve_data_attrs`

Whether to keep data-* custom attributes. Mostly relevant for Strict and Preserve modes.

`preserve_aria_attrs`

Whether to keep aria-* accessibility attributes. Enabled by default in Balanced, Strict, Semantic, and Preserve. The attributes themselves do not appear in Markdown output, but they are used by the Semantic mode’s conversion logic.

`preserve_unknown_attrs`

Whether to keep attributes not otherwise handled (everything except href, src, alt, title, aria-*, data-*, id, class, style).

`drop_presentation_attrs`

Whether to remove style and other purely visual attributes during pre-processing. Enabled in Balanced, Minimal, and Semantic.

`drop_interactive_shell`

Whether to remove <nav>, <header>, <footer>, and <aside> elements and all their children. Useful for content extraction from full web pages. Disabled by default in all modes; opt in explicitly.

`unwrap_unknown_wrappers`

Whether to replace generic container elements (<div>, <span>, <section>, <article>, <main>) with their children when they carry no structural meaning. Enabled in Minimal and Semantic.

Error Handling

MdkaError

#![allow(unused)]
fn main() {
#[derive(Error, Debug)]
pub enum MdkaError {
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
}
}

MdkaError is the only error type in mdka. It has one variant, Io, which wraps a std::io::Error.

IO errors arise from the file-based functions when:

the input file does not exist or is not readable
the output directory cannot be created
the output file cannot be written

Infallible Functions

html_to_markdown and html_to_markdown_with never fail. They accept any string and return a String. Malformed HTML, empty input, binary-looking content, deeply nested structures — none of these cause a panic or an error.

Pattern Matching

#![allow(unused)]
fn main() {
use mdka::{html_file_to_markdown, MdkaError};

match html_file_to_markdown("page.html", None::<&str>) {
    Ok(result)            => println!("→ {}", result.dest.display()),
    Err(MdkaError::Io(e)) => eprintln!("IO error: {e}"),
}
}

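The ? operator works as well, provided the enclosing function returns a Result whose error type can be built from MdkaError:

fn convert() -> Result<(), mdka::MdkaError> {
    let result = mdka::html_file_to_markdown("page.html", None::<&str>)?;
    println!("→ {}", result.dest.display());
    Ok(())
}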

Bulk Conversion Errors

In html_files_to_markdown, each file fails independently. A failed file does not abort the rest of the batch:

#![allow(unused)]
fn main() {
for (src, result) in mdka::html_files_to_markdown(&files, Path::new("out/")) {
    if let Err(e) = result {
        eprintln!("skipped {}: {e}", src);
    }
}
}

Supported HTML Elements

The tables below show every HTML element that mdka recognises and the Markdown it produces. Elements not listed are either removed silently (script, style, etc.) or unwrapped, with their children kept as plain text.

Block Elements

HTML	Markdown output	Notes
`<h1>` – `<h6>`	`#` – `######`	ATX-style headings
`<p>`	Paragraph (blank lines around)
`<blockquote>`	`>` prefix	Nesting produces `> >` , `> > >` , …
`<pre><code>`	Fenced code block ```	Preserves whitespace and newlines
`<ul>`	`-` list	Nested lists indented by 2 spaces
`<ol>`	`1.` list	Respects `start` attribute
`<li>`	List item
`<hr>`	`---`
`<div>`, `<article>`, `<section>`, `<main>`, `<figure>`, `<figcaption>`	Block separator	Act as paragraph breaks; unwrapped in Minimal/Semantic
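
Two of the rows above can be made concrete with a short sketch. It assumes that an ordered list numbers sequentially from its start value and that each blockquote level contributes one "> " prefix; both function names are hypothetical and exist only for illustration.

```rust
/// Render list items as "N. item" lines, counting up from `start`
/// (the value of the HTML `start` attribute, 1 when absent).
fn ordered_list(items: &[&str], start: usize) -> String {
    let mut out = String::new();
    for (i, item) in items.iter().enumerate() {
        out.push_str(&format!("{}. {}\n", start + i, item));
    }
    out
}

/// Line prefix for a blockquote nested `depth` levels deep: "> ", "> > ", ...
fn quote_prefix(depth: usize) -> String {
    "> ".repeat(depth)
}
```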

Inline Elements

HTML	Markdown output	Notes
`<strong>`, `<b>`	`**text**`
`<em>`, `<i>`	`*text*`
`<code>` (inline)	`` `text` ``	Only when not inside `<pre>`
`<a href="…">`	`[text](url)`	`title` attribute → `[text](url "title")`
`<img src="…" alt="…">`	`![alt](src)`	`title` attribute → `![alt](src "title")`
`<br>`	`\n` (trailing two spaces + newline)
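
The link and image rows follow one pattern, with the optional title appended in double quotes. A minimal sketch of that formatting (the function names are made up for this example, and URL or title escaping is left out):

```rust
/// Format an inline link; the title, when present, goes inside the parentheses.
fn link(text: &str, url: &str, title: Option<&str>) -> String {
    match title {
        Some(t) => format!("[{text}]({url} \"{t}\")"),
        None => format!("[{text}]({url})"),
    }
}

/// An image is the same syntax with a leading `!`.
fn image(alt: &str, src: &str, title: Option<&str>) -> String {
    format!("!{}", link(alt, src, title))
}
```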

Code Blocks and Language Hints

When a <code> element has a class containing language-<name>, the language name is included in the fenced block:

<pre><code class="language-rust">fn main() {}</code></pre>

Produces:

```rust
fn main() {}
```

The language-* class is preserved in all conversion modes, including Balanced which otherwise strips class attributes.
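
One plausible way to pull the hint out of a class attribute is sketched below. language_hint is a hypothetical helper, and the choice to take the first non-empty language-* token is an assumption, not documented mdka behaviour.

```rust
/// Return the `<name>` part of the first `language-<name>` class token, if any.
fn language_hint(class_attr: &str) -> Option<&str> {
    class_attr
        .split_whitespace()
        .filter_map(|token| token.strip_prefix("language-"))
        .find(|name| !name.is_empty())
}
```

For class="hljs language-rust" this returns Some("rust"); a class list with no language-* token returns None, which would leave the fence without an info string.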

Always-Removed Elements

These elements and all their descendants are removed unconditionally, regardless of conversion mode:

<script> · <style> · <meta> · <link> · <template> · <iframe> · <object> · <embed> · <noscript>

HTML comments are removed in all modes except Preserve, where they are retained as comment nodes in the pre-processed DOM (though they do not appear in Markdown output).

Shell Elements

<nav>, <header>, <footer>, <aside> are kept by default but can be removed by setting drop_interactive_shell = true or using ConversionMode::Minimal.

Text Processing Rules

mdka applies a small set of deterministic rules to produce consistent, readable Markdown from any HTML text content.

Whitespace Normalisation

HTML text nodes are normalised according to the HTML whitespace collapsing rules:

Leading and trailing whitespace is trimmed from block-level context.
Consecutive whitespace characters (spaces, tabs, newlines) within a text node are collapsed to a single space.
A single space is preserved between adjacent inline elements.
<br> produces a hard line break (two trailing spaces followed by \n).
<pre> blocks are exempt — whitespace inside <pre> is reproduced exactly.

This is done in a single pass without regular expressions, which keeps allocation overhead low.
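
A single-pass collapse without regular expressions can look roughly like the sketch below. It is an illustration of the technique, not mdka's code: collapse_whitespace is a hypothetical name, and it treats only ASCII whitespace as collapsible so that a non-breaking space survives.

```rust
/// Append `text` to `out`, collapsing each whitespace run to one space.
/// Leading whitespace is dropped while `out` is empty; trailing runs are dropped.
fn collapse_whitespace(text: &str, out: &mut String) {
    let mut pending_space = false;
    for ch in text.chars() {
        if matches!(ch, ' ' | '\t' | '\n' | '\r' | '\x0C') {
            // Remember the gap; at most one space is emitted for it later.
            pending_space = true;
        } else {
            if pending_space && !out.is_empty() {
                out.push(' ');
            }
            pending_space = false;
            out.push(ch);
        }
    }
}
```

Because the function writes straight into the caller's String, consecutive text nodes can share one buffer and no intermediate strings are created.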

Markdown Character Escaping

To prevent accidental Markdown formatting, the following characters are escaped with a backslash when they appear in text content that is not inside a code span or code block:

Character	Escaped as	Context
`*`	`\*`	Would create emphasis
`_`	`\_`	Would create emphasis
`` ` ``	`` \` ``	Would start a code span
`#`	`\#`	At the start of a line, would create a heading
`[`	`\[`	Would start a link
`!`	`\!`	Before `[`, would start an image
`\`	`\\`	The escape character itself

Escaping is context-aware: a # in the middle of a line is not escaped, only at the start of a line where it would be interpreted as an ATX heading.
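
The rules in the table can be sketched as one pass over the characters. escape_markdown is a hypothetical function written for this page; in particular, it treats "start of a line" as the first character of the input or the character right after \n, which is a simplifying assumption.

```rust
/// Backslash-escape Markdown-significant characters in plain text.
fn escape_markdown(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut at_line_start = true;
    let mut chars = text.chars().peekable();
    while let Some(ch) = chars.next() {
        let needs_escape = match ch {
            '*' | '_' | '`' | '[' | '\\' => true,
            // A heading marker only matters at the start of a line.
            '#' => at_line_start,
            // `!` is only special when it introduces an image.
            '!' => chars.peek() == Some(&'['),
            _ => false,
        };
        if needs_escape {
            out.push('\\');
        }
        out.push(ch);
        at_line_start = ch == '\n';
    }
    out
}
```

With this sketch, "# Title" becomes "\# Title" while "C# is fine" is left untouched.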

HTML Entity Decoding

HTML entities in text nodes are decoded by the HTML parser (scraper / html5ever) before mdka processes them. The result is already Unicode text:

HTML entity	After parsing	In Markdown
`&amp;`	`&`	`&`
`&lt;`	`<`	`<`
`&gt;`	`>`	`>`
`&nbsp;`	non-breaking space	preserved as space

Output Boundaries

Output always ends with exactly one newline (\n) when the input produces any content; the output is empty for empty input.
Leading blank lines that scraper adds when wrapping content in <html><body> are trimmed before the final string is returned.
Block elements (paragraphs, headings, lists, etc.) are separated by blank lines.
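
The first two boundary rules amount to a small normalisation step at the very end. A minimal sketch, assuming a hypothetical finalize function applied to the raw converted text:

```rust
/// Trim leading blank lines and trailing whitespace, then end with one newline.
/// Input with no visible content yields an empty string.
fn finalize(raw: &str) -> String {
    let trimmed = raw.trim_start_matches('\n').trim_end();
    if trimmed.is_empty() {
        return String::new();
    }
    let mut out = String::with_capacity(trimmed.len() + 1);
    out.push_str(trimmed);
    out.push('\n');
    out
}
```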

Design Philosophy

The Goal: Balance, not Dominance

There are excellent HTML-to-Markdown libraries in the Rust ecosystem — some prioritise raw speed, others maximise conversion fidelity. mdka is not trying to beat them on every axis.

Its aim is a practical balance:

Produce stable, readable Markdown from real-world HTML, with an easy API, without surprising the caller at runtime.

Speed and memory efficiency matter, and mdka is designed with both in mind. But they are means to an end, not the end itself.

Real-World HTML is Messy

Web content rarely arrives as clean, well-formed documents. In practice you encounter:

HTML that a CMS generated and no human ever wrote
SPA-rendered DOM fragments extracted from DevTools
Scraped pages with ad slots, cookie banners, and navigation wrapped around the content
Documents with 5,000 levels of nested <div> elements
Missing closing tags, duplicate attributes, and unknown elements

mdka uses scraper, which is built on html5ever — the same parser used by the Servo browser engine. It applies the HTML5 parsing algorithm, meaning: unknown elements are handled gracefully, missing tags are inferred, and the result is always a well-formed DOM tree, regardless of the input quality.

No Stack Overflows

A common failure mode in tree-processing code is stack overflow on deeply nested input. mdka uses an explicit Vec-based stack (non-recursive DFS) for every tree traversal — both in the pre-processing pipeline and in the Markdown conversion step. This means it handles any nesting depth that fits in heap memory.
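
The technique is easy to show on a toy tree. The sketch below stores nodes in an arena (a Vec indexed by position) and walks them with an explicit Vec stack; Tree and Node are invented for this example and are unrelated to the DOM types mdka actually traverses.

```rust
/// Toy arena tree: children are indices into `nodes`.
struct Node {
    text: Option<String>,
    children: Vec<usize>,
}

struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    /// Build `depth` nested wrapper nodes around a single text leaf.
    fn nested(depth: usize, leaf: &str) -> Tree {
        let mut nodes = Vec::with_capacity(depth + 1);
        for i in 0..depth {
            nodes.push(Node { text: None, children: vec![i + 1] });
        }
        nodes.push(Node { text: Some(leaf.to_string()), children: Vec::new() });
        Tree { nodes }
    }

    /// Pre-order DFS with a heap-allocated stack instead of recursion.
    /// Returns the concatenated text and the deepest level reached.
    fn collect_text(&self) -> (String, usize) {
        let mut out = String::new();
        let mut max_depth = 0;
        let mut stack = vec![(0usize, 0usize)]; // (node index, depth)
        while let Some((idx, depth)) = stack.pop() {
            let node = &self.nodes[idx];
            max_depth = max_depth.max(depth);
            if let Some(text) = &node.text {
                out.push_str(text);
            }
            // Push children in reverse so the leftmost child is visited first.
            for &child in node.children.iter().rev() {
                stack.push((child, depth + 1));
            }
        }
        (out, max_depth)
    }
}
```

Tree::nested(100_000, "deep").collect_text() completes without touching the call stack for depth, since the only thing that grows is the Vec on the heap.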

Configurable Pre-Processing

HTML from different sources needs different treatment. A page scraped from a news site has navigation, advertising, and footer content that a content extraction pipeline wants to remove. A document being archived for audit purposes should retain as much as possible.

The five conversion modes encode these intent differences as named, opinionated presets. They are applied in a pre-processing pass that filters the DOM before Markdown conversion runs — keeping the conversion logic itself simple and mode-agnostic.

One Allocator, Minimal Copies

The conversion pipeline is designed to minimise heap allocations:

Whitespace normalisation is done in a single pass, writing directly into the output String.
No regular expressions are used at runtime (avoiding compiled regex objects).
The output String is pre-allocated with a capacity estimate.
The #[global_allocator] counter in the CLI and benchmarks measures this directly.
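
As a rough illustration of the first and third points, the sketch below collapses whitespace in one pass while writing into a pre-allocated `String`. The helper name `push_normalised` and the exact collapsing rule are assumptions made for this example; mdka's real rules may differ.

```rust
/// Collapses every run of ASCII whitespace into a single space, appending the
/// result to `out` in one pass. (Hypothetical helper, not mdka's API.)
fn push_normalised(out: &mut String, text: &str) {
    let mut pending_space = false;
    for ch in text.chars() {
        if ch.is_ascii_whitespace() {
            // Remember the gap, but emit nothing until real content follows.
            pending_space = true;
        } else {
            if pending_space && !out.is_empty() {
                out.push(' ');
            }
            pending_space = false;
            out.push(ch);
        }
    }
}

fn main() {
    let input = "  Hello,\n\t  world!   ";
    // The output is never longer than the input, so this capacity is enough
    // to avoid any reallocation while pushing.
    let mut out = String::with_capacity(input.len());
    push_normalised(&mut out, input);
    println!("{out:?}");
}
```

No intermediate `String` or `Vec` is built: each character is either skipped or pushed straight into the output buffer.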

Performance Characteristics

The Focus of mdka

The Rust ecosystem offers a variety of excellent HTML-to-Markdown converters. Many of these projects prioritize feature-richness, complex edge-case handling, or high extensibility.

mdka takes a different approach. Our mission is to provide a “minimalist, lightweight, and memory-efficient” converter, specifically optimized for resource-constrained environments or high-concurrency tasks where overhead must be kept to an absolute minimum.

The benchmarks presented here are not intended to rank libraries or declare a “winner.” Instead, they serve as internal metrics to verify whether mdka is successfully meeting its own design goals. We believe in choosing the right tool for the specific job, and we encourage developers to explore the diverse range of libraries available in the ecosystem to find the one that best fits their needs.

The Evolution: v1 to v2

With the release of v2, mdka underwent a complete architectural overhaul. We moved away from the original implementation to a ground-up rewrite focused on:

Stack-Safe Traversal: Implementing a non-recursive Depth-First Search (DFS) to prevent stack overflow even with deeply nested HTML.
Optimized Memory Allocation: Reducing unnecessary clones and leveraging Rust’s ownership model to minimize peak memory usage.
Streamlined Processing: Simplifying the conversion logic to achieve a predictable and lightweight execution path.

This rewrite resulted in a dramatic performance leap and a significantly reduced memory footprint compared to our previous version.

Benchmark Results (2026-04-15)

The following data demonstrates how the v2 architecture has improved our efficiency and how it aligns with our goal of “reasonable speed with minimal resource consumption.”

The figures below are wall-clock medians from Criterion. The log also records outliers for each run, so small differences should be read with some caution.

Conditions

All libraries were benchmarked under the same conditions:
Linux x86_64 6.19, Rust 1.94.1, Criterion 0.8, 28 logical cores, 3 s warm-up, and 3 s measurement.

Libraries Under Test

Library	Version	HTML parser	Approach
mdka	2.0.0	`scraper` (`html5ever`)	Full DOM tree; non-recursive DFS
mdka_v1	1.6.9	`html5ever`	Full DOM tree; older implementation
html2md	0.2.15	`html5ever`	DOM-based converter
fast_html2md	0.0.61	`lol_html`	Streaming rewriter
htmd	0.5.4	`html5ever`	DOM-based converter
html_to_markdown_rs	3.1.0	`html5ever`	DOM-based converter
html2text	0.16.7	`html5ever`	Text-oriented converter
dom_smoothie	0.17.0	`dom_query` (`html5ever`)	DOM-oriented converter

These libraries differ in design, approach, and goals.

Conversion Speed

Dataset	mdka v2	mdka v1	html2md	fast_html2md	htmd	html_to_markdown_rs	html2text	dom_smoothie
small	131.52 µs	131.66 µs	132.21 µs	79.50 µs	90.47 µs	107.82 µs	350.92 µs	317.37 µs
medium	1.3040 ms	2.2866 ms	1.5266 ms	887.59 µs	1.0562 ms	1.1660 ms	3.3999 ms	2.7643 ms
large	12.336 ms	75.751 ms	12.455 ms	7.0399 ms	7.7896 ms	9.6825 ms	29.854 ms	26.062 ms
deep_nest	32.620 ms	373.10 ms	36.834 ms	5.9868 ms	72.481 ms	96.744 ms	30.903 ms	29.408 ms
flat	5.6253 ms	24.817 ms	6.7911 ms	4.2114 ms	5.5321 ms	4.6975 ms	14.023 ms	29.408 ms
malformed	31.712 µs	40.178 µs	71.778 µs	52.948 µs	62.302 µs	41.109 µs	96.822 µs	5.6401 ms

mdka v2 is ahead of mdka v1 in this run. On the smallest input the two are effectively tied (131.52 µs vs 131.66 µs), but the gap widens as the input gets larger or structurally harder: around 1.75× faster on medium, 6.1× on large, 11.4× on deep_nest, and 4.4× on flat. On malformed input, v2 is about 1.27× faster than v1 and is the fastest of all the libraries tested.
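
The v1-to-v2 ratios can be reproduced by dividing the v1 median by the v2 median for each dataset. The short program below does that with the values copied from the Conversion Speed table, converted to milliseconds; the `speedup` helper is just a name chosen for this example.

```rust
/// Speed-up of v2 over v1: how many times longer v1 took on the same dataset.
fn speedup(v1_ms: f64, v2_ms: f64) -> f64 {
    v1_ms / v2_ms
}

fn main() {
    // (dataset, mdka v1 median, mdka v2 median), all in milliseconds,
    // copied from the Conversion Speed table.
    let rows = [
        ("small", 0.13166, 0.13152),
        ("medium", 2.2866, 1.3040),
        ("large", 75.751, 12.336),
        ("deep_nest", 373.10, 32.620),
        ("flat", 24.817, 5.6253),
        ("malformed", 0.040178, 0.031712),
    ];
    for (name, v1, v2) in rows {
        println!("{name:>10}: {:.2}x", speedup(v1, v2));
    }
}
```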

Memory Allocation

Dataset	mdka v2	mdka_v1	html2md	fast_html2md	htmd	html_to_markdown_rs	html2text	dom_smoothie
small	113.5 KB	240 KB	231 KB	154 KB	93.6 KB	232.5 KB	764.5 KB	325.4 KB
medium	984.6 KB	2.03 MB	1.95 MB	1.52 MB	1.01 MB	1.95 MB	8.50 MB	2.85 MB
large	8.00 MB	17.0 MB	16.76 MB	11.98 MB	7.85 MB	16.76 MB	74.89 MB	23.08 MB
deep_nest	3.00 MB	4.71 MB	2.55 MB	6.85 MB	1.96 MB	2.55 MB	18.48 MB	—
flat	3.93 MB	7.90 MB	7.87 MB	7.46 MB	4.84 MB	7.87 MB	40.28 MB	35.47 MB
malformed	44.7 KB	91.6 KB	71.4 KB	145 KB	62.3 KB	71.4 KB	464.4 KB	1.63 MB

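In this run, mdka v2 allocates less heap than v1 on every dataset: roughly half as much on most inputs, and about 64% as much on deep_nest (3.00 MB vs 4.71 MB).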

Summary

As shown in the results, the transition to v2 has allowed us to achieve our objectives of being lightweight and memory-efficient while maintaining competitive speed.

We recognize that other libraries may offer more features or different trade-offs that make them better suited for certain applications. mdka aims to be the best choice for those who prioritize a simple, “Unix-style” tool that does one thing—conversion—with the smallest possible footprint.

Architecture

Workspace Layout

mdka/
├── src/               mdka library crate (lib only)
│   ├── lib.rs             Public API surface
│   ├── options.rs         ConversionMode, ConversionOptions
│   ├── traversal.rs       Markdown conversion traversal
│   ├── renderer.rs        MarkdownRenderer state machine
│   ├── utils.rs           Whitespace normalisation + escaping
│   └── alloc_counter.rs   Custom allocator (for benchmarks)
├── tests/             integration test modules
│   └── utils/preprocessor.rs    DOM pre-processing pipeline
├── cli/               mdka-cli binary crate
│   └── src/main.rs        Argument parsing + dispatch
├── node/              Node.js bindings (napi-rs v3)
├── python/            Python bindings (PyO3 v0)
├── benches/           criterion benchmarks
└── examples/          Allocation measurement tool

Conversion Pipeline

Each call to html_to_markdown_with follows these steps:

HTML string
    │
    ▼
[1] Parse          scraper::Html::parse_document()
    │               → html5ever DOM tree (tolerant HTML5 parsing)
    ▼
[2] Pre-process    preprocessor::preprocess(&doc, opts)
    │               → filtered HTML string
    │               Non-recursive DFS over ego-tree nodes
    │               Drops: script, style, iframe, …
    │               Filters attributes per ConversionOptions
    │               Removes shell elements (if opted in)
    │               Unwraps anonymous wrappers (if opted in)
    ▼
[3] Re-parse       scraper::Html::parse_document(&cleaned)
    │               → clean DOM for conversion
    ▼
[4] Convert        traversal::traverse(&doc)
    │               → Markdown string
    │               Non-recursive DFS with Enter/Leave events
    │               Drives MarkdownRenderer via event callbacks
    ▼
[5] Finalise       renderer.finish()
                    → trim leading/trailing whitespace
                    → ensure single trailing newline
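
Step [4] relies on Enter/Leave events produced by an iterative walk. The sketch below shows one common way to generate such events with a single explicit stack. The `Node` and `Event` types and this `traverse` signature are hypothetical stand-ins for illustration, not the contents of traversal.rs.

```rust
/// Arena-backed element tree (hypothetical stand-in for the parsed DOM).
struct Node {
    tag: &'static str,
    children: Vec<usize>,
}

/// Traversal events: `Enter` fires before a node's children, `Leave` after them.
enum Event {
    Enter(usize),
    Leave(usize),
}

/// Walks the tree iteratively and records the order in which events fire.
fn traverse(nodes: &[Node], root: usize) -> Vec<String> {
    let mut log = Vec::new();
    let mut stack = vec![Event::Enter(root)];
    while let Some(event) = stack.pop() {
        match event {
            Event::Enter(id) => {
                log.push(format!("<{}>", nodes[id].tag));
                // Schedule the matching Leave first so it pops after all children.
                stack.push(Event::Leave(id));
                for &child in nodes[id].children.iter().rev() {
                    stack.push(Event::Enter(child));
                }
            }
            Event::Leave(id) => log.push(format!("</{}>", nodes[id].tag)),
        }
    }
    log
}

fn main() {
    // Represents <ul><li><em></em></li><li></li></ul>
    let nodes = vec![
        Node { tag: "ul", children: vec![1, 3] },
        Node { tag: "li", children: vec![2] },
        Node { tag: "em", children: vec![] },
        Node { tag: "li", children: vec![] },
    ];
    println!("{}", traverse(&nodes, 0).join(""));
}
```

A renderer driven this way can open a construct on `Enter` and close it on `Leave` without the traversal itself ever recursing.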

MarkdownRenderer

MarkdownRenderer is a state machine that maintains:

output: the accumulated Markdown string
list_stack: tracks nested ordered/unordered lists
blockquote_depth: counts blockquote nesting level
in_pre: whether inside a <pre> block
at_line_start: deferred prefix flag for blockquote > emission
newlines_emitted: prevents double-blank-line accumulation

The at_line_start flag is key: rather than emitting > prefixes immediately when entering a blockquote, the renderer defers them until actual content is written. This ensures nested blockquotes emit the correct number of > characters regardless of how many block elements intervene.
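
The deferral can be illustrated with a stripped-down renderer that handles nothing but blockquote prefixes. The field names follow the list above; the struct name `Renderer` and every method on it are hypothetical and chosen only for this sketch.

```rust
/// Minimal renderer sketch covering only blockquote prefixes.
struct Renderer {
    output: String,
    blockquote_depth: usize,
    at_line_start: bool,
}

impl Renderer {
    fn new() -> Self {
        Renderer { output: String::new(), blockquote_depth: 0, at_line_start: true }
    }

    fn enter_blockquote(&mut self) {
        // Only the depth changes here; no "> " is written yet.
        self.blockquote_depth += 1;
    }

    fn leave_blockquote(&mut self) {
        self.blockquote_depth = self.blockquote_depth.saturating_sub(1);
    }

    /// Writes inline text, emitting the deferred prefixes if a line is starting.
    fn write_text(&mut self, text: &str) {
        if self.at_line_start {
            for _ in 0..self.blockquote_depth {
                self.output.push_str("> ");
            }
            self.at_line_start = false;
        }
        self.output.push_str(text);
    }

    /// Ends the current line; the next write will re-emit the prefixes.
    fn newline(&mut self) {
        self.output.push('\n');
        self.at_line_start = true;
    }
}

fn main() {
    let mut r = Renderer::new();
    r.enter_blockquote();
    r.write_text("outer");
    r.newline();
    r.enter_blockquote();
    r.write_text("inner ");
    r.write_text("text");
    r.newline();
    r.leave_blockquote();
    r.leave_blockquote();
    r.write_text("after");
    r.newline();
    print!("{}", r.output);
}
```

Because the prefix count is read at write time, the depth in effect when content actually arrives is the one that ends up in the output.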

Language Bindings

Both the Node.js and Python bindings are thin wrappers:

Node.js (napi-rs): exposes sync and async (tokio::task::spawn_blocking) variants. The async variants run the conversion off the Node.js event loop, so the loop stays free during conversion.
Python (PyO3): calls py.detach() in the batch function html_to_markdown_many, releasing the GIL for rayon parallel conversion.

The binding crates (mdka-node, mdka-python) have no conversion logic of their own — they call the same Rust functions as the library and CLI.

Features

Crash Resistance

mdka uses non-recursive DFS traversal throughout. An explicit Vec stack replaces the call stack, so documents with arbitrarily deep nesting will not cause a stack overflow. This has been tested with 10,000 levels of nested <div> elements.

Some fast converters use recursive tree traversal and can crash on deeply nested input. If your input source is not fully controlled, crash resistance matters.

Five Conversion Modes

Rather than a single fixed conversion strategy, mdka offers five named modes that tune the pre-processing pipeline:

Balanced — readable output for general use
Strict — maximum attribute retention for debugging
Minimal — body text only; good for LLM input preparation
Semantic — preserves ARIA and document structure
Preserve — maximum fidelity for archiving

Each mode can be further customised with per-call option flags. See Conversion Modes and ConversionOptions.

Parallel File Conversion

html_files_to_markdown and html_files_to_markdown_with use rayon to convert multiple files in parallel. Each file’s result is independent — one failed file does not stop the batch.
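
The "one result per file" contract can be sketched with the standard library alone. This example deliberately swaps rayon for `std::thread::scope` (one thread per file, no pool) to stay dependency-free. The `convert` placeholder, the `convert_files` function, and its return type are all assumptions for illustration, not mdka's actual signatures.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::thread;

/// Placeholder for the real conversion step (hypothetical; not mdka's API).
fn convert(html: &str) -> String {
    html.trim().to_string()
}

/// Reads and converts every path on its own thread, keeping one `Result` per file.
fn convert_files(paths: &[PathBuf]) -> Vec<(PathBuf, io::Result<String>)> {
    thread::scope(|scope| {
        // Spawn all workers first so the files are processed concurrently.
        let handles: Vec<_> = paths
            .iter()
            .map(|path| scope.spawn(move || fs::read_to_string(path).map(|html| convert(&html))))
            .collect();
        // Pair each path with its own outcome; an Err stays local to that entry.
        paths
            .iter()
            .cloned()
            .zip(handles.into_iter().map(|h| h.join().expect("worker panicked")))
            .collect()
    })
}

fn main() {
    let good = std::env::temp_dir().join("mdka_sketch_ok.html");
    fs::write(&good, "<p>hello</p>\n").expect("temp dir should be writable");
    let missing = PathBuf::from("/definitely/not/here.html");

    for (path, result) in convert_files(&[missing, good.clone()]) {
        match result {
            Ok(md) => println!("{}: {md}", path.display()),
            Err(err) => println!("{}: failed ({err})", path.display()),
        }
    }
    let _ = fs::remove_file(good);
}
```

The missing file yields an `Err` in its own slot while the readable file is still converted, which is the behaviour described above.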

The Node.js and Python bindings expose this as an async function (htmlFilesToMarkdown, html_files_to_markdown) so the thread pool work does not block the caller’s event loop or hold the GIL.

Multi-Language API

The same Rust implementation is accessible from three languages:

Language	Package	Mechanism
Rust	`mdka` on crates.io	native library
Node.js	`mdka` on npm	napi-rs native module
Python	`mdka` on PyPI	PyO3 extension module

All three call the same underlying conversion code and produce identical output for identical input.

html5ever Parser Foundation

The HTML parser is scraper, which is built on html5ever. html5ever implements the HTML5 parsing algorithm, the same one that web browsers use.

This means:

Missing closing tags are inferred correctly
Unknown elements are preserved (not silently dropped)
Malformed attribute syntax is normalised
The result is always a valid DOM tree, no matter the input

Predictable, Deterministic Output

For a given HTML input and ConversionOptions, mdka always produces the same Markdown string. There is no randomisation, no date-stamping, and no version-dependent output variation within a semver major version.

Minimal Dependencies

The runtime dependencies of the mdka library crate are:

Crate	Purpose
`scraper`	HTML parsing (html5ever wrapper)
`ego-tree`	DOM tree traversal
`rayon`	Parallel file conversion
`tikv-jemallocator`, `tikv-jemalloc-ctl`	jemalloc allocator and its control interface, used to limit fragmentation and scale under concurrency
`thiserror`	`MdkaError` derive macro

Benchmark and comparison dependencies (criterion, competitors) are [dev-dependencies] and do not affect library consumers.
