mdka
mdka is an HTML-to-Markdown converter written in Rust. The “ka” comes from the Japanese “化 (か)”, a suffix that denotes conversion.
It aims to strike a practical balance between conversion quality and runtime efficiency — readable output from real-world HTML, without sacrificing speed or memory.
At a Glance
| What you give it | What you get back |
|---|---|
| Any HTML string — a full page, a snippet, CMS output, SPA-rendered DOM | Clean, readable Markdown |
| A list of HTML files | Parallel Markdown output via rayon |
| A conversion mode (minimal, semantic, …) | Pre-processed output tuned for your use case |
Key Properties
- Parser foundation: scraper, which is built on html5ever — the same battle-tested parser used by the Servo browser engine. It handles malformed, deeply-nested, and real-world HTML gracefully.
- Crash-resistant: a non-recursive DFS traversal means even 10,000 levels of nesting will not overflow the stack.
- Configurable: five conversion modes let you tune the pre-processing pipeline — from noise-free LLM input to lossless archiving.
- Multi-language: available as a Rust library, a Node.js package (napi-rs), and a Python package (PyO3).
When to Choose mdka
mdka is a good fit if you need:
- Stable, predictable output from diverse HTML sources (CMS, SPA, scraped pages)
- Mode-based pre-processing to strip navigation, preserve ARIA, or retain attributes
- Memory efficiency at scale (bulk file conversion, streaming pipelines)
- Multi-language access from a single underlying Rust implementation
If raw speed on simple, well-formed HTML is the only concern, a streaming rewriter will be faster.
Quick Navigation
- New to mdka? Start with Installation.
- Ready to integrate? Jump to Usage & Examples.
- Evaluating? Read Design Philosophy and Performance Characteristics.
Installation
As a Rust Library
Add mdka to your Cargo.toml:
[dependencies]
mdka = "2"
That is the only step. mdka has no system dependencies.
Minimum Supported Rust Version: 1.85 (2024 Edition)
As a CLI Binary
Build from source using the mdka-cli crate in the workspace:
git clone https://github.com/example/mdka
cd mdka
cargo build --release -p mdka-cli
# Binary: ./target/release/mdka
Or install directly with cargo:
cargo install mdka-cli
As a Node.js Package
npm install mdka
# or
yarn add mdka
Requires Node.js 16 or later.
Pre-built binaries are bundled for the major platforms (Linux, macOS, and Windows) on selected architectures.
On other platforms, run npm run build with Rust installed.
As a Python Package
pip install mdka
Requires Python 3.8 or later.
Pre-built wheels are provided for CPython on major platforms.
To build from source: pip install mdka --no-binary mdka with Rust installed.
Usage & Examples
Choose the section for your environment:
- Rust — integrate directly into a Rust project
- Node.js — use from JavaScript or TypeScript
- Python — use from Python
- CLI — use from the command line
All four share the same underlying conversion engine, so results are consistent across languages.
Usage — Rust
Basic Conversion
use mdka::html_to_markdown;
fn main() {
let html = r#"
<h1>Getting Started</h1>
<p>mdka converts <strong>HTML</strong> to <em>Markdown</em>.</p>
<ul>
<li>Fast</li>
<li>Configurable</li>
<li>Crash-resistant</li>
</ul>
"#;
let md = html_to_markdown(html);
println!("{md}");
}
Output:
# Getting Started
mdka converts **HTML** to *Markdown*.
- Fast
- Configurable
- Crash-resistant
Conversion with Options
Use html_to_markdown_with to control the conversion pipeline via
ConversionOptions.
use mdka::html_to_markdown_with;
use mdka::options::{ConversionMode, ConversionOptions};

fn main() {
    // Strip navigation and extract body text: good for LLM input
    let mut opts = ConversionOptions::for_mode(ConversionMode::Minimal);
    opts.drop_interactive_shell = true;

    let html = r#"
        <header><nav><a href="/">Home</a></nav></header>
        <main>
          <article>
            <h1>Article Title</h1>
            <p>The main content of the page.</p>
          </article>
        </main>
        <footer>Copyright 2025</footer>
    "#;

    let md = html_to_markdown_with(html, &opts);
    assert!(md.contains("# Article Title"));
    assert!(!md.contains("Home")); // nav removed
    assert!(!md.contains("Copyright")); // footer removed
}
Converting a Single File
use mdka::{html_file_to_markdown, MdkaError};

// `main` returns a Result so that `?` can propagate IO errors.
fn main() -> Result<(), MdkaError> {
    // Output goes to the same directory as the input: page.html → page.md
    let result = html_file_to_markdown("page.html", None::<&str>)?;
    println!("{} → {}", result.src.display(), result.dest.display());

    // Output goes to a specific directory
    let result = html_file_to_markdown("page.html", Some("out/"))?;
    println!("{} → {}", result.src.display(), result.dest.display());
    Ok(())
}
Bulk Parallel Conversion
use mdka::html_files_to_markdown;
use std::path::Path;

// `main` returns io::Result so that `?` works on create_dir_all.
fn main() -> std::io::Result<()> {
    let files = vec!["a.html", "b.html", "c.html"];
    let out_dir = Path::new("out/");
    // The bulk functions expect the output directory to exist already.
    std::fs::create_dir_all(out_dir)?;

    for (src, result) in html_files_to_markdown(&files, out_dir) {
        match result {
            Ok(dest) => println!("{} → {}", src, dest.display()),
            Err(e) => eprintln!("Error: {src}: {e}"),
        }
    }
    Ok(())
}
Conversion runs in parallel using rayon. The number of threads defaults to the number of logical CPU cores.
Bulk Conversion with Options
use mdka::html_files_to_markdown_with;
use mdka::options::{ConversionMode, ConversionOptions};
use std::path::Path;

fn main() {
    let opts = ConversionOptions::for_mode(ConversionMode::Semantic);
    let files = vec!["a.html", "b.html"];
    // `out/` must already exist; each entry pairs an input path with its own Result.
    let results = html_files_to_markdown_with(&files, Path::new("out/"), &opts);
    println!("processed {} files", results.len());
}
Conversion Modes at a Glance
| Mode | Best for |
|---|---|
| Balanced | General use; default |
| Strict | Debugging, diff comparison |
| Minimal | LLM pre-processing, compression |
| Semantic | SPA content, accessibility-aware output |
| Preserve | Archiving, audit trails |
See Conversion Modes for full details.
Error Handling
use mdka::{html_file_to_markdown, MdkaError};

fn main() {
    match html_file_to_markdown("missing.html", None::<&str>) {
        Ok(result) => println!("→ {}", result.dest.display()),
        Err(MdkaError::Io(e)) => eprintln!("IO error: {e}"),
    }
}
MdkaError currently has one variant: Io, wrapping std::io::Error.
html_to_markdown and html_to_markdown_with are infallible — they always
return a String and never panic on any input, no matter how malformed.
Usage — Node.js
Installation
npm install mdka
Basic Conversion
const { htmlToMarkdown } = require('mdka')
const html = `
<h1>Hello</h1>
<p>mdka converts <strong>HTML</strong> to <em>Markdown</em>.</p>
`
const md = htmlToMarkdown(html)
console.log(md)
// # Hello
//
// mdka converts **HTML** to *Markdown*.
Async Conversion
htmlToMarkdownAsync offloads work to a Rust thread pool, keeping the
Node.js event loop free:
const { htmlToMarkdownAsync } = require('mdka')
const md = await htmlToMarkdownAsync(html)
// Concurrent conversion of many pages
const results = await Promise.all(pages.map(p => htmlToMarkdownAsync(p.html)))
Conversion with Options
const { htmlToMarkdownWith, htmlToMarkdownWithAsync } = require('mdka')
// Strip nav/header/footer: useful for content extraction
const minimalMd = htmlToMarkdownWith(html, {
  mode: 'minimal',
  dropInteractiveShell: true,
})
// Async version (distinct name, since `const` cannot be redeclared)
const semanticMd = await htmlToMarkdownWithAsync(html, { mode: 'semantic' })
Available mode strings: "balanced" (default), "strict", "minimal",
"semantic", "preserve".
Single File Conversion
const { htmlFileToMarkdown, htmlFileToMarkdownWith } = require('mdka')
// Output to same directory: page.html → page.md
const sameDir = await htmlFileToMarkdown('page.html')
console.log(`${sameDir.src} → ${sameDir.dest}`)
// Output to specific directory
const toOutDir = await htmlFileToMarkdown('page.html', 'out/')
// With options
const withOptions = await htmlFileToMarkdownWith('page.html', 'out/', {
  mode: 'minimal',
  dropInteractiveShell: true,
})
Bulk Parallel Conversion
const { htmlFilesToMarkdown, htmlFilesToMarkdownWith } = require('mdka')
const files = ['a.html', 'b.html', 'c.html']
const results = await htmlFilesToMarkdown(files, 'out/')
for (const r of results) {
  if (r.error) console.error(`${r.src}: ${r.error}`)
  else console.log(`${r.src} → ${r.dest}`)
}
// With options (distinct name, since `const` cannot be redeclared)
const semanticResults = await htmlFilesToMarkdownWith(files, 'out/', {
  mode: 'semantic',
  preserveAriaAttrs: true,
})
TypeScript
Type definitions are bundled. No @types/ package is needed:
import {
htmlToMarkdown,
htmlToMarkdownWith,
htmlToMarkdownAsync,
htmlFileToMarkdown,
htmlFilesToMarkdown,
ConversionOptions,
ConvertResult,
} from 'mdka'
const opts: ConversionOptions = {
mode: 'minimal',
dropInteractiveShell: true,
}
const md: string = htmlToMarkdownWith(html, opts)
Usage — Python
Installation
pip install mdka
Basic Conversion
import mdka
html = """
<h1>Hello</h1>
<p>mdka converts <strong>HTML</strong> to <em>Markdown</em>.</p>
"""
md = mdka.html_to_markdown(html)
print(md)
# # Hello
#
# mdka converts **HTML** to *Markdown*.
Conversion with Options
import mdka
# Strip nav/header/footer — useful for LLM pre-processing
md = mdka.html_to_markdown_with(
html,
mode=mdka.ConversionMode.Minimal,
drop_interactive_shell=True,
)
# Preserve ARIA attributes for accessibility-aware output
md = mdka.html_to_markdown_with(
html,
mode=mdka.ConversionMode.Semantic,
preserve_aria_attrs=True,
)
Available modes: ConversionMode.Balanced (default), Strict, Minimal,
Semantic, Preserve.
Parallel Batch Conversion (GIL released)
html_to_markdown_many releases the GIL and uses rayon for parallel conversion:
import mdka
pages = ["<h1>A</h1>", "<p>B</p>", "<ul><li>C</li></ul>"]
results = mdka.html_to_markdown_many(pages)
# ['# A\n', 'B\n', '- C\n']
This is faster than calling html_to_markdown in a Python loop for large batches.
Single File Conversion
import mdka
# Output to same directory: page.html → page.md
result = mdka.html_file_to_markdown("page.html")
print(f"{result.src} → {result.dest}")
# Output to a specific directory
result = mdka.html_file_to_markdown("page.html", "out/")
# With options
result = mdka.html_file_to_markdown(
"page.html",
"out/",
mode=mdka.ConversionMode.Minimal,
drop_interactive_shell=True,
)
Bulk File Conversion
import mdka
files = ["a.html", "b.html", "c.html"]
results = mdka.html_files_to_markdown(files, "out/")
for r in results:
    if r.ok:
        print(f"{r.src} → {r.dest}")
    else:
        print(f"Error: {r.src}: {r.error}")
Error Handling
import mdka
try:
    result = mdka.html_file_to_markdown("missing.html")
except mdka.MdkaError as e:
    print(f"Conversion failed: {e}")
MdkaError is raised for IO errors (file not found, permission denied, etc.).
html_to_markdown and html_to_markdown_with are always safe to call — they
never raise exceptions regardless of input quality.
Type Annotations
mdka ships with a py.typed marker (PEP 561). All public symbols are annotated:
from mdka import (
html_to_markdown, # (html: str) -> str
html_to_markdown_with, # (html: str, mode=..., **flags) -> str
html_to_markdown_many, # (html_list: list[str]) -> list[str]
html_file_to_markdown, # (path, out_dir=None, ...) -> ConvertResult
html_files_to_markdown, # (paths, out_dir, ...) -> list[BulkConvertResult]
ConversionMode, # enum
ConvertResult, # dataclass: src, dest (str)
BulkConvertResult, # dataclass: src, dest?, error?, ok
MdkaError, # exception
)
Usage — CLI
The mdka command-line tool is provided by the mdka-cli crate.
Quick Reference
mdka [OPTIONS] [FILE...]
Run mdka --help to see the full option list with descriptions.
Common Patterns
Convert from stdin:
echo '<h1>Hello</h1>' | mdka
curl https://example.com | mdka
Convert a single file (output goes to the same directory):
mdka page.html # → page.md
Convert to a specific directory:
mdka -o out/ page.html # → out/page.md
Bulk conversion (-o is required for multiple files):
mdka -o out/ docs/*.html
Choose a conversion mode:
mdka --mode minimal --drop-shell page.html # extract body text
mdka --mode preserve -o archive/ *.html # maximum fidelity
All Options
| Flag | Description |
|---|---|
| -o, --output <DIR> | Output directory (default: same as input) |
| -m, --mode <MODE> | balanced · strict · minimal · semantic · preserve |
| --preserve-ids | Keep id attributes |
| --preserve-classes | Keep class attributes |
| --preserve-data | Keep data-* attributes |
| --preserve-aria | Keep aria-* attributes |
| --drop-shell | Remove nav, header, footer, aside |
| -h, --help | Show help |
For full mode descriptions see Conversion Modes.
API Reference
mdka exposes a small, focused public API. The table below shows the complete surface — every function and type you need, nothing you don’t.
Functions
| Function | Language | Description |
|---|---|---|
| html_to_markdown | Rust | Convert HTML string → Markdown (default mode) |
| html_to_markdown_with | Rust | Convert with explicit ConversionOptions |
| html_file_to_markdown | Rust | Convert one file; output alongside input or to out_dir |
| html_file_to_markdown_with | Rust | Single file with options |
| html_files_to_markdown | Rust | Parallel bulk conversion (rayon) |
| html_files_to_markdown_with | Rust | Bulk with options |
Types
| Type | Description |
|---|---|
| ConversionMode | Enum: Balanced · Strict · Minimal · Semantic · Preserve |
| ConversionOptions | Controls pre-processing per-call; built via for_mode() |
| ConvertResult | Returned by single-file functions: src + dest paths |
| MdkaError | The only error type: wraps std::io::Error |
Guarantees
- html_to_markdown and html_to_markdown_with never panic. They accept any &str, including empty strings, binary garbage, or deeply nested HTML.
- File functions propagate IO errors via Result<_, MdkaError>.
- Output is always valid UTF-8.
- Output always ends with a single newline when the input produces any content.
Core Functions
html_to_markdown
pub fn html_to_markdown(html: &str) -> String
Converts an HTML string to Markdown using the default Balanced mode.
Input: Any valid or malformed HTML string. Empty strings are accepted.
Output: A Markdown string. Always ends with \n if the input produced any content.
Errors: None — this function is infallible.
let md = mdka::html_to_markdown("<h1>Hello</h1>");
assert_eq!(md, "# Hello\n");
html_to_markdown_with
pub fn html_to_markdown_with(html: &str, opts: &ConversionOptions) -> String
Same as html_to_markdown, but accepts a ConversionOptions
value that controls pre-processing and conversion behaviour.
Input: Any HTML string + a ConversionOptions value.
Output: Markdown string.
Errors: None.
use mdka::options::{ConversionMode, ConversionOptions};

let mut opts = ConversionOptions::for_mode(ConversionMode::Minimal);
opts.drop_interactive_shell = true;
let md = mdka::html_to_markdown_with(html, &opts);
html_file_to_markdown
pub fn html_file_to_markdown(
    path: impl AsRef<Path>,
    out_dir: Option<impl AsRef<Path>>,
) -> Result<ConvertResult, MdkaError>
Reads one HTML file, converts it, and writes a .md file.
- path: Path to the input .html file.
- out_dir:
  - None → the .md file is written alongside the input (same directory, stem unchanged).
  - Some(dir) → the .md file is written into dir. The directory is created automatically if it does not exist.
Returns: ConvertResult with the resolved src and dest paths.
Errors: MdkaError::Io if the file cannot be read or the output cannot be written.
// `main` returns a Result so that `?` can propagate IO errors.
fn main() -> Result<(), mdka::MdkaError> {
    // page.html → page.md in the same folder
    let r = mdka::html_file_to_markdown("page.html", None::<&str>)?;
    println!("{} → {}", r.src.display(), r.dest.display());

    // page.html → out/page.md
    let r = mdka::html_file_to_markdown("page.html", Some("out/"))?;
    println!("{} → {}", r.src.display(), r.dest.display());
    Ok(())
}
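The out_dir rule described above can be sketched with the standard library alone. This is an illustration of the path arithmetic, not mdka's code: `markdown_dest` is a hypothetical helper, and it leaves out the directory creation and file IO that the real function performs.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: where would the `.md` file for `src` be written?
/// `None` keeps the output next to the input; `Some(dir)` moves it into `dir`.
fn markdown_dest(src: &Path, out_dir: Option<&Path>) -> PathBuf {
    // Same stem, `.md` extension: page.html → page.md
    let beside_input = src.with_extension("md");
    match (out_dir, beside_input.file_name()) {
        // Keep only the file name and place it inside the output directory.
        (Some(dir), Some(name)) => dir.join(name),
        // No output directory (or no file name to move): stay beside the input.
        _ => beside_input,
    }
}
```

With this sketch, `docs/page.html` maps to `docs/page.md` when no directory is given, and to `out/page.md` when `out/` is passed.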
html_file_to_markdown_with
pub fn html_file_to_markdown_with(
    path: impl AsRef<Path>,
    out_dir: Option<impl AsRef<Path>>,
    opts: &ConversionOptions,
) -> Result<ConvertResult, MdkaError>
Same as html_file_to_markdown, but applies the given ConversionOptions.
html_files_to_markdown
pub fn html_files_to_markdown<'a, P>(
    paths: &'a [P],
    out_dir: &Path,
) -> Vec<(&'a P, Result<PathBuf, MdkaError>)>
where
    P: AsRef<Path> + Sync,
Converts multiple HTML files in parallel using rayon.
- paths: Slice of paths to input HTML files.
- out_dir: Directory for all output .md files. Must exist before calling (unlike single-file variants which create it automatically).
Returns: A Vec of (input_path, Result<output_path, error>) pairs in the same order as paths. Each element represents the outcome for one file independently.
use std::path::Path;

// `main` returns io::Result so that `?` works on create_dir_all.
fn main() -> std::io::Result<()> {
    let files = vec!["a.html", "b.html", "c.html"];
    std::fs::create_dir_all("out/")?;
    for (src, result) in mdka::html_files_to_markdown(&files, Path::new("out/")) {
        match result {
            Ok(dest) => println!("{} → {}", src, dest.display()),
            Err(e) => eprintln!("{src}: {e}"),
        }
    }
    Ok(())
}
html_files_to_markdown_with
pub fn html_files_to_markdown_with<'a, P>(
    paths: &'a [P],
    out_dir: &Path,
    opts: &ConversionOptions,
) -> Vec<(&'a P, Result<PathBuf, MdkaError>)>
where
    P: AsRef<Path> + Sync,
Same as html_files_to_markdown, but applies the given ConversionOptions to every file.
ConvertResult
pub struct ConvertResult {
    pub src: PathBuf,
    pub dest: PathBuf,
}
Returned by the single-file functions. Both fields are absolute or relative paths
depending on how path was passed in.
Note: The bulk functions (html_files_to_markdown*) return (&P, Result<PathBuf, MdkaError>) tuples rather than ConvertResult, because individual files within a batch may fail independently.
Conversion Modes
A conversion mode determines how mdka pre-processes the parsed DOM before converting to Markdown. Choose the mode that matches the origin and purpose of your HTML.
Overview
| Mode | Best for | Default? |
|---|---|---|
| Balanced | General use, blog posts, documentation pages | ✅ Yes |
| Strict | Debugging, comparing before/after, diff-friendly output | |
| Minimal | LLM pre-processing, text extraction, compression | |
| Semantic | SPA output, accessibility-aware pipelines, screen-reader content | |
| Preserve | Archiving, audit trails, round-trip fidelity | |
Balanced (default)
Goal: Produce clean, readable Markdown without losing meaningful content.
- Removes decorative attributes: class, style, data-*
- Keeps semantic attributes: href, src, alt, aria-*, lang, dir
- Keeps id attributes (useful for anchor links)
- Does not remove navigation or structural elements
Use when: You want good-looking output without extra configuration.
let md = mdka::html_to_markdown(html); // Balanced is the default
Strict
Goal: Preserve as much of the original HTML information as possible. Output may be noisier, but nothing is silently dropped.
- Keeps class, data-*, id, aria-*, and most other attributes
- Does not unwrap wrapper elements
- Suitable for comparing two versions of a page, or for debugging unexpected output from other modes
use mdka::options::{ConversionMode, ConversionOptions};

let opts = ConversionOptions::for_mode(ConversionMode::Strict);
let md = mdka::html_to_markdown_with(html, &opts);
Minimal
Goal: Extract the body text and essential structure; discard everything else.
- Removes all decorative attributes (class, style, data-*, aria-*)
- Optionally removes shell elements (nav, header, footer, aside) when drop_interactive_shell is true
- Unwraps generic wrappers (div, span, section, article) that add no meaning
- Ideal for piping content into an LLM prompt or a search index
use mdka::options::{ConversionMode, ConversionOptions};

let mut opts = ConversionOptions::for_mode(ConversionMode::Minimal);
opts.drop_interactive_shell = true;
let md = mdka::html_to_markdown_with(html, &opts);
Semantic
Goal: Preserve document meaning and accessibility structure over visual appearance.
- Strongly retains aria-* attributes
- Retains lang and dir
- Retains heading hierarchy, list structure, link targets, and image alt text
- Removes purely visual attributes (class, style)
- Unwraps anonymous wrappers
- Good for SPA-rendered HTML where ARIA attributes carry structural meaning
use mdka::options::{ConversionMode, ConversionOptions};

let opts = ConversionOptions::for_mode(ConversionMode::Semantic);
let md = mdka::html_to_markdown_with(html, &opts);
Preserve
Goal: Maximum fidelity to the original HTML. Lose as little information as possible.
- Retains all attributes, including class, data-*, aria-*, id, and unknowns
- Retains HTML comments in the pre-processed output
- Does not unwrap any elements
- Intended for archiving or audit scenarios where the original structure matters
use mdka::options::{ConversionMode, ConversionOptions};

let opts = ConversionOptions::for_mode(ConversionMode::Preserve);
let md = mdka::html_to_markdown_with(html, &opts);
Choosing a Mode
Is reproducibility the goal? → Preserve
Are you feeding content to an LLM? → Minimal (+drop_interactive_shell)
Is the source a SPA or ARIA-heavy? → Semantic
Debugging unexpected output? → Strict
Everything else → Balanced (default)
ConversionOptions
pub struct ConversionOptions {
    pub mode: ConversionMode,
    // Attribute retention
    pub preserve_ids: bool,
    pub preserve_classes: bool,
    pub preserve_data_attrs: bool,
    pub preserve_aria_attrs: bool,
    pub preserve_unknown_attrs: bool,
    // Pre-processing behaviour
    pub drop_presentation_attrs: bool,
    pub drop_interactive_shell: bool,
    pub unwrap_unknown_wrappers: bool,
}
ConversionOptions controls every detail of the pre-processing pipeline.
You rarely need to set individual fields — start with a mode and override
only what differs from the default for that mode.
Creating Options
From a mode (recommended)
use mdka::options::{ConversionMode, ConversionOptions};

let opts = ConversionOptions::for_mode(ConversionMode::Minimal);
for_mode returns sensible defaults for the chosen mode. See the table below.
Modify fields after creation
let mut opts = ConversionOptions::for_mode(ConversionMode::Balanced);
opts.drop_interactive_shell = true; // also strip nav/header/footer/aside
opts.preserve_ids = false;          // don't keep id= attributes
opts.preserve_aria_attrs = true;    // (already true in Balanced, shown for clarity)
Default
let opts = ConversionOptions::default(); // equivalent to for_mode(Balanced)
Field Defaults by Mode
| Field | Balanced | Strict | Minimal | Semantic | Preserve |
|---|---|---|---|---|---|
| preserve_ids | ✅ | ✅ | ❌ | ✅ | ✅ |
| preserve_classes | ❌ | ✅ | ❌ | ❌ | ✅ |
| preserve_data_attrs | ❌ | ✅ | ❌ | ❌ | ✅ |
| preserve_aria_attrs | ✅ | ✅ | ❌ | ✅ | ✅ |
| preserve_unknown_attrs | ❌ | ✅ | ❌ | ❌ | ✅ |
| drop_presentation_attrs | ✅ | ❌ | ✅ | ✅ | ❌ |
| drop_interactive_shell | ❌ | ❌ | ✅ | ❌ | ❌ |
| unwrap_unknown_wrappers | ❌ | ❌ | ✅ | ✅ | ❌ |
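
To make the table easier to scan, here is a self-contained sketch that encodes it as a function. `Mode`, `Options`, and `defaults_for` are hypothetical stand-ins defined only for this example; they mirror the table above and are not mdka's source. The real entry point is ConversionOptions::for_mode.

```rust
/// Stand-in for ConversionMode (hypothetical, for illustration only).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode { Balanced, Strict, Minimal, Semantic, Preserve }

/// Stand-in for ConversionOptions with the same field names as the table.
#[derive(Debug, PartialEq)]
struct Options {
    mode: Mode,
    preserve_ids: bool,
    preserve_classes: bool,
    preserve_data_attrs: bool,
    preserve_aria_attrs: bool,
    preserve_unknown_attrs: bool,
    drop_presentation_attrs: bool,
    drop_interactive_shell: bool,
    unwrap_unknown_wrappers: bool,
}

/// Encode the "Field Defaults by Mode" table, one expression per row.
fn defaults_for(mode: Mode) -> Options {
    use Mode::*;
    // Strict and Preserve are the two "keep everything" columns.
    let keep_everything = matches!(mode, Strict | Preserve);
    Options {
        mode,
        preserve_ids: mode != Minimal,
        preserve_classes: keep_everything,
        preserve_data_attrs: keep_everything,
        preserve_aria_attrs: mode != Minimal,
        preserve_unknown_attrs: keep_everything,
        drop_presentation_attrs: !keep_everything,
        drop_interactive_shell: mode == Minimal,
        unwrap_unknown_wrappers: matches!(mode, Minimal | Semantic),
    }
}
```

Because the flags are plain fields, assigning a new value to `mode` afterwards leaves the other fields as they were, which is the reason for calling the constructor again when switching modes.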
Field Reference
mode
The ConversionMode this options object was built from.
Changing mode after construction does not re-apply mode defaults
to the other fields — use for_mode() again instead.
preserve_ids
Whether to keep id="…" attributes in the pre-processed DOM.
Useful when the output is rendered in a context that relies on
anchor links (#section-name).
preserve_classes
Whether to keep class="…" attributes.
Rarely useful in Markdown output, but can help when feeding the
Markdown back into an HTML renderer that applies CSS.
preserve_data_attrs
Whether to keep data-* custom attributes.
Mostly relevant for Strict and Preserve modes.
preserve_aria_attrs
Whether to keep aria-* accessibility attributes.
Enabled by default in Balanced, Strict, Semantic, and Preserve.
The attributes themselves do not appear in Markdown output, but they
are used by the Semantic mode’s conversion logic.
preserve_unknown_attrs
Whether to keep attributes not otherwise handled (everything except
href, src, alt, title, aria-*, data-*, id, class, style).
drop_presentation_attrs
Whether to remove style and other purely visual attributes during pre-processing.
Enabled in Balanced, Minimal, and Semantic.
drop_interactive_shell
Whether to remove <nav>, <header>, <footer>, and <aside> elements
and all their children.
Useful for content extraction from full web pages.
Enabled by default only in Minimal (see Field Defaults by Mode); opt in explicitly in the other modes.
unwrap_unknown_wrappers
Whether to replace generic container elements (<div>, <span>,
<section>, <article>, <main>) with their children when they
carry no structural meaning. Enabled in Minimal and Semantic.
Error Handling
MdkaError
#[derive(Error, Debug)]
pub enum MdkaError {
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
}
MdkaError is the only error type in mdka. It has one variant, Io,
which wraps a std::io::Error.
IO errors arise from the file-based functions when:
- the input file does not exist or is not readable
- the output directory cannot be created
- the output file cannot be written
Infallible Functions
html_to_markdown and html_to_markdown_with never fail. They accept
any string and return a String. Malformed HTML, empty input, binary-looking
content, deeply nested structures — none of these cause a panic or an error.
Pattern Matching
use mdka::{html_file_to_markdown, MdkaError};

fn main() {
    match html_file_to_markdown("page.html", None::<&str>) {
        Ok(result) => println!("→ {}", result.dest.display()),
        Err(MdkaError::Io(e)) => eprintln!("IO error: {e}"),
    }
}
Because there is only one variant today, you can also use ? directly:
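// `?` needs an enclosing function that returns a compatible Result.
fn main() -> Result<(), mdka::MdkaError> {
    let result = mdka::html_file_to_markdown("page.html", None::<&str>)?;
    println!("→ {}", result.dest.display());
    Ok(())
}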
Bulk Conversion Errors
In html_files_to_markdown, each file fails independently.
A failed file does not abort the rest of the batch:
for (src, result) in mdka::html_files_to_markdown(&files, Path::new("out/")) {
    if let Err(e) = result {
        eprintln!("skipped {}: {e}", src);
    }
}
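When a batch is large, it can be handy to separate successes from failures before reporting. The sketch below shows one way to do that over the same shape of (input, Result) pairs. It is standalone: `split_outcomes` and `Outcome` are hypothetical names, and `std::io::Error` stands in for MdkaError so the example compiles without the crate.

```rust
use std::io;
use std::path::PathBuf;

/// One bulk outcome: the input path plus its individual result.
/// `io::Error` is a stand-in for the library's error type in this sketch.
type Outcome<'a> = (&'a str, Result<PathBuf, io::Error>);

/// Split a batch into written outputs and (input, message) failures.
/// A failed entry never prevents later entries from being collected.
fn split_outcomes<'a>(outcomes: Vec<Outcome<'a>>) -> (Vec<PathBuf>, Vec<(&'a str, String)>) {
    let mut written = Vec::new();
    let mut failed = Vec::new();
    for (src, result) in outcomes {
        match result {
            Ok(dest) => written.push(dest),
            Err(e) => failed.push((src, e.to_string())),
        }
    }
    (written, failed)
}
```

A caller could then log the `failed` list once at the end, or retry only those inputs.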
Supported HTML Elements
The table below shows every HTML element that mdka recognises and what Markdown it produces. Elements not listed are either silently removed (script, style, etc.) or their children are kept as plain text.
Block Elements
| HTML | Markdown output | Notes |
|---|---|---|
| <h1> – <h6> | # – ###### | ATX-style headings |
| <p> | Paragraph (blank lines around) | |
| <blockquote> | > prefix | Nesting produces > > , > > > , … |
| <pre><code> | Fenced code block ``` | Preserves whitespace and newlines |
| <ul> | - list | Nested lists indented by 2 spaces |
| <ol> | 1. list | Respects start attribute |
| <li> | List item | |
| <hr> | --- | |
| <div>, <article>, <section>, <main>, <figure>, <figcaption> | Block separator | Act as paragraph breaks; unwrapped in Minimal/Semantic |
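
The nesting rules in the table (repeated "> " prefixes, two-space list indentation, and an ordered list that honours start) can be sketched as small line formatters. The function names are hypothetical, and this only illustrates the output shape, not how mdka builds it.

```rust
/// Prefix a line with one "> " per blockquote nesting level.
fn blockquote_line(depth: usize, text: &str) -> String {
    format!("{}{}", "> ".repeat(depth), text)
}

/// Format an unordered list item, indenting nested levels by 2 spaces.
fn bullet_item(nesting: usize, text: &str) -> String {
    format!("{}- {}", "  ".repeat(nesting), text)
}

/// Number top-level ordered items starting from the list's `start` value.
fn ordered_items(start: usize, items: &[&str]) -> Vec<String> {
    items
        .iter()
        .enumerate()
        .map(|(offset, text)| format!("{}. {}", start + offset, text))
        .collect()
}
```

For example, `blockquote_line(2, "quoted")` gives `> > quoted`, and `ordered_items(3, &["c", "d"])` gives the lines `3. c` and `4. d`.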
Inline Elements
| HTML | Markdown output | Notes |
|---|---|---|
| <strong>, <b> | **text** | |
| <em>, <i> | *text* | |
| <code> (inline) | `text` | Only when not inside <pre> |
| <a href="…"> | [text](url) | title attribute → [text](url "title") |
| <img src="…" alt="…"> | ![alt](src) | title attribute → ![alt](src "title") |
| <br> | \n (trailing two spaces + newline) | |
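
The link and image rows differ only by the leading `!` and the optional title. A minimal sketch of that formatting, with hypothetical function names and no escaping of the inputs, might look like this:

```rust
/// Format an inline link; the title, when present, goes in double quotes.
fn link(text: &str, url: &str, title: Option<&str>) -> String {
    match title {
        Some(t) => format!("[{text}]({url} \"{t}\")"),
        None => format!("[{text}]({url})"),
    }
}

/// An image is the same syntax with a leading `!` and alt text as the label.
fn image(alt: &str, src: &str, title: Option<&str>) -> String {
    format!("!{}", link(alt, src, title))
}
```

A production converter would also need to handle quotes inside titles and parentheses inside URLs, which this sketch ignores.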
Code Blocks and Language Hints
When a <code> element has a class containing language-<name>, the
language name is included in the fenced block:
<pre><code class="language-rust">fn main() {}</code></pre>
Produces:
```rust
fn main() {}
```
The language-* class is preserved in all conversion modes, including
Balanced which otherwise strips class attributes.
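As an illustration of the language-hint rule, the sketch below pulls the name out of a class attribute and builds the fenced block. `language_hint` and `fenced_block` are hypothetical helpers written for this example, not functions exported by mdka.

```rust
/// Return the `<name>` of the first `language-<name>` token in a class attribute.
fn language_hint(class_attr: &str) -> Option<&str> {
    class_attr
        .split_ascii_whitespace()
        .find_map(|token| token.strip_prefix("language-"))
        .filter(|name| !name.is_empty())
}

/// Wrap code in a fenced block, adding the language name when one is found.
fn fenced_block(class_attr: &str, code: &str) -> String {
    let fence = "`".repeat(3); // three backticks
    let lang = language_hint(class_attr).unwrap_or("");
    // Make sure the closing fence starts on its own line.
    let newline = if code.ends_with('\n') { "" } else { "\n" };
    format!("{fence}{lang}\n{code}{newline}{fence}\n")
}
```

A class such as `hljs language-rust` yields the hint `rust`, while a class with no `language-` token produces a plain fence.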
Always-Removed Elements
These elements and all their descendants are removed unconditionally, regardless of conversion mode:
<script> · <style> · <meta> · <link> · <template> ·
<iframe> · <object> · <embed> · <noscript>
HTML comments are removed in all modes except Preserve, where they
are retained as <!-- … --> in the pre-processed DOM (though they do not
appear in Markdown output).
Shell Elements
<nav>, <header>, <footer>, <aside> are kept by default but can
be removed by setting drop_interactive_shell = true
or using ConversionMode::Minimal.
Text Processing Rules
mdka applies a small set of deterministic rules to produce consistent, readable Markdown from any HTML text content.
Whitespace Normalisation
HTML text nodes are normalised according to the HTML whitespace collapsing rules:
- Leading and trailing whitespace is trimmed from block-level context.
- Consecutive whitespace characters (spaces, tabs, newlines) within a text node are collapsed to a single space.
- A single space is preserved between adjacent inline elements.
- <br> produces a hard line break (\n).
- <pre> blocks are exempt: whitespace inside <pre> is reproduced exactly.
This is done in a single pass without regular expressions, which keeps allocation overhead low.
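A single-pass collapse of this kind can be sketched as follows. This is an assumption about the general technique, not mdka's actual routine: `collapse_whitespace` is a hypothetical name, and the sketch covers only the trim-and-collapse rules for one text run (not <br> or <pre> handling).

```rust
/// Collapse runs of ASCII whitespace to one space and trim both ends,
/// in a single pass and without regular expressions.
fn collapse_whitespace(input: &str) -> String {
    // One allocation up front; the result is never longer than the input.
    let mut out = String::with_capacity(input.len());
    let mut pending_space = false;
    for ch in input.chars() {
        if ch.is_ascii_whitespace() {
            // Remember the gap, but only once some text has been written,
            // which trims leading whitespace.
            pending_space = !out.is_empty();
        } else {
            if pending_space {
                out.push(' ');
                pending_space = false;
            }
            out.push(ch);
        }
    }
    // A trailing gap is never flushed, which trims trailing whitespace.
    out
}
```

Because only ASCII whitespace is treated as collapsible, a non-breaking space (U+00A0) passes through unchanged.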
Markdown Character Escaping
To prevent accidental Markdown formatting, the following characters are escaped with a backslash when they appear in text content that is not inside a code span or code block:
| Character | Escaped as | Context |
|---|---|---|
| * | \* | Would create emphasis |
| _ | \_ | Would create emphasis |
| ` | \` | Would start a code span |
| # | \# | At the start of a line, would create a heading |
| [ | \[ | Would start a link |
| ! | \! | Before [, would start an image |
| \ | \\ | The escape character itself |
Escaping is context-aware: a # in the middle of a line is not escaped,
only at the start of a line where it would be interpreted as an ATX heading.
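The context-aware rules in the table can be sketched as a small function. `escape_markdown` is a hypothetical name, and the sketch makes a simplifying assumption that "start of a line" means the first character of the text or the character right after a newline.

```rust
/// Backslash-escape characters that would otherwise be read as Markdown.
fn escape_markdown(text: &str) -> String {
    let mut out = String::with_capacity(text.len() + 8);
    let mut at_line_start = true;
    let mut chars = text.chars().peekable();
    while let Some(ch) = chars.next() {
        let needs_escape = match ch {
            // Always significant in running text.
            '*' | '_' | '`' | '[' | '\\' => true,
            // Only a heading marker at the start of a line.
            '#' => at_line_start,
            // Only an image marker when directly followed by '['.
            '!' => chars.peek() == Some(&'['),
            _ => false,
        };
        if needs_escape {
            out.push('\\');
        }
        out.push(ch);
        at_line_start = ch == '\n';
    }
    out
}
```

With these rules, `C# and *stars*` keeps its `#` untouched but gets both asterisks escaped.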
HTML Entity Decoding
HTML entities in text nodes are decoded by the HTML parser (scraper / html5ever) before mdka processes them. The result is already Unicode text:
| HTML entity | After parsing | In Markdown |
|---|---|---|
| &amp; | & | & |
| &lt; | < | < |
| &gt; | > | > |
| &nbsp; | non-breaking space | preserved as space |
Output Boundaries
- Output always ends with exactly one newline (\n) when the input produces any content; the output is empty for empty input.
- Leading blank lines that scraper adds when wrapping content in <html><body> are trimmed before the final string is returned.
- Block elements (paragraphs, headings, lists, etc.) are separated by blank lines.
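The first two boundary rules amount to a small normalisation step on the finished string. A sketch under that assumption follows; `finalize_output` is a hypothetical helper, and for simplicity it trims all trailing whitespace, not only newlines.

```rust
/// Trim leading blank lines and end with exactly one newline,
/// or return an empty string when nothing was produced.
fn finalize_output(raw: &str) -> String {
    let body = raw
        .trim_start_matches(|c: char| c == '\n' || c == '\r')
        .trim_end();
    if body.is_empty() {
        return String::new();
    }
    let mut out = String::with_capacity(body.len() + 1);
    out.push_str(body);
    out.push('\n');
    out
}
```

Blank lines between blocks inside the body are left alone, so paragraph separation is preserved.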
Design Philosophy
The Goal: Balance, not Dominance
There are excellent HTML-to-Markdown libraries in the Rust ecosystem — some prioritise raw speed, others maximise conversion fidelity. mdka is not trying to beat them on every axis.
Its aim is a practical balance:
Produce stable, readable Markdown from real-world HTML, with an easy API, without surprising the caller at runtime.
Speed and memory efficiency matter, and mdka is designed with both in mind. But they are means to an end, not the end itself.
Real-World HTML is Messy
Web content rarely arrives as clean, well-formed documents. In practice you encounter:
- HTML that a CMS generated and no human ever wrote
- SPA-rendered DOM fragments extracted from DevTools
- Scraped pages with ad slots, cookie banners, and navigation wrapped around the content
- Documents with 5,000 levels of nested <div> elements
- Missing closing tags, duplicate attributes, and unknown elements
mdka uses scraper, which is built on html5ever — the same parser used by the Servo browser engine. It applies the HTML5 parsing algorithm, meaning: unknown elements are handled gracefully, missing tags are inferred, and the result is always a well-formed DOM tree, regardless of the input quality.
No Stack Overflows
A common failure mode in tree-processing code is stack overflow on deeply
nested input. mdka uses an explicit Vec-based stack (non-recursive DFS)
for every tree traversal — both in the pre-processing pipeline and in the
Markdown conversion step. This means it handles any nesting depth that
fits in heap memory.
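To illustrate the idea, here is a minimal explicit-stack traversal over a toy tree. The `Node` and `Tree` types are hypothetical stand-ins for a parsed DOM (nodes refer to children by index in an arena), not mdka's internal structures.

```rust
/// Arena-backed toy tree: each node lists its children by index.
struct Node {
    tag: &'static str,
    children: Vec<usize>,
}

struct Tree {
    nodes: Vec<Node>, // index 0 is the root
}

impl Tree {
    /// Build a chain of `depth` nested `div` nodes ending in one `p` node.
    fn nested(depth: usize) -> Tree {
        let mut nodes = Vec::with_capacity(depth + 1);
        for i in 0..depth {
            nodes.push(Node { tag: "div", children: vec![i + 1] });
        }
        nodes.push(Node { tag: "p", children: Vec::new() });
        Tree { nodes }
    }

    /// Pre-order DFS with an explicit Vec as the stack; returns (tag, depth) pairs.
    fn walk(&self) -> Vec<(&'static str, usize)> {
        let mut visited = Vec::with_capacity(self.nodes.len());
        let mut stack = vec![(0usize, 0usize)]; // (node index, depth)
        while let Some((idx, depth)) = stack.pop() {
            let node = &self.nodes[idx];
            visited.push((node.tag, depth));
            // Push children in reverse so the first child is visited first.
            for &child in node.children.iter().rev() {
                stack.push((child, depth + 1));
            }
        }
        visited
    }
}
```

Since the pending work lives in a heap-allocated Vec, walking `Tree::nested(100_000)` uses no more call-stack space than walking a three-node tree.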
Configurable Pre-Processing
HTML from different sources needs different treatment. A page scraped from a news site has navigation, advertising, and footer content that a content extraction pipeline wants to remove. A document being archived for audit purposes should retain as much as possible.
The five conversion modes encode these intent differences as named, opinionated presets. They are applied in a pre-processing pass that filters the DOM before Markdown conversion runs — keeping the conversion logic itself simple and mode-agnostic.
One Allocator, Minimal Copies
The conversion pipeline is designed to minimise heap allocations:
- Whitespace normalisation is done in a single pass, writing directly into the output String.
- No regular expressions are used at runtime (avoiding compiled regex objects).
- The output String is pre-allocated with a capacity estimate.
- The #[global_allocator] counter in the CLI and benchmarks measures this directly.
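The first and third points can be sketched as follows. This is an illustrative approximation, not mdka's utils.rs: `push_normalized` and `normalize` are hypothetical names, and dropping leading and trailing whitespace is a simplifying choice of the sketch.

```rust
/// Append `input` to `out`, collapsing each run of whitespace into one space.
/// A single pass over the characters, with no regex and no intermediate String.
fn push_normalized(out: &mut String, input: &str) {
    let mut pending_space = false; // a whitespace run is waiting to be emitted
    let mut wrote_any = false; // have we written a non-whitespace char yet?
    for ch in input.chars() {
        if ch.is_whitespace() {
            // Only remember the space if something precedes it, so leading
            // whitespace is dropped.
            pending_space = wrote_any;
        } else {
            if pending_space {
                out.push(' ');
                pending_space = false;
            }
            out.push(ch);
            wrote_any = true;
        }
    }
    // A trailing whitespace run is never flushed, so it is dropped too.
}

/// Convenience wrapper that pre-allocates the output with a capacity estimate.
/// The input length is an upper bound because normalisation never grows text.
fn normalize(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    push_normalized(&mut out, input);
    out
}
```

For example, `normalize("  hello \n\t world  ")` yields `"hello world"` with one allocation for the output buffer.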
Performance Characteristics
The Focus of mdka
The Rust ecosystem offers a variety of excellent HTML-to-Markdown converters. Many of these projects prioritize feature-richness, complex edge-case handling, or high extensibility.
mdka takes a different approach. Our mission is to provide a “minimalist, lightweight, and memory-efficient” converter, specifically optimized for resource-constrained environments or high-concurrency tasks where overhead must be kept to an absolute minimum.
The benchmarks presented here are not intended to rank libraries or declare a “winner.” Instead, they serve as internal metrics to verify whether mdka is successfully meeting its own design goals. We believe in choosing the right tool for the specific job, and we encourage developers to explore the diverse range of libraries available in the ecosystem to find the one that best fits their needs.
The Evolution: v1 to v2
With the release of v2, mdka underwent a complete architectural overhaul. We moved away from the original implementation to a ground-up rewrite focused on:
- Stack-Safe Traversal: Implementing a non-recursive Depth-First Search (DFS) to prevent stack overflow even with deeply nested HTML.
- Optimized Memory Allocation: Reducing unnecessary clones and leveraging Rust’s ownership model to minimize peak memory usage.
- Streamlined Processing: Simplifying the conversion logic to achieve a predictable and lightweight execution path.
This rewrite resulted in a dramatic performance leap and a significantly reduced memory footprint compared to our previous version.
Benchmark Results (2026-04-15)
The following data demonstrates how the v2 architecture has improved our efficiency and how it aligns with our goal of “reasonable speed with minimal resource consumption.”
The figures below are wall-clock medians from Criterion. The log also records outliers for each run, so small differences should be read with some caution.
Conditions
All libraries were benchmarked under the same conditions:
Linux x86_64 6.19, Rust 1.94.1, Criterion 0.8, 28 logical cores, 3 s warm-up, and 3 s measurement.
Libraries Under Test
| Library | Version | HTML parser | Approach |
|---|---|---|---|
| mdka | 2.0.0 | scraper (html5ever) | Full DOM tree; non-recursive DFS |
| mdka_v1 | 1.6.9 | html5ever | Full DOM tree; older implementation |
| html2md | 0.2.15 | html5ever | DOM-based converter |
| fast_html2md | 0.0.61 | lol_html | Streaming rewriter |
| htmd | 0.5.4 | html5ever | DOM-based converter |
| html_to_markdown_rs | 3.1.0 | html5ever | DOM-based converter |
| html2text | 0.16.7 | html5ever | Text-oriented converter |
| dom_smoothie | 0.17.0 | dom_query (html5ever) | DOM-oriented converter |
These libraries do not share the same design; they take different approaches and pursue different goals.
Conversion Speed
| Dataset | mdka v2 | mdka v1 | html2md | fast_html2md | htmd | html_to_markdown_rs | html2text | dom_smoothie |
|---|---|---|---|---|---|---|---|---|
| small | 131.52 µs | 131.66 µs | 132.21 µs | 79.50 µs | 90.47 µs | 107.82 µs | 350.92 µs | 317.37 µs |
| medium | 1.3040 ms | 2.2866 ms | 1.5266 ms | 887.59 µs | 1.0562 ms | 1.1660 ms | 3.3999 ms | 2.7643 ms |
| large | 12.336 ms | 75.751 ms | 12.455 ms | 7.0399 ms | 7.7896 ms | 9.6825 ms | 29.854 ms | 26.062 ms |
| deep_nest | 32.620 ms | 373.10 ms | 36.834 ms | 5.9868 ms | 72.481 ms | 96.744 ms | 30.903 ms | 29.408 ms |
| flat | 5.6253 ms | 24.817 ms | 6.7911 ms | 4.2114 ms | 5.5321 ms | 4.6975 ms | 14.023 ms | 29.408 ms |
| malformed | 31.712 µs | 40.178 µs | 71.778 µs | 52.948 µs | 62.302 µs | 41.109 µs | 96.822 µs | 5.6401 ms |
mdka v2 is clearly ahead of mdka v1 in this run. On the smallest input the two are effectively tied (131.52 µs vs 131.66 µs), but the gap widens as the input gets larger or structurally harder: around 1.75× faster on medium, 6.1× on large, 11.4× on deep_nest, and 4.4× on flat. On malformed input, v2 is about 1.27× faster than v1 and is the fastest of the libraries tested.
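The quoted factors are simply the v1 median divided by the v2 median. The short sketch below reproduces that arithmetic from the table values; `MEDIANS_MS`, `speedup`, and `speedups` are names invented for this example.

```rust
/// Wall-clock medians in milliseconds, copied from the Conversion Speed table:
/// (dataset, mdka v1, mdka v2).
const MEDIANS_MS: [(&str, f64, f64); 4] = [
    ("medium", 2.2866, 1.3040),
    ("large", 75.751, 12.336),
    ("deep_nest", 373.10, 32.620),
    ("flat", 24.817, 5.6253),
];

/// How many times faster v2 is than v1 for one dataset.
fn speedup(v1_ms: f64, v2_ms: f64) -> f64 {
    v1_ms / v2_ms
}

/// Speed-up factor per dataset, in table order.
fn speedups() -> Vec<(&'static str, f64)> {
    MEDIANS_MS
        .iter()
        .map(|&(name, v1, v2)| (name, speedup(v1, v2)))
        .collect()
}
```

Rounded to two decimals, the factors come out as 1.75, 6.14, 11.44, and 4.41.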
Memory Allocation
| Dataset | mdka v2 | mdka v1 | html2md | fast_html2md | htmd | html_to_markdown_rs | html2text | dom_smoothie |
|---|---|---|---|---|---|---|---|---|
| small | 113.5 KB | 240 KB | 231 KB | 154 KB | 93.6 KB | 232.5 KB | 764.5 KB | 325.4 KB |
| medium | 984.6 KB | 2.03 MB | 1.95 MB | 1.52 MB | 1.01 MB | 1.95 MB | 8.50 MB | 2.85 MB |
| large | 8.00 MB | 17.0 MB | 16.76 MB | 11.98 MB | 7.85 MB | 16.76 MB | 74.89 MB | 23.08 MB |
| deep_nest | 3.00 MB | 4.71 MB | 2.55 MB | 6.85 MB | 1.96 MB | 2.55 MB | 18.48 MB | — |
| flat | 3.93 MB | 7.90 MB | 7.87 MB | 7.46 MB | 4.84 MB | 7.87 MB | 40.28 MB | 35.47 MB |
| malformed | 44.7 KB | 91.6 KB | 71.4 KB | 145 KB | 62.3 KB | 71.4 KB | 464.4 KB | 1.63 MB |
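In this run, mdka v2 allocates less heap than v1 on every dataset: roughly half of v1's figure on small, medium, large, flat, and malformed, and about 64% of it on deep_nest (3.00 MB vs 4.71 MB).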
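Figures of this kind are commonly gathered by wrapping the system allocator in a counter. The sketch below shows that general pattern using only the standard library. It is not a copy of mdka's alloc_counter.rs, and `CountingAlloc`, `ALLOCATED`, and `bytes_allocated_by` are hypothetical names.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Running total of bytes requested from the allocator (never decremented).
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

/// Thin wrapper that counts bytes and forwards every call to the system allocator.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

/// Install the counter as the process-wide allocator.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Run `work` and report how many bytes were requested while it ran.
/// In a multi-threaded program the figure also includes other threads' requests.
fn bytes_allocated_by<T>(work: impl FnOnce() -> T) -> (T, usize) {
    let before = ALLOCATED.load(Ordering::Relaxed);
    let value = work();
    let after = ALLOCATED.load(Ordering::Relaxed);
    (value, after - before)
}
```

This sketch counts total bytes requested rather than peak live memory. A peak figure would additionally need to subtract in `dealloc` and track the maximum.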
Summary
As shown in the results, the transition to v2 has allowed us to achieve our objectives of being lightweight and memory-efficient while maintaining competitive speed.
We recognize that other libraries may offer more features or different trade-offs that make them better suited for certain applications. mdka aims to be the best choice for those who prioritize a simple, “Unix-style” tool that does one thing—conversion—with the smallest possible footprint.
Architecture
Workspace Layout
mdka/
├── src/ mdka library crate (lib only)
│ ├── lib.rs Public API surface
│ ├── options.rs ConversionMode, ConversionOptions
│ ├── traversal.rs Markdown conversion traversal
│ ├── renderer.rs MarkdownRenderer state machine
│ ├── utils.rs Whitespace normalisation + escaping
│ └── alloc_counter.rs Custom allocator (for benchmarks)
├── tests/ integration test modules
│ └── utils/preprocessor.rs DOM pre-processing pipeline
├── cli/ mdka-cli binary crate
│ └── src/main.rs Argument parsing + dispatch
├── node/ Node.js bindings (napi-rs v3)
├── python/ Python bindings (PyO3 v0)
├── benches/ criterion benchmarks
└── examples/ Allocation measurement tool
Conversion Pipeline
Each call to html_to_markdown_with follows these steps:
HTML string
│
▼
[1] Parse scraper::Html::parse_document()
│ → html5ever DOM tree (tolerant HTML5 parsing)
▼
[2] Pre-process preprocessor::preprocess(&doc, opts)
│ → filtered HTML string
│ Non-recursive DFS over ego-tree nodes
│ Drops: script, style, iframe, …
│ Filters attributes per ConversionOptions
│ Removes shell elements (if opted in)
│ Unwraps anonymous wrappers (if opted in)
▼
[3] Re-parse scraper::Html::parse_document(&cleaned)
│ → clean DOM for conversion
▼
[4] Convert traversal::traverse(&doc)
│ → Markdown string
│ Non-recursive DFS with Enter/Leave events
│ Drives MarkdownRenderer via event callbacks
▼
[5] Finalise renderer.finish()
→ trim leading/trailing whitespace
→ ensure single trailing newline
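Step [4] can be pictured with a small sketch of an Enter/Leave traversal. It runs over a toy arena of nodes and writes tag markers instead of Markdown; `Node`, `Visit`, `traverse`, and `sample` are hypothetical names, and the real traversal operates on scraper's DOM.

```rust
/// Toy DOM node stored in an arena; children are indices into the same slice.
struct Node {
    tag: &'static str,
    children: Vec<usize>,
}

/// Work items on the explicit stack: open a node, or close it afterwards.
enum Visit {
    Enter(usize),
    Leave(usize),
}

/// Non-recursive DFS that emits an Enter event before a node's children
/// and a Leave event after them.
fn traverse(nodes: &[Node], root: usize) -> String {
    let mut out = String::new();
    let mut stack = vec![Visit::Enter(root)];
    while let Some(visit) = stack.pop() {
        match visit {
            Visit::Enter(id) => {
                out.push_str(&format!("<{}>", nodes[id].tag));
                // Push Leave first so it is popped only after every child.
                stack.push(Visit::Leave(id));
                // Push children in reverse so they pop in document order.
                for &child in nodes[id].children.iter().rev() {
                    stack.push(Visit::Enter(child));
                }
            }
            Visit::Leave(id) => out.push_str(&format!("</{}>", nodes[id].tag)),
        }
    }
    out
}

/// A small sample tree: div containing p and ul, with ul containing li.
fn sample() -> Vec<Node> {
    vec![
        Node { tag: "div", children: vec![1, 2] },
        Node { tag: "p", children: vec![] },
        Node { tag: "ul", children: vec![3] },
        Node { tag: "li", children: vec![] },
    ]
}
```

A renderer driven by these events can open a construct on Enter and close it on Leave, which is what lets it track state such as list or blockquote nesting without recursion.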
MarkdownRenderer
MarkdownRenderer is a state machine that maintains:
- output: the accumulated Markdown string
- list_stack: tracks nested ordered/unordered lists
- blockquote_depth: counts blockquote nesting level
- in_pre: whether inside a <pre> block
- at_line_start: deferred prefix flag for blockquote > emission
- newlines_emitted: prevents double-blank-line accumulation
The at_line_start flag is key: rather than emitting > prefixes
immediately when entering a blockquote, the renderer defers them until
actual content is written. This ensures nested blockquotes emit the
correct number of > characters regardless of how many block elements
intervene.
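A stripped-down sketch of the deferral is shown below. It keeps only three of the fields listed above and is not the real MarkdownRenderer: the method names and the exact prefix format ("> " per level) are assumptions for illustration.

```rust
/// Minimal renderer state, reduced to what the deferral needs.
struct Renderer {
    output: String,
    blockquote_depth: usize,
    at_line_start: bool,
}

impl Renderer {
    fn new() -> Self {
        Renderer { output: String::new(), blockquote_depth: 0, at_line_start: true }
    }

    /// Entering a blockquote only changes the depth; nothing is written yet.
    fn enter_blockquote(&mut self) {
        self.blockquote_depth += 1;
    }

    fn leave_blockquote(&mut self) {
        self.blockquote_depth = self.blockquote_depth.saturating_sub(1);
    }

    /// Prefixes are emitted lazily, just before the first content on a line,
    /// using whatever depth is current at that moment.
    fn write_text(&mut self, text: &str) {
        if text.is_empty() {
            return;
        }
        if self.at_line_start {
            for _ in 0..self.blockquote_depth {
                self.output.push_str("> ");
            }
            self.at_line_start = false;
        }
        self.output.push_str(text);
    }

    /// End the current line; the next write will emit prefixes again.
    fn newline(&mut self) {
        self.output.push('\n');
        self.at_line_start = true;
    }
}
```

Because nothing is written on entry, two blockquotes opened back to back before any text still produce a single, correctly prefixed line.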
Language Bindings
Both the Node.js and Python bindings are thin wrappers:
- Node.js (napi-rs): exposes sync and async (tokio::spawn_blocking) variants. The async variants keep the Node.js event loop free during conversion.
- Python (PyO3): calls py.detach() in the batch function html_to_markdown_many, releasing the GIL for rayon parallel conversion.
The binding crates (mdka-node, mdka-python) have no conversion logic
of their own — they call the same Rust functions as the library and CLI.
Features
Crash Resistance
mdka uses non-recursive DFS traversal throughout. An explicit Vec stack
replaces the call stack, so documents with arbitrarily deep nesting will
not cause a stack overflow. This has been tested with 10,000 levels of
nested <div> elements.
Some fast converters use recursive tree traversal and can overflow the stack on deeply nested input. If your input source is not fully controlled, crash resistance matters.
Five Conversion Modes
Rather than a single fixed conversion strategy, mdka offers five named modes that tune the pre-processing pipeline:
- Balanced — readable output for general use
- Strict — maximum attribute retention for debugging
- Minimal — body text only; good for LLM input preparation
- Semantic — preserves ARIA and document structure
- Preserve — maximum fidelity for archiving
Each mode can be further customised with per-call option flags. See Conversion Modes and ConversionOptions.
Parallel File Conversion
html_files_to_markdown and html_files_to_markdown_with use
rayon to convert multiple files
in parallel. Each file’s result is independent — one failed file does
not stop the batch.
The Node.js and Python bindings expose this as an async function
(htmlFilesToMarkdown, html_files_to_markdown) so the thread pool
work does not block the caller’s event loop or hold the GIL.
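The "one result per file" behaviour can be sketched with the standard library alone. This example deliberately swaps rayon for std::thread::scope with one thread per file, and `convert_stub` is a placeholder that only trims its input; neither reflects mdka's actual implementation or signatures.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::thread;

/// Placeholder for the real conversion; it only trims surrounding whitespace.
fn convert_stub(html: &str) -> String {
    html.trim().to_string()
}

/// Convert every file concurrently and return one result per input path.
/// A file that cannot be read yields an Err for that entry only.
fn convert_files(paths: &[PathBuf]) -> Vec<(PathBuf, io::Result<String>)> {
    thread::scope(|scope| {
        // Spawn one scoped worker per file (a thread pool would cap this).
        let handles: Vec<_> = paths
            .iter()
            .map(|path| {
                scope.spawn(move || {
                    let result = fs::read_to_string(path).map(|html| convert_stub(&html));
                    (path.clone(), result)
                })
            })
            .collect();
        // Joining in spawn order keeps results aligned with the input order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Returning a `Vec` of per-file results, rather than a single `Result` for the whole batch, is what lets callers report or retry individual failures while keeping the successful conversions.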
Multi-Language API
The same Rust implementation is accessible from three languages:
| Language | Package | Mechanism |
|---|---|---|
| Rust | mdka on crates.io | native library |
| Node.js | mdka on npm | napi-rs native module |
| Python | mdka on PyPI | PyO3 extension module |
All three call the same underlying conversion code and produce identical output for identical input.
html5ever Parser Foundation
The HTML parser is scraper, which is built on html5ever. html5ever implements the HTML5 parsing algorithm, the same one that web browsers use.
This means:
- Missing closing tags are inferred correctly
- Unknown elements are preserved (not silently dropped)
- Malformed attribute syntax is normalised
- The result is always a valid DOM tree, no matter the input
Predictable, Deterministic Output
For a given HTML input and ConversionOptions, mdka always produces
the same Markdown string. There is no randomisation, no date-stamping,
and no version-dependent output variation within a semver major version.
Minimal Dependencies
The runtime dependencies of the mdka library crate are:
| Crate | Purpose |
|---|---|
| scraper | HTML parsing (html5ever wrapper) |
| ego-tree | DOM tree traversal |
| rayon | Parallel file conversion |
| tikv-jemallocator, tikv-jemalloc-ctl | jemalloc allocator, used to limit heap fragmentation and scale under concurrency |
| thiserror | MdkaError derive macro |
Benchmark and comparison dependencies (criterion, competitors) are
[dev-dependencies] and do not affect library consumers.