Goldziher.HtmlToMarkdown
2.16.1
- .NET CLI: dotnet add package Goldziher.HtmlToMarkdown --version 2.16.1
- Package Manager: NuGet\Install-Package Goldziher.HtmlToMarkdown -Version 2.16.1
- PackageReference: <PackageReference Include="Goldziher.HtmlToMarkdown" Version="2.16.1" />
- Central Package Management: <PackageVersion Include="Goldziher.HtmlToMarkdown" Version="2.16.1" /> in Directory.Packages.props, plus <PackageReference Include="Goldziher.HtmlToMarkdown" /> in the project file
- Paket CLI: paket add Goldziher.HtmlToMarkdown --version 2.16.1
- Script & Interactive: #r "nuget: Goldziher.HtmlToMarkdown, 2.16.1"
- File-based apps: #:package Goldziher.HtmlToMarkdown@2.16.1
- Cake: #addin nuget:?package=Goldziher.HtmlToMarkdown&version=2.16.1 (as an addin) or #tool nuget:?package=Goldziher.HtmlToMarkdown&version=2.16.1 (as a tool)
html-to-markdown
High-performance HTML → Markdown conversion powered by Rust. Shipping as a Rust crate, Python package, PHP extension, Ruby gem, Elixir Rustler NIF, Node.js bindings, WebAssembly, and standalone CLI with identical rendering behaviour.
🎮 Try the Live Demo →
Experience WebAssembly-powered HTML to Markdown conversion instantly in your browser. No installation needed!
Why html-to-markdown?
- Blazing Fast: Rust-powered core delivers 10-80× faster conversion than pure Python alternatives
- Universal: Works everywhere - Node.js, Bun, Deno, browsers, Python, Rust, and standalone CLI
- Smart Conversion: Handles complex documents including nested tables, code blocks, task lists, and hOCR OCR output
- Metadata Extraction: Extract document metadata (title, description, headers, links, images) alongside conversion
- Highly Configurable: Control heading styles, code block fences, list formatting, whitespace handling, and HTML sanitization
- Tag Preservation: Keep specific HTML tags unconverted when markdown isn't expressive enough
- Secure by Default: Built-in HTML sanitization prevents malicious content
- Consistent Output: Identical markdown rendering across all language bindings
Documentation
Language Guides & API References:
- Python: README with metadata extraction, inline images, hOCR workflows
- JavaScript/TypeScript: Node.js | TypeScript | WASM
- Ruby: README with RBS types, Steep type checking
- PHP: Package | Extension (PIE)
- Go: README with FFI bindings
- Java: README with Panama FFI, Maven/Gradle setup
- C#/.NET: README with NuGet distribution
- Elixir: README with Rustler NIF bindings
- Rust: README with core API, error handling, advanced features
Project Resources:
- Contributing: CONTRIBUTING.md ⭐ Start here for development
- Changelog: CHANGELOG.md (version history and breaking changes)
Installation
| Target | Command(s) |
|---|---|
| Node.js/Bun (native) | npm install html-to-markdown-node |
| WebAssembly (universal) | npm install html-to-markdown-wasm |
| Deno | import { convert } from "npm:html-to-markdown-wasm" |
| Python (bindings + CLI) | pip install html-to-markdown |
| PHP (extension + helpers) | PHP_EXTENSION_DIR=$(php-config --extension-dir) pie install goldziher/html-to-markdown<br>composer require goldziher/html-to-markdown |
| Ruby gem | bundle add html-to-markdown or gem install html-to-markdown |
| Elixir (Rustler NIF) | {:html_to_markdown, "~> 2.8"} |
| Rust crate | cargo add html-to-markdown-rs |
| Rust CLI (crates.io) | cargo install html-to-markdown-cli |
| Homebrew CLI | brew install html-to-markdown (core) |
| Releases | GitHub Releases |
Quick Start
JavaScript/TypeScript
Node.js / Bun (Native - Fastest):
import { convert } from 'html-to-markdown-node';

const html = '<h1>Hello</h1><p>Rust ❤️ Markdown</p>';

const markdown = convert(html, {
  headingStyle: 'Atx',
  codeBlockStyle: 'Backticks',
  wrap: true,
  preserveTags: ['table'], // NEW in v2.5: Keep complex HTML as-is
});
Deno / Browsers / Edge (Universal):
import { convert } from "npm:html-to-markdown-wasm"; // Deno
// or: import { convert } from 'html-to-markdown-wasm'; // Bundlers

const html = '<h1>Hello</h1><p>Rust ❤️ Markdown</p>';

const markdown = convert(html, {
  headingStyle: 'atx',
  listIndentWidth: 2,
});
Performance: The shared fixture harness now lives in tools/benchmark-harness and is used to track Rust + binding throughput over time.
See the JavaScript guides for full API documentation.
Metadata extraction (all languages)
import { convertWithMetadata } from 'html-to-markdown-node';

const html = `
<html>
  <head>
    <title>Example</title>
    <meta name="description" content="Demo page">
    <link rel="canonical" href="https://example.com/page">
  </head>
  <body>
    <h1 id="welcome">Welcome</h1>
    <a href="https://example.com" rel="nofollow external">Example link</a>
    <img src="https://example.com/image.jpg" alt="Hero" width="640" height="480">
  </body>
</html>
`;

const { markdown, metadata } = await convertWithMetadata(
  html,
  { headingStyle: 'Atx' },
  { extract_links: true, extract_images: true, extract_headers: true },
);

console.log(markdown);
// metadata.document.title === 'Example'
// metadata.links[0].rel === ['nofollow', 'external']
// metadata.images[0].dimensions === [640, 480]
Equivalent APIs are available in every binding:
- Python: convert_with_metadata(html, options=None, metadata_config=None)
- Ruby: HtmlToMarkdown.convert_with_metadata(html, options = nil, metadata_config = nil)
- PHP: convert_with_metadata(string $html, ?array $options = null, ?array $metadataConfig = null)
CLI
# Convert a file
html-to-markdown input.html > output.md
# Stream from stdin
curl https://example.com | html-to-markdown > output.md
# Apply options
html-to-markdown --heading-style atx --list-indent-width 2 input.html
# Fetch a remote page (HTTP) with optional custom User-Agent
html-to-markdown --url https://example.com > output.md
html-to-markdown --url https://example.com --user-agent "Mozilla/5.0" > output.md
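When the CLI is driven from a script, the same flags can be assembled programmatically. The following is a minimal Python sketch under stated assumptions: `build_command` and `convert_file` are hypothetical helpers (not part of the package), only the `--heading-style`, `--list-indent-width`, and `--url` flags shown above are used, and the binary is invoked only if `html-to-markdown` is found on `PATH`.

```python
import shlex
import shutil
import subprocess


def build_command(source, *, heading_style=None, list_indent_width=None, is_url=False):
    """Assemble an argv list for the html-to-markdown CLI (hypothetical helper)."""
    argv = ["html-to-markdown"]
    if heading_style is not None:
        argv += ["--heading-style", heading_style]
    if list_indent_width is not None:
        argv += ["--list-indent-width", str(list_indent_width)]
    # Remote pages go through --url; local files are passed positionally.
    argv += ["--url", source] if is_url else [source]
    return argv


def convert_file(path):
    """Run the CLI on a local file; return Markdown, or None if the CLI is missing."""
    if shutil.which("html-to-markdown") is None:
        return None
    result = subprocess.run(
        build_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout


command = build_command("input.html", heading_style="atx", list_indent_width=2)
print(shlex.join(command))
```

Passing an argv list to `subprocess.run` (rather than a shell string) avoids quoting problems when file names or URLs contain spaces or shell metacharacters.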
Metadata Extraction
Extract document metadata alongside HTML-to-Markdown conversion. All bindings support identical APIs:
CLI Examples
# Basic metadata extraction with conversion
html-to-markdown input.html --with-metadata -o output.json
# Extract document metadata (title, description, language, etc.)
html-to-markdown input.html --with-metadata --extract-document
# Extract headers and links
html-to-markdown input.html --with-metadata --extract-headers --extract-links
# Extract all metadata types with conversion
html-to-markdown input.html --with-metadata \
  --extract-document \
  --extract-headers \
  --extract-links \
  --extract-images \
  --extract-structured-data \
  -o metadata.json
# Fetch and extract from remote URL
html-to-markdown --url https://example.com --with-metadata -o output.json
# Web scraping with preprocessing and metadata
html-to-markdown page.html --preprocess --preset aggressive \
  --with-metadata --extract-links --extract-images
Output format (JSON):
{
  "markdown": "# Title\n\nContent here...",
  "metadata": {
    "document": {
      "title": "Page Title",
      "description": "Meta description",
      "charset": "utf-8",
      "language": "en"
    },
    "headers": [
      { "level": 1, "text": "Title", "id": "title" }
    ],
    "links": [
      {
        "text": "Example",
        "href": "https://example.com",
        "title": null,
        "rel": ["external"]
      }
    ],
    "images": [
      {
        "src": "https://example.com/image.jpg",
        "alt": "Hero image",
        "title": null,
        "dimensions": [640, 480]
      }
    ]
  }
}
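A JSON document in this shape can be post-processed with nothing but a JSON parser. The sketch below is illustrative only: `summarize` is a hypothetical helper, the sample payload is trimmed from the output format above, and it assumes that `rel` and `dimensions` may be null or missing in real output.

```python
import json

# Sample payload trimmed from the output format shown above.
# A raw string keeps the JSON "\n" escapes intact.
RAW = r"""
{
  "markdown": "# Title\n\nContent here...",
  "metadata": {
    "document": {"title": "Page Title", "language": "en"},
    "headers": [{"level": 1, "text": "Title", "id": "title"}],
    "links": [
      {"text": "Example", "href": "https://example.com", "title": null, "rel": ["external"]}
    ],
    "images": [
      {"src": "https://example.com/image.jpg", "alt": "Hero image", "title": null, "dimensions": [640, 480]}
    ]
  }
}
"""


def summarize(raw_json):
    """Return a small summary dict from a --with-metadata style JSON document."""
    data = json.loads(raw_json)
    metadata = data.get("metadata", {})
    links = metadata.get("links", [])
    images = metadata.get("images", [])
    return {
        "title": metadata.get("document", {}).get("title"),
        "external_links": [
            link["href"] for link in links if "external" in (link.get("rel") or [])
        ],
        # dimensions is a [width, height] pair in the sample; skip images without it.
        "image_sizes": {
            img["src"]: tuple(img["dimensions"]) for img in images if img.get("dimensions")
        },
        "markdown_lines": len(data["markdown"].splitlines()),
    }


summary = summarize(RAW)
print(summary)
```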
Python Example
from html_to_markdown import convert_with_metadata

html = '''
<html>
  <head>
    <title>Product Guide</title>
    <meta name="description" content="Complete product documentation">
  </head>
  <body>
    <h1>Getting Started</h1>
    <p>Visit our <a href="https://example.com">website</a> for more.</p>
    <img src="https://example.com/guide.jpg" alt="Setup diagram" width="800" height="600">
  </body>
</html>
'''

markdown, metadata = convert_with_metadata(
    html,
    options={'heading_style': 'Atx'},
    metadata_config={
        'extract_document': True,
        'extract_headers': True,
        'extract_links': True,
        'extract_images': True,
    },
)

print(markdown)
print(f"Title: {metadata['document']['title']}")
print(f"Links found: {len(metadata['links'])}")
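Extracted header metadata lends itself to building a table of contents. The sketch below is a hypothetical follow-up, not part of the library: `build_toc` is an illustrative helper, and the sample list is hand-written in the `level` / `text` / `id` shape from the JSON output format above (that the Python binding returns exactly this shape is an assumption).

```python
def build_toc(headers, max_level=3):
    """Render header metadata as a nested Markdown bullet list."""
    lines = []
    for header in headers:
        level = header["level"]
        if level > max_level:
            continue  # skip headings deeper than the requested depth
        indent = "  " * (level - 1)
        anchor = header.get("id")
        # Link to the anchor when the source heading had an id; plain text otherwise.
        label = f"[{header['text']}](#{anchor})" if anchor else header["text"]
        lines.append(f"{indent}- {label}")
    return "\n".join(lines)


# Hand-written sample in the assumed shape of metadata["headers"].
sample_headers = [
    {"level": 1, "text": "Getting Started", "id": "getting-started"},
    {"level": 2, "text": "Install", "id": None},
    {"level": 4, "text": "Footnote", "id": "footnote"},
]

toc = build_toc(sample_headers)
print(toc)
```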
TypeScript/Node.js Example
import { convertWithMetadata } from 'html-to-markdown-node';

const html = `
<html>
  <head>
    <title>Article</title>
    <meta name="description" content="Tech article">
  </head>
  <body>
    <h1>Web Performance</h1>
    <p>Read our <a href="/blog">blog</a> for tips.</p>
    <img src="/perf.png" alt="Chart" width="1200" height="630">
  </body>
</html>
`;

const { markdown, metadata } = await convertWithMetadata(html, {
  headingStyle: 'Atx',
}, {
  extract_document: true,
  extract_headers: true,
  extract_links: true,
  extract_images: true,
});

console.log(markdown);
console.log(`Found ${metadata.headers.length} headers`);
console.log(`Found ${metadata.links.length} links`);
Ruby Example
require 'html_to_markdown'

html = <<~HTML
  <html>
    <head>
      <title>Documentation</title>
      <meta name="description" content="API Reference">
    </head>
    <body>
      <h2>Installation</h2>
      <p>See our <a href="https://github.com">GitHub</a>.</p>
      <img src="https://example.com/diagram.svg" alt="Architecture" width="960" height="540">
    </body>
  </html>
HTML

# Options and metadata config are passed positionally, matching the signature
# convert_with_metadata(html, options = nil, metadata_config = nil).
markdown, metadata = HtmlToMarkdown.convert_with_metadata(
  html,
  { heading_style: :atx },
  {
    extract_document: true,
    extract_headers: true,
    extract_links: true,
    extract_images: true,
  }
)

puts markdown
puts "Title: #{metadata[:document][:title]}"
puts "Images: #{metadata[:images].length}"
PHP Example
<?php
use HtmlToMarkdown\HtmlToMarkdown;

$html = <<<HTML
<html>
  <head>
    <title>Tutorial</title>
    <meta name="description" content="Step-by-step guide">
  </head>
  <body>
    <h1>Getting Started</h1>
    <p>Check our <a href="https://example.com/guide">guide</a>.</p>
    <img src="https://example.com/steps.png" alt="Steps" width="1024" height="768">
  </body>
</html>
HTML;

[$markdown, $metadata] = convert_with_metadata(
    $html,
    options: ['heading_style' => 'Atx'],
    metadataConfig: [
        'extract_document' => true,
        'extract_headers' => true,
        'extract_links' => true,
        'extract_images' => true,
    ]
);

echo "Title: " . $metadata['document']['title'] . "\n";
echo "Found " . count($metadata['links']) . " links\n";
Go Example
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/Goldziher/html-to-markdown/packages/go/v2/htmltomarkdown"
)

func main() {
	html := `
<html>
  <head>
    <title>Developer Guide</title>
    <meta name="description" content="Complete API reference">
  </head>
  <body>
    <h1>API Overview</h1>
    <p>Learn more at our <a href="https://api.example.com/docs">API docs</a>.</p>
    <img src="https://example.com/api-flow.png" alt="API Flow" width="1280" height="720">
  </body>
</html>
`

	markdown, metadata, err := htmltomarkdown.ConvertWithMetadata(html, &htmltomarkdown.MetadataConfig{
		ExtractDocument:       true,
		ExtractHeaders:        true,
		ExtractLinks:          true,
		ExtractImages:         true,
		ExtractStructuredData: false,
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("Markdown:", markdown)
	fmt.Printf("Title: %s\n", metadata.Document.Title)
	fmt.Printf("Found %d links\n", len(metadata.Links))

	// Marshal to JSON if needed
	jsonBytes, _ := json.MarshalIndent(metadata, "", " ")
	fmt.Println(string(jsonBytes))
}
Java Example
import io.github.goldziher.htmltomarkdown.HtmlToMarkdown;
import io.github.goldziher.htmltomarkdown.ConversionResult;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class MetadataExample {
    public static void main(String[] args) {
        String html = """
            <html>
              <head>
                <title>Java Guide</title>
                <meta name="description" content="Complete Java bindings documentation">
              </head>
              <body>
                <h1>Quick Start</h1>
                <p>Visit our <a href="https://github.com/Goldziher/html-to-markdown">GitHub</a>.</p>
                <img src="https://example.com/java-flow.png" alt="Flow diagram" width="1024" height="576">
              </body>
            </html>
            """;

        try {
            ConversionResult result = HtmlToMarkdown.convertWithMetadata(
                html,
                new HtmlToMarkdown.MetadataOptions()
                    .extractDocument(true)
                    .extractHeaders(true)
                    .extractLinks(true)
                    .extractImages(true)
            );

            System.out.println("Markdown:\n" + result.getMarkdown());
            System.out.println("Title: " + result.getMetadata().getDocument().getTitle());
            System.out.println("Links found: " + result.getMetadata().getLinks().size());

            // Pretty-print metadata as JSON
            Gson gson = new GsonBuilder().setPrettyPrinting().create();
            System.out.println(gson.toJson(result.getMetadata()));
        } catch (HtmlToMarkdown.ConversionException e) {
            System.err.println("Conversion failed: " + e.getMessage());
        }
    }
}
C# Example
using HtmlToMarkdown;
using System.Text.Json;

var html = @"
<html>
  <head>
    <title>C# Guide</title>
    <meta name=""description"" content=""Official C# bindings documentation"">
  </head>
  <body>
    <h1>Introduction</h1>
    <p>See our <a href=""https://github.com/Goldziher/html-to-markdown"">repository</a>.</p>
    <img src=""https://example.com/csharp-arch.png"" alt=""Architecture"" width=""1200"" height=""675"">
  </body>
</html>
";

try
{
    var result = HtmlToMarkdownConverter.ConvertWithMetadata(
        html,
        new MetadataConfig
        {
            ExtractDocument = true,
            ExtractHeaders = true,
            ExtractLinks = true,
            ExtractImages = true,
        }
    );

    Console.WriteLine("Markdown:");
    Console.WriteLine(result.Markdown);
    Console.WriteLine($"Title: {result.Metadata.Document.Title}");
    Console.WriteLine($"Links found: {result.Metadata.Links.Count}");

    // Serialize metadata to JSON
    var options = new JsonSerializerOptions { WriteIndented = true };
    var json = JsonSerializer.Serialize(result.Metadata, options);
    Console.WriteLine(json);
}
catch (HtmlToMarkdownException ex)
{
    Console.Error.WriteLine($"Conversion failed: {ex.Message}");
}
See the individual binding READMEs for detailed metadata extraction options:
- Python: Python README
- TypeScript/Node.js: Node.js README | TypeScript README
- Ruby: Ruby README
- PHP: PHP README
- Go: Go README
- Java: Java README
- C#/.NET: C# README
- WebAssembly: WASM README
- Rust: Rust README
Python (v2 API)
from html_to_markdown import convert, convert_with_inline_images, InlineImageConfig
html = "<h1>Hello</h1><p>Rust ❤️ Markdown</p>"
markdown = convert(html)
markdown, inline_images, warnings = convert_with_inline_images(
'<img src="data:image/png;base64,...==" alt="Pixel">',
image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
)
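To make the idea of a decoded-size limit concrete, here is a minimal standalone sketch of decoding a base64 `data:` URI under a byte cap. It illustrates the concept behind a setting like `max_decoded_size_bytes` and is not the library's implementation; `decode_data_uri` is a hypothetical helper built only on the Python standard library.

```python
import base64
import binascii


def decode_data_uri(uri, max_decoded_size_bytes=1024):
    """Decode a base64 data URI and return (mime_type, payload_bytes).

    Raises ValueError for non-base64 URIs, malformed payloads, or oversized data.
    """
    if not uri.startswith("data:") or "," not in uri:
        raise ValueError("not a data URI")
    header, encoded = uri[len("data:"):].split(",", 1)
    parts = header.split(";")
    if "base64" not in parts[1:]:
        raise ValueError("only base64 data URIs are handled in this sketch")
    # Cheap pre-check: n base64 chars decode to at least (n // 4) * 3 - 2 bytes,
    # so clearly oversized payloads are rejected before any decoding work.
    if (len(encoded) // 4) * 3 > max_decoded_size_bytes + 2:
        raise ValueError("payload exceeds size limit")
    try:
        payload = base64.b64decode(encoded, validate=True)
    except binascii.Error as exc:
        raise ValueError("malformed base64 payload") from exc
    if len(payload) > max_decoded_size_bytes:
        raise ValueError("payload exceeds size limit")
    return parts[0] or "text/plain", payload


# The 8-byte PNG signature stands in for real image data.
pixel = "data:image/png;base64," + base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")
mime, data = decode_data_uri(pixel)
print(mime, len(data))
```

Rejecting oversized payloads before decoding keeps a hostile document from forcing large allocations, which is the usual motivation for a cap of this kind.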
Elixir
{:ok, markdown} = HtmlToMarkdown.convert("<h1>Hello</h1>")
# Keyword options are supported (internally mapped to the Rust ConversionOptions struct)
HtmlToMarkdown.convert!("<p>Wrap me</p>", wrap: true, wrap_width: 32, preprocessing: %{enabled: true})
Rust
use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle};
let html = "<h1>Welcome</h1><p>Fast conversion</p>";
let markdown = convert(html, None)?;
let options = ConversionOptions {
    heading_style: HeadingStyle::Atx,
    ..Default::default()
};
let markdown = convert(html, Some(options))?;
See the language-specific READMEs for complete configuration, hOCR workflows, and inline image extraction.
Performance
Benchmarked on Apple M4 using the shared fixture harness in tools/benchmark-harness (latest consolidated run: 20409971461).
Comparative Throughput (Median Across Fixtures)
| Runtime | Median ops/sec | Median throughput (MB/s) | Peak memory (MB) | Successes |
|---|---|---|---|---|
| Rust | 1,060.3 | 116.4 | 171.3 | 56/56 |
| Go | 1,496.3 | 131.1 | 22.9 | 16/16 |
| Ruby | 2,155.5 | 300.4 | 280.3 | 48/48 |
| PHP | 2,357.7 | 308.0 | 223.5 | 48/48 |
| Elixir | 1,564.1 | 269.1 | 384.7 | 48/48 |
| C# | 1,234.2 | 272.4 | 187.8 | 16/16 |
| Java | 1,298.7 | 167.1 | 527.2 | 16/16 |
| WASM | 1,485.8 | 157.6 | 95.3 | 48/48 |
| Node.js (NAPI) | 2,054.2 | 306.5 | 95.4 | 48/48 |
| Python (PyO3) | 3,120.3 | 307.5 | 83.5 | 48/48 |
Use task bench:harness to regenerate throughput numbers across the bindings, task bench:harness:memory for CPU/memory samples, and task bench:harness:rust for flamegraphs.
Testing
Use the task runner to execute the entire matrix locally:
# All core test suites (Rust, Python, Ruby, Node, PHP, Go, C#, Elixir, Java)
task test
# Run the Wasmtime-backed WASM integration tests
task wasm:test:wasmtime
The Wasmtime suite builds the html-to-markdown-wasm artifact with the same flags used in CI and drives it through Wasmtime to ensure the non-JS runtime behaves exactly like the browser/Deno builds.
Compatibility (v1 → v2)
- V2's Rust core sustains 150–210 MB/s throughput; V1 averaged ≈ 2.5 MB/s in its Python/BeautifulSoup implementation (60–80× faster).
- The Python package offers a compatibility shim in html_to_markdown.v1_compat (convert_to_markdown, convert_to_markdown_stream, markdownify). The shim is deprecated, emits DeprecationWarning on every call, and will be removed in v3.0, so plan migrations now. Details and keyword mappings live in the Python README.
- CLI flag changes, option renames, and other breaking updates are summarised in the CHANGELOG.
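Because the shim warns on every call, remaining v1 call sites can be located by recording `DeprecationWarning`s during a test run. The sketch below shows the general technique with the standard `warnings` module; `legacy_convert` is a stand-in written for this example (not the real shim), and `find_deprecated_calls` is a hypothetical helper.

```python
import warnings


def legacy_convert(html):
    """Stand-in for a deprecated v1-style entry point (not the real shim)."""
    warnings.warn(
        "legacy_convert is deprecated; use convert()",
        DeprecationWarning,
        stacklevel=2,
    )
    return html


def find_deprecated_calls(func, *args, **kwargs):
    """Call func and return (result, DeprecationWarning messages it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" ensures repeated warnings are recorded rather than suppressed.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    messages = [
        str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)
    ]
    return result, messages


result, messages = find_deprecated_calls(legacy_convert, "<p>Hi</p>")
print(messages)
```

In a test suite, the same effect can be had globally by running Python with `-W error::DeprecationWarning`, which turns each deprecated call into a failure.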
Community
- Chat with us on Discord
- Explore the broader Kreuzberg document-processing ecosystem
- Sponsor development via GitHub Sponsors
Ruby
require 'html_to_markdown'
html = '<h1>Hello</h1><p>Rust ❤️ Markdown</p>'
markdown = HtmlToMarkdown.convert(html, heading_style: :atx, wrap: true)
puts markdown
# # Hello
#
# Rust ❤️ Markdown
| Product | Compatible and computed target framework versions |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net10.0: No dependencies.
| Version | Downloads | Last Updated |
|---|---|---|
| 2.16.1 | 192 | 12/22/2025 |
| 2.16.0 | 182 | 12/22/2025 |
| 2.15.0 | 260 | 12/19/2025 |
| 2.14.11 | 281 | 12/16/2025 |
| 2.14.10 | 270 | 12/16/2025 |
| 2.14.9 | 267 | 12/16/2025 |
| 2.14.8 | 281 | 12/15/2025 |
| 2.14.7 | 266 | 12/15/2025 |
| 2.14.6 | 250 | 12/15/2025 |
| 2.14.5 | 247 | 12/15/2025 |
| 2.14.4 | 249 | 12/15/2025 |
| 2.14.2 | 114 | 12/13/2025 |
| 2.14.1 | 132 | 12/12/2025 |
| 2.14.0 | 418 | 12/11/2025 |
| 2.13.0 | 424 | 12/10/2025 |
| 2.12.1 | 429 | 12/9/2025 |
| 2.12.0 | 445 | 12/8/2025 |
| 2.11.4 | 417 | 12/8/2025 |
| 2.11.3 | 410 | 12/8/2025 |
| 2.11.1 | 181 | 12/5/2025 |