[Image: FormatArc HTML to Markdown conversion result, useful for preparing LLM context]
Published: 2026-05-21

Markdown vs HTML for LLMs: Tokens, Accuracy, and No-Upload Conversion

Compare HTML and Markdown as input formats for ChatGPT, Claude, and Gemini. Measured size and token estimates, table extraction accuracy, and a browser-side workflow that keeps internal HTML off third-party servers.

Pasting a raw web page into ChatGPT, Claude, or Gemini works, but you almost always pay for it twice: in tokens, and in answer quality. The HTML you copied is full of wrapper divs, class attributes, inline scripts, and tracking pixels — none of which the model needs to understand the content. Markdown strips that out and leaves the model with just the structure and the text.

This guide compares Markdown and HTML as LLM input formats with measured numbers, lists the cases where HTML is still the right choice, and shows a browser-side conversion path that does not upload internal HTML to a third-party server.

Quick answer

For LLM input, prefer Markdown. In the measurements below it uses anywhere from a little over half the tokens of equivalent HTML (on a small, clean sample) down to about one-eighth (on wrapper-heavy real-world pages), and external benchmarks and comparisons favor it on tables, lists, and code blocks. Paste your HTML into the HTML to Markdown tool and copy the result into your prompt. The conversion runs entirely in your browser: the HTML you paste is not uploaded to FormatArc or any third-party service.

How LLMs read formats

LLMs do not "see" rendered HTML the way a browser does. They process the raw source as a token stream. Every angle bracket, class name, and inline style consumes tokens from the same context window your actual content has to fit inside.

Two consequences follow:

  • A wrapper-heavy HTML page leaves less room for instructions, examples, and the assistant's reply.
  • Noise inside the source — class="text-base text-gray-700", data-* attributes, analytics snippets — can distract the model from the content it is supposed to extract or rewrite.

Markdown encodes the same structure (headings, lists, links, code) with one or two punctuation marks per element instead of opening and closing tags. The result is shorter and more consistent with patterns that are common in public technical text: GitHub README files, documentation sites, Stack Overflow posts, and forum threads are largely written in Markdown or Markdown-like markup.
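To make the difference concrete, here is a minimal sketch that counts characters for the same heading and two-item list written both ways. The two snippets, their class names, and the variable names are made up for illustration and are not taken from a real page.

```python
# The same heading and two-item list, once as wrapper-style HTML and once as Markdown.
# Both snippets are hypothetical examples.
html_source = (
    '<div class="prose">'
    '<h2 class="text-xl font-bold">Syntax</h2>'
    '<ul class="list-disc"><li>Objects</li><li>Arrays</li></ul>'
    "</div>"
)
markdown_source = "## Syntax\n\n- Objects\n- Arrays\n"

html_chars = len(html_source)
markdown_chars = len(markdown_source)

print(f"HTML: {html_chars} characters")
print(f"Markdown: {markdown_chars} characters")
print(f"Markdown is {markdown_chars / html_chars:.0%} of the HTML size")
```

Character counts are only a proxy for tokens, but the markup overhead they expose is what the model has to read.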

The exact training mix for each model is not public, so do not over-claim here. What is verifiable is that Markdown is widely used in public technical text, and major model providers explicitly recommend Markdown-like structure in their prompting guides — for example, Anthropic's Claude prompting docs and Google's Gemini prompt-design guide both suggest using headings and bullet lists to separate sections.

Token efficiency: a measured comparison

To compare formats fairly, take a small but realistic technical document — a "What is JSON?" introduction with a paragraph, a heading, a bullet list, a code block, and a three-column table — and express it three ways. The HTML version uses the wrapper pattern you would see on a typical CMS page. The Markdown version uses CommonMark plus the GFM table extension. The plain-text version is what you get from "Copy as plain text" in your browser.

| Format | Characters | UTF-8 bytes | OpenAI-tokenizer estimate (tokens) |
| --- | --- | --- | --- |
| HTML (rendered DOM with classes and aria) | 1,389 | 1,389 | ~348 |
| Markdown (GFM) | 791 | 791 | ~198 |
| Plain text (tags stripped) | 638 | 638 | ~160 |

Reduction vs HTML: Markdown -43% chars, plain text -54% chars. The token estimates use the OpenAI rule-of-thumb that English averages around four characters per token; Claude and Gemini tokenizers differ in detail but the ratios stay in the same range.
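The arithmetic behind the table can be reproduced in a few lines. This is a sketch of the four-characters-per-token rule of thumb only, not a real tokenizer. Rounding up is an assumption that happens to match the figures above, and the function names are illustrative.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text, not a real tokenizer


def estimate_tokens(char_count: int) -> int:
    """Rough token estimate: characters divided by four, rounded up (assumed)."""
    return math.ceil(char_count / CHARS_PER_TOKEN)


def reduction_percent(baseline: int, candidate: int) -> int:
    """Whole-number percentage by which candidate is smaller than baseline."""
    return round((1 - candidate / baseline) * 100)


# Character counts from the table above.
sizes = {"HTML": 1389, "Markdown": 791, "Plain text": 638}

for name, chars in sizes.items():
    print(
        f"{name}: {chars} chars, ~{estimate_tokens(chars)} tokens, "
        f"{reduction_percent(sizes['HTML'], chars)}% smaller than HTML"
    )
```

For billing-grade numbers, count with the tokenizer of the model you actually use. The rule of thumb is only good for comparing formats against each other.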

The web is messier than this sample. External measurements on real pages report sharper gaps:

  • Web2MD's controlled test on a 500-word article: HTML ~2,100 tokens vs Markdown ~700 tokens — a 3x reduction.
  • Beam.ai's 2026 write-up: 68% token reduction on clean content, up to 87% on real-world pages with full DOM scaffolding.
  • ReleasePad's analysis: 10–20% reduction on simple bullet lists, scaling sharply as wrapper density increases.

Whichever benchmark you trust, the direction is the same: HTML pays a wrapper tax that Markdown avoids, and the tax compounds for context-heavy prompts where you want to fit several documents into one window.

Privacy: keep sensitive HTML out of third-party services

The token and accuracy arguments are well-rehearsed in other write-ups. The angle most of them skip is what happens to the source HTML during the conversion step.

Many online "HTML to Markdown" converters run the conversion on a backend server. You paste your HTML, the page POSTs it to an API, the server returns Markdown. That is fine for a public Wikipedia snippet. It is not fine for:

  • An internal documentation page exported from Confluence or Notion.
  • An admin dashboard's HTML dump that contains customer names.
  • A staging-site response body before the marketing copy is approved.
  • Any HTML email body with personal data.

FormatArc's HTML to Markdown is a static page. The Markdown conversion is implemented in JavaScript that runs in your browser, using the Turndown library bundled with the page. The HTML you paste is parsed locally; no network request carries the source HTML to FormatArc or to any third-party service. You can verify this yourself by opening the browser DevTools network tab, pasting a uniquely identifiable string into the HTML pane, pressing Run, and confirming that no outbound request contains your string.
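If you would rather not inspect every request by eye, the same check can be scripted against a HAR file exported from the network tab. The sketch below assumes the standard HAR layout (log.entries[].request with url and an optional postData.text). The sample data, the marker string, and the function names are invented for illustration.

```python
import json


def load_har(path: str) -> dict:
    """Read a HAR file exported from the DevTools network tab."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def requests_containing(har: dict, marker: str) -> list[str]:
    """Return the URLs of requests whose URL or body contains the marker."""
    hits = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        url = request.get("url", "")
        body = (request.get("postData") or {}).get("text") or ""
        if marker in url or marker in body:
            hits.append(url)
    return hits


# Hand-written, HAR-shaped sample for illustration (not a real capture).
sample_har = {
    "log": {
        "entries": [
            {"request": {"method": "GET", "url": "https://example.com/tool.js"}},
            {
                "request": {
                    "method": "POST",
                    "url": "https://example.com/collect",
                    "postData": {"text": "event=pageview"},
                }
            },
        ]
    }
}

# An empty list means no captured request carried the marker.
print(requests_containing(sample_har, "zq7marker2026"))
```

Pick a marker made only of letters and digits so URL encoding cannot disguise it. A plain substring search will not see inside compressed or binary request bodies, so treat an empty result as strong evidence rather than proof.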

"Browser-side" here means the source HTML is not uploaded. The page itself is still served over HTTPS from a CDN, and standard analytics may be loaded once on first visit — but the document you convert never leaves your machine.

Comprehension quality: tables, code, and lists

Token count is the easy half. Comprehension quality is the half that decides whether your prompt actually works.

Published benchmarks tend to favor Markdown on three common tasks:

  • Table extraction: a frequently cited GPT-based evaluation reports ~60.7% accuracy on Markdown tables vs ~53.6% on equivalent HTML tables, a 7-point gap on identical underlying data.
  • Code block handling: Markdown fenced code with a language hint (```python) preserves the language signal cleanly; HTML often nests the language hint inside a class attribute (<pre><code class="language-python">) which the model has to parse out of the markup.
  • Nested lists: Markdown indentation gives a strong structural cue with low token cost. HTML <ul><li><ul><li> chains burn tokens and occasionally trip the model on which list a child item belongs to.
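The code-block point can be illustrated with a small sketch that lifts the language hint out of the class attribute and emits a fenced block. It uses a regular expression for brevity and only handles the exact <pre><code ...> shape shown above. Real converters such as Turndown work on a parsed DOM, and the function name here is hypothetical.

```python
import html
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this listing readable

CODE_BLOCK = re.compile(r"<pre><code([^>]*)>(.*?)</code></pre>", re.DOTALL)
LANGUAGE_CLASS = re.compile(r"\blanguage-([\w+#-]+)")


def fence_code_blocks(source: str) -> str:
    """Replace <pre><code> blocks with fenced Markdown, keeping the language hint."""

    def to_fence(match: re.Match) -> str:
        hint = LANGUAGE_CLASS.search(match.group(1))
        language = hint.group(1) if hint else ""
        # Entities such as &lt; must be decoded, since fenced code is literal text.
        code = html.unescape(match.group(2)).rstrip("\n")
        return f"{FENCE}{language}\n{code}\n{FENCE}"

    return CODE_BLOCK.sub(to_fence, source)


sample = '<pre><code class="language-python">print(1 &lt; 2)\n</code></pre>'
print(fence_code_blocks(sample))
```

When no language class is present, the sketch still emits a fence but leaves the hint empty, which is the case where the model has to guess.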

None of this means Markdown is universally more accurate — see the next section for where HTML wins — but for the everyday "summarize this article", "extract these fields", or "rewrite this section" pattern, the comprehension data points the same direction as the token data.

Where HTML is still the right choice

Markdown is not always the answer. There are three scenarios where pasting HTML is genuinely better:

When semantics live in the attributes

aria-label, role, itemprop, microdata, and Open Graph tags carry information that has no Markdown equivalent. If you are asking the model to audit accessibility, extract structured product metadata, or check schema.org markup, the HTML attributes are the content. Stripping them with a Markdown converter destroys the task.

When you need the visual layout, not just the text

SVG diagrams, embedded charts, <iframe> widgets, custom data attributes for interactive components — these survive in HTML and disappear in Markdown. Recent commentary from Anthropic engineers has argued that for AI agents producing rich human-facing output, HTML's expressive range (styled layouts, interactive elements, embedded SVG) is worth its higher token cost. The point applies symmetrically: if your input contains visuals the model needs to reason about, send HTML.

When the model is going to render the result

If you are asking the model to produce output that will be rendered back into a browser, sometimes it is simpler to keep HTML end-to-end and skip the Markdown intermediate. This is mostly a tooling question — Markdown round-trips well enough that it rarely forces a decision.

A practical workflow: HTML to LLM-ready Markdown

For the common case — you have a web page or HTML email and you want to give an LLM the content without the markup tax — here is the workflow that keeps the data local.

Step 1: Get the HTML

In Chrome or Firefox, right-click the page and choose "View page source", or use the DevTools Elements panel (the Inspector in Firefox) and copy the outer HTML of the <article> or <main> element. For HTML email, use your mail client's option for viewing the message source (often labeled "View source" or "Show original").

If you only need the article body, copy that subtree rather than the whole page. Removing the navigation, sidebar, and footer at this stage cuts the token budget more than any clever conversion later.
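When you already have the full page source saved, the same subtree selection can be scripted. The sketch below uses only Python's standard html.parser and keeps the first <article> or <main> element it finds. The class and function names are illustrative, comments inside the subtree are dropped, and malformed markup may need a more forgiving parser.

```python
from html.parser import HTMLParser


class SubtreeExtractor(HTMLParser):
    """Collects the markup of the first <article> or <main> element."""

    def __init__(self, targets=("article", "main")):
        # Keep entities as written so the extracted markup stays valid HTML.
        super().__init__(convert_charrefs=False)
        self.targets = targets
        self.active = None  # tag name being captured, once found
        self.depth = 0      # nesting depth of that tag name
        self.done = False
        self.parts = []

    def _capturing(self) -> bool:
        return self.active is not None and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.active is None and tag in self.targets:
            self.active = tag
        if self.active is not None:
            if tag == self.active:
                self.depth += 1
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without an end tag.
        if self._capturing():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._capturing():
            return
        self.parts.append(f"</{tag}>")
        if tag == self.active:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self._capturing():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._capturing():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._capturing():
            self.parts.append(f"&#{name};")


def extract_main_content(page_html: str) -> str:
    """Return the first <article>/<main> subtree, or an empty string if none exists."""
    parser = SubtreeExtractor()
    parser.feed(page_html)
    parser.close()
    return "".join(parser.parts)


page = (
    '<html><body><nav><a href="/">Home</a></nav>'
    "<main><h1>Title</h1><p>Tom &amp; Jerry<br></p></main>"
    "<footer>(c)</footer></body></html>"
)
print(extract_main_content(page))
```

The output still needs converting to Markdown, but the navigation and footer are already gone before any tokens are spent on them.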

Step 2: Convert in your browser

Paste into HTML to Markdown and press Run. The Markdown appears in the right pane.

[Image: HTML to Markdown conversion result, used for preparing LLM context]

For a deeper walkthrough of the conversion itself — tables, image paths, colspan edge cases — see the HTML to Markdown guide. For the reverse direction when the LLM responds with Markdown that you want back as HTML, Markdown to HTML handles it locally too.

Step 3: Trim before pasting

Even after conversion, scan the Markdown for residue you do not need:

  • Navigation links that converted into list items at the top.
  • Cookie-consent banners that still left a paragraph behind.
  • Footer copyright blocks.

Two minutes of manual trimming usually buys back more context than any further automated step.
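If you convert many pages, the first item on that list can be partly automated. The sketch below drops a leading run of list items that consist of nothing but a link, which is what converted navigation usually looks like. It is a heuristic with a hypothetical function name: it will also remove a legitimate table of contents at the top of a page, so review the result.

```python
import re

# A list item whose entire content is one Markdown link, e.g. "- [Home](/)".
LINK_ONLY_ITEM = re.compile(r"^\s*[-*+]\s+\[[^\]]*\]\([^)]*\)\s*$")


def drop_leading_nav_links(markdown: str) -> str:
    """Remove link-only list items (and blank lines) from the top of the text."""
    lines = markdown.splitlines()
    index = 0
    while index < len(lines) and (
        not lines[index].strip() or LINK_ONLY_ITEM.match(lines[index])
    ):
        index += 1
    return "\n".join(lines[index:])


converted = (
    "- [Home](/)\n"
    "- [Docs](/docs)\n"
    "\n"
    "# What is JSON?\n"
    "\n"
    "- [Spec](https://example.com) is linked later\n"
)
print(drop_leading_nav_links(converted))
```

Only the run at the very top is touched. Link lists further down the document are left for manual review.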

Step 4: Prompt with the cleaned Markdown

A simple template that works for most extraction tasks:

Below is a documentation page in Markdown.

Task: <one sentence>.
Constraints: <output format, length, etc.>.

---

<paste the cleaned Markdown here>

The Markdown headings give the model strong anchors to reference in its reply ("In the 'Syntax' section..."), which improves answer specificity.
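If you assemble prompts in a script rather than by hand, the template above maps onto a small helper. This is a sketch. The function and parameter names are illustrative, and the example task and document are made up.

```python
def build_prompt(task: str, constraints: str, markdown: str) -> str:
    """Fill the extraction template with a task, constraints, and cleaned Markdown."""
    return (
        "Below is a documentation page in Markdown.\n\n"
        f"Task: {task}\n"
        f"Constraints: {constraints}\n\n"
        "---\n\n"
        f"{markdown.strip()}\n"
    )


prompt = build_prompt(
    task="List every JSON data type mentioned on the page.",
    constraints="Bullet list, at most ten items.",
    markdown="# What is JSON?\n\n## Syntax\n\n- Objects\n- Arrays\n",
)
print(prompt)
```

Keeping the instructions above the separator and the document below it means the same helper works unchanged for any page you convert.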

Pitfalls when converting HTML for LLM input

Five things tend to go wrong. Watch for these:

  1. Code blocks losing their language hint. <pre><code class="language-python"> should become ```python. Some converters drop the hint, which forces the model to guess the language.
  2. Tables with colspan or rowspan collapsing. GFM pipe tables are strictly rectangular, so merged cells get flattened. For data tables, consider converting via CSV to Markdown instead. See also the Markdown table syntax cheatsheet.
  3. Inline HTML leaking through. CommonMark and GFM both allow raw HTML inline. If the model is going to see <span class="text-red">important</span> in your "Markdown", that is back in the wrapper-tax bucket. Use a converter that emits pure Markdown for the cases it can, and keep raw HTML only for the constructs (mathematical notation, complex tables) that genuinely need it.
  4. Relative image and link paths. <img src="/images/foo.png"> becomes ![](/images/foo.png), which the LLM cannot fetch. Either rewrite the paths to absolute URLs or note in the prompt that images are unavailable.
  5. CommonMark vs GFM mismatch. Tables, task lists, strikethrough, and autolinks are GFM extensions. If your downstream tooling is strict CommonMark, those features will not render. See CommonMark vs GFM for the boundary.
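Pitfall 4 is the easiest of these to script. The sketch below rewrites link and image targets against the URL of the page the HTML came from, using urljoin from Python's standard library. The base URL is a placeholder and the function name is illustrative. The regular expression is deliberately simple: it does not handle URLs that contain parentheses, and it will also rewrite link-like text inside code blocks.

```python
import re
from urllib.parse import urljoin

# Matches the "](" that opens a link or image target, then the target itself.
LINK_TARGET = re.compile(r"(\]\()([^)\s]+)")


def absolutize_links(markdown: str, base_url: str) -> str:
    """Resolve relative link and image targets against base_url."""

    def fix(match: re.Match) -> str:
        return match.group(1) + urljoin(base_url, match.group(2))

    return LINK_TARGET.sub(fix, markdown)


# Placeholder base URL: use the address of the page you copied the HTML from.
base = "https://example.com/blog/post/"
text = "![](/images/foo.png) and [docs](guide.html)"
print(absolutize_links(text, base))
```

Targets that are already absolute pass through unchanged, so it is safe to run over Markdown that mixes both kinds.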

Format comparison at a glance

For the impatient, here is the decision matrix:

| Format | When to use as LLM input | Token cost | Strength | Weakness |
| --- | --- | --- | --- | --- |
| Markdown | Default for most prompts: docs, articles, READMEs, chat logs | Low | Structural cues match training data; tables, lists, code preserved | Loses attribute semantics, no inline styling |
| Plain text | Pure text extraction, OCR-like tasks | Lowest | Smallest footprint | Structure is gone; bad for lists or tables |
| HTML | Accessibility audits, schema.org / microdata, visual layout reasoning | Highest | Carries attributes, semantics, embedded media | Wrapper tax; noise distracts the model |
| JSON | Structured records, API responses, function-call payloads | Medium | Unambiguous schema; the model can pattern-match keys | Verbose for prose; quoting overhead |
| XML | Anthropic recommends XML tags for prompt sections in Claude | Medium | Explicit boundaries between prompt parts | Verbose; CommonMark structure is usually sufficient |

For most everyday prompts — "summarize this article", "extract these fields", "rewrite in plain English" — Markdown is the right default.

Frequently asked questions

Should I use Markdown or plain text for ChatGPT context?

Markdown if the source has any structure (headings, lists, tables, code). Plain text if it is genuinely flat prose. Plain text is cheapest but discards the structural cues that help the model navigate longer contexts.

Does Claude understand Markdown better than HTML?

Claude handles both. Anthropic's prompting guidance recommends Markdown-style headings and lists to separate prompt sections, and additionally encourages XML tags (<instructions>, <context>) as boundaries between prompt parts. Markdown still wins on token efficiency for the content itself; XML is helpful around the content as scaffolding.
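The two layers combine naturally: XML tags mark the boundaries, Markdown carries the content inside them. A minimal sketch follows, using the tag names mentioned above. The function name and the sample strings are illustrative.

```python
def wrap_sections(instructions: str, context_markdown: str) -> str:
    """Wrap instructions and Markdown content in XML-style boundary tags."""
    return (
        f"<instructions>\n{instructions.strip()}\n</instructions>\n\n"
        f"<context>\n{context_markdown.strip()}\n</context>\n"
    )


print(
    wrap_sections(
        "Summarize the page in three bullet points.",
        "# What is JSON?\n\nJSON is a text format for structured data.\n",
    )
)
```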

What about JSON or XML for structured context?

Use JSON when the data is naturally tabular or record-shaped (API responses, configuration). Use XML when you want explicit boundaries between sections of the prompt — Anthropic's docs use this style. For prose, neither beats Markdown on token cost.

How do I convert a URL directly to LLM-ready Markdown?

A static page cannot fetch an arbitrary URL fully client-side: the browser's same-origin policy stops scripts from reading cross-origin responses unless the target site opts in with CORS headers, and most sites do not. Save the page locally first (Cmd/Ctrl-S, or copy the source from DevTools) and paste the HTML into HTML to Markdown. The conversion itself stays in your browser.

Is FormatArc's conversion really browser-only?

Yes for the conversion step. The HTML you paste is parsed by the Turndown JavaScript library bundled with the page, and no request is sent containing the source. The page itself loads from a CDN over HTTPS and may make standard analytics calls on first visit, but your pasted HTML is not part of any outbound request.

Wrapping up

For LLM context, Markdown beats HTML on both axes that matter: it costs fewer tokens, and it lets the model focus on content rather than scaffolding. The exact ratio depends on the source page, but the direction is consistent across published benchmarks.

If the source HTML is sensitive — internal docs, customer data, unpublished drafts — the conversion step itself matters. HTML to Markdown runs locally in your browser, so the source never reaches a third-party server.

For the reverse direction, see Markdown to HTML and the Markdown to HTML guide. For specific issues with tables in your converted Markdown, the Markdown table syntax cheatsheet covers escaping, alignment, and pitfalls.