Structured Data for LLMs

Large language models are becoming a primary way people discover and consume information. Structured data, originally designed for search engines, turns out to be equally valuable for AI systems. This page explains how LLMs interact with structured data and what you can do to make your content more useful to them.

How LLMs Encounter Structured Data

LLMs encounter structured data in three main ways.

Training Data

Models trained on web crawls (Common Crawl, etc.) ingest billions of pages that contain JSON-LD, Microdata, and RDFa. During training, models learn the patterns and semantics of schema.org types. They learn that a Product has a name, price, and offers. They learn that an Article has a headline, author, and datePublished.

This means LLMs already understand the schema.org vocabulary. When they encounter it, they can extract meaning more reliably than from unstructured prose alone.

RAG Pipelines

Retrieval-augmented generation (RAG) systems fetch external documents at inference time and feed them into the model’s context window. When those documents contain structured data, the model can extract facts with higher precision.

Web Scraping and Tool Use

AI agents that browse the web parse HTML directly. JSON-LD provides a clean, machine-readable summary of a page’s content. An agent looking for product prices does not need to parse layout-dependent HTML tables. It reads the Offer object.
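The extraction step an agent performs can be sketched in a few lines of Python using only the standard library's html.parser. The page snippet and Product data below are illustrative, not taken from a real site:

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect every <script type="application/ld+json"> block on a page."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # parsed JSON-LD objects
        self._buffer = None    # text of the block currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._buffer = None

html_page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Ergonomic Desk Chair",
 "offers": {"@type": "Offer", "price": 499.99, "priceCurrency": "USD"}}
</script>
</head><body>...layout markup an agent can ignore...</body></html>"""

parser = JSONLDExtractor()
parser.feed(html_page)
product = parser.blocks[0]
print(product["offers"]["price"])  # 499.99, read straight from the Offer
```

The agent never touches the layout markup; the Offer object carries everything it needs.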

What Structured Data Tells an LLM

Unstructured text requires interpretation. Structured data provides explicit declarations.

Consider this prose: “The Ergonomic Desk Chair by ErgoSit costs $499.99 and is currently in stock.”

An LLM can probably extract the product name, brand, price, and availability from that sentence. But “probably” is doing a lot of work. What if the sentence is more ambiguous? What if the price is mentioned three paragraphs away from the product name?

Now consider the equivalent JSON-LD:

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Ergonomic Desk Chair",
  "brand": { "@type": "Brand", "name": "ErgoSit" },
  "offers": {
    "@type": "Offer",
    "price": 499.99,
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  }
}

There is no ambiguity. The product name is "Ergonomic Desk Chair". The price is 499.99 in USD. Availability is InStock. An LLM can extract these facts directly, without inference.

Structured data tells an LLM:

  • What type of thing is being described (Product, Person, Event, Article).
  • What properties that thing has, with explicit labels.
  • How entities relate to each other (the author of the article, the brand of the product).
  • Factual claims with precise values (dates, prices, ratings).

JSON-LD vs Prose: How Models Process Each

LLMs process text as sequences of tokens. Both JSON-LD and prose are tokenized and processed the same way at a mechanical level. But the information density and clarity differ significantly.

Prose is optimized for human reading. It uses pronouns, context-dependent references, and implicit knowledge. “It’s on sale until Friday” requires the reader to know what “it” refers to and which Friday.

JSON-LD is optimized for machine reading. Every fact is explicit. Every relationship is labeled. There are no pronouns, no implicit references, no ambiguity.

When an LLM has both prose and JSON-LD on a page, the structured data acts as a reliable anchor. The model can use it to:

  • Confirm facts mentioned in the prose.
  • Resolve ambiguities in natural language.
  • Extract specific data points without parsing complex sentences.
  • Build a knowledge graph of entities on the page.

Structured Data in RAG Pipelines

RAG pipelines typically work in three steps: retrieve relevant documents, extract content, and feed it to the model as context. Structured data improves each step.

Better Retrieval

When you index pages for retrieval, you can extract structured data fields (title, description, type) as metadata. This lets you filter and rank results by entity type. A query about a product can prioritize pages with Product structured data.
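One way to use those fields at retrieval time, sketched with an in-memory index; the URLs and entries are made up for illustration:

```python
# Hypothetical mini-index: each entry keeps the page URL plus metadata
# pulled from that page's JSON-LD block during indexing.
index = [
    {"url": "/chairs/ergo", "type": "Product", "name": "Ergonomic Desk Chair"},
    {"url": "/blog/desk-setup", "type": "Article", "name": "Desk Setup Guide"},
    {"url": "/about", "type": "Organization", "name": "ErgoSit"},
]

def retrieve(query_type, index):
    """Rank pages whose primary entity matches the requested schema.org type first."""
    return sorted(index, key=lambda entry: entry["type"] != query_type)

results = retrieve("Product", index)
print(results[0]["url"])  # product pages rank first for a product query
```

A production system would combine this type filter with text relevance scoring; the point is that the `@type` field gives the ranker a signal prose cannot.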

Cleaner Extraction

Instead of extracting all visible text from a page (which includes navigation, footers, ads, and boilerplate), you can extract the JSON-LD block. This gives you a clean, focused representation of the page’s primary content.

More Accurate Answers

When the model receives structured data in its context window, it can answer factual questions with higher accuracy. “What is the price?” becomes a lookup rather than an inference.

Example of a structured context chunk for a RAG system:

{
  "@type": "LocalBusiness",
  "name": "Downtown Coffee Roasters",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "456 Main Street",
    "addressLocality": "Portland",
    "addressRegion": "OR",
    "postalCode": "97201"
  },
  "telephone": "+1-503-555-0142",
  "openingHours": ["Mo-Fr 06:00-18:00", "Sa-Su 07:00-16:00"],
  "priceRange": "$$"
}

A model given this chunk can answer “What are the weekend hours?” without guessing.
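The same lookup can be done mechanically; a minimal sketch against a trimmed version of the chunk above:

```python
import json

# Trimmed copy of the LocalBusiness chunk shown above.
chunk = json.loads("""{
  "@type": "LocalBusiness",
  "name": "Downtown Coffee Roasters",
  "openingHours": ["Mo-Fr 06:00-18:00", "Sa-Su 07:00-16:00"]
}""")

# Lookup, not inference: find the entry covering Saturday/Sunday.
weekend = next(h for h in chunk["openingHours"] if "Sa" in h or "Su" in h)
print(weekend)  # Sa-Su 07:00-16:00
```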

Making Your Content AI-Readable

You do not need a separate strategy for AI. Good structured data practices serve both search engines and LLMs. Here are practical steps.

Use JSON-LD on Every Page

At minimum, add JSON-LD for the primary entity on each page. An article page gets Article. A product page gets Product. A business listing gets LocalBusiness.
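Emitting that block is straightforward from whatever data source backs the page; a sketch, with every field value (including the author name and date) invented for illustration:

```python
import json

# All field values here are placeholders, not real page data.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Structured Data for LLMs",
    "author": {"@type": "Person", "name": "A. Author"},  # placeholder
    "datePublished": "2024-01-15",                        # placeholder
}

script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(article, indent=2)
    + "\n</script>"
)
print(script_tag)
```

In practice this would run in your templating layer, with the dict populated from the CMS record for the page.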

Be Comprehensive

Do not stop at required properties. Add every relevant property you have data for. The more properties you include, the more facts an LLM can extract.

Use Specific Types

Schema.org has a deep type hierarchy. Use the most specific type available. SoftwareApplication is better than CreativeWork. MedicalClinic is better than LocalBusiness. Specific types carry more semantic meaning.

Keep Data Consistent

The facts in your JSON-LD must match the facts on the visible page. If your JSON-LD says the price is $499 but the page shows $549, you create conflicting signals. LLMs and search engines may flag or ignore your structured data.
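This kind of consistency check can be automated. A rough sketch that compares the structured price against the first dollar amount in the rendered copy (regex-based, so a real audit would be more careful about currencies, sale prices, and multiple amounts):

```python
import json
import re

# Illustrative inputs reproducing the mismatch described above.
jsonld = {
    "@type": "Product",
    "name": "Ergonomic Desk Chair",
    "offers": {"@type": "Offer", "price": 499.00, "priceCurrency": "USD"},
}
visible_text = "The Ergonomic Desk Chair is on sale for $549.00."

structured_price = float(jsonld["offers"]["price"])
match = re.search(r"\$([\d,]+\.\d{2})", visible_text)
visible_price = float(match.group(1).replace(",", ""))

if structured_price != visible_price:
    print(f"Mismatch: JSON-LD says {structured_price}, page shows {visible_price}")
```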

Maintain Freshness

Update your dateModified properties when content changes. LLMs and AI systems increasingly consider recency. Stale dates signal stale content.

Schema.org as a Shared Ontology for AI

Schema.org defines over 800 types and thousands of properties. It was created by Google, Microsoft, Yahoo, and Yandex in 2011 to standardize structured data for search. But it has become something broader: a shared vocabulary for describing things on the web.

This matters for AI because:

  • LLMs already know it. Models trained on web data have seen millions of schema.org examples. They understand the types and properties natively.
  • It is domain-agnostic. Schema.org covers products, articles, events, people, places, medical entities, recipes, and more. It provides a common language across industries.
  • It is extensible. You can use schema.org types as a foundation and add domain-specific extensions when needed.
  • It provides relationships. Schema.org does not just describe individual entities. It describes how entities relate to each other. An Article has an author (a Person) who works for an Organization. These relationships help AI systems build knowledge graphs.
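Those relationships can be read off mechanically. A simplified sketch that flattens nested JSON-LD into subject-predicate-object triples (the article and person names are invented; a real extractor would also handle @id references and arrays):

```python
# Illustrative entity: an Article whose author works for an Organization.
article = {
    "@type": "Article",
    "headline": "Why Ergonomics Matter",
    "author": {
        "@type": "Person",
        "name": "Alex Rivera",  # invented name, for illustration only
        "worksFor": {"@type": "Organization", "name": "ErgoSit"},
    },
}

def triples(entity, subject=None):
    """Yield (subject, predicate, object) edges from nested JSON-LD."""
    subject = subject or entity.get("name") or entity.get("headline")
    for prop, value in entity.items():
        if isinstance(value, dict):
            child = value.get("name") or value.get("headline")
            yield (subject, prop, child)
            yield from triples(value, child)

graph = list(triples(article))
print(graph)
# [('Why Ergonomics Matter', 'author', 'Alex Rivera'),
#  ('Alex Rivera', 'worksFor', 'ErgoSit')]
```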

When you use schema.org, you are not just optimizing for Google. You are making your content interpretable by any system that understands this shared vocabulary.

Beyond Search: Agents, Actions, and Automation

Structured data is evolving beyond passive description. Schema.org defines Action types that describe operations a system can perform.

{
  "@context": "https://schema.org",
  "@type": "OrderAction",
  "target": {
    "@type": "EntryPoint",
    "urlTemplate": "https://example.com/api/order?product={productId}",
    "httpMethod": "POST"
  },
  "object": {
    "@type": "Product",
    "name": "Ergonomic Desk Chair"
  }
}

AI agents that can take actions on the web will look for exactly this kind of markup. Instead of trying to figure out which button to click, an agent can read the Action type and know the exact API endpoint to call.
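A sketch of that flow, assuming the agent has already resolved a product identifier; the simple string substitution stands in for full RFC 6570 URI-template expansion, and no request is actually sent:

```python
import json

# The OrderAction markup from the example above.
action = json.loads("""{
  "@type": "OrderAction",
  "target": {
    "@type": "EntryPoint",
    "urlTemplate": "https://example.com/api/order?product={productId}",
    "httpMethod": "POST"
  }
}""")

product_id = "chair-42"  # hypothetical identifier the agent resolved earlier
url = action["target"]["urlTemplate"].replace("{productId}", product_id)
method = action["target"]["httpMethod"]

# The agent now knows the exact call to make, with no button-hunting.
print(method, url)  # POST https://example.com/api/order?product=chair-42
```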

This is still early. But the direction is clear. Structured data will be how websites advertise their capabilities to AI agents. The sites that have clean, comprehensive structured data will be the ones AI agents can interact with reliably.

The Convergence of SEO and AI Readability

Traditionally, structured data was an SEO concern. You added it to get rich results in Google. The audience was search engine crawlers.

That audience is expanding. LLMs, AI agents, voice assistants, and RAG systems all benefit from the same structured data. The investment you make in schema.org markup pays dividends across all of these channels.

The practical implication is simple. If you are already doing structured data for SEO, you are already doing it for AI. The best practices are the same:

  • Use the correct types.
  • Include comprehensive properties.
  • Keep data accurate and up to date.
  • Validate regularly.

The difference is that AI systems are less forgiving of inconsistencies between your structured data and your visible content. A search engine might still show a rich result even if your markup has minor issues. An LLM that ingests contradictory signals may produce incorrect answers and attribute them to your site.

Accuracy is not optional. It is the foundation of being useful to both humans and machines.