Which Schema.org Properties AI Systems Extract Reliably

The Schema.org vocabulary contains hundreds of types and well over a thousand properties. AI systems ignore most of them. Knowing what is actually extracted, as opposed to what the specification allows, prevents wasted implementation effort.

The extraction funnel starts with detection, not parsing. AI systems must first detect that structured data exists. JSON-LD in script tags is detectable through simple pattern matching. Microdata embedded in HTML requires DOM parsing to discover. RDFa requires attribute parsing. Detection difficulty affects whether systems attempt extraction. JSON-LD wins because detection is trivial.
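A minimal sketch of why JSON-LD detection is cheap, assuming a crawler that works on raw HTML text. The regular expression and the find_jsonld helper are illustrative names, not taken from any particular AI system. Note that the Microdata in the sample page is invisible to this approach, because finding it would require walking the DOM.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks in raw HTML.
JSONLD_SCRIPT = re.compile(
    r"""<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>""",
    re.IGNORECASE | re.DOTALL,
)

def find_jsonld(html: str) -> list:
    """Return every JSON-LD payload in the page that parses as valid JSON."""
    found = []
    for match in JSONLD_SCRIPT.finditer(html):
        try:
            found.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Assumption: a lenient crawler skips malformed blocks silently.
            continue
    return found

page = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization", "name": "Example Co"}
</script>
</head>
<body><div itemscope itemtype="https://schema.org/Person">...</div></body></html>
"""

blocks = find_jsonld(page)
print(blocks)
```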

The type recognition layer determines the parsing approach. Schema.org defines a type hierarchy, but parsers typically implement handlers only for common types. Article, Product, Organization, Person, LocalBusiness, HowTo, FAQPage, and Recipe have dedicated handlers in most systems. Custom or obscure types may parse as a generic Thing, losing type-specific property extraction. Prefer a common type even when a more precise one exists.

The property whitelist explains extraction selectivity. Parsers don’t extract all properties; they extract properties their handlers expect. An Organization handler might extract name, url, logo, sameAs, and address. It probably ignores foundingDate, numberOfEmployees, and makesOffer. Handlers were built for common use cases. Properties outside common patterns fail extraction even if correctly implemented.
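The handler behavior described above can be sketched as a lookup table. Everything here is hypothetical: the HANDLERS table, the Thing fallback list, and the extract function are assumptions made for illustration, and real systems do not publish their whitelists.

```python
# Hypothetical handler table: type name -> properties that handler reads.
HANDLERS = {
    "Organization": ("name", "url", "logo", "sameAs", "address"),
    "Person": ("name", "url", "sameAs", "jobTitle"),
}
# Assumed fallback for types without a dedicated handler.
THING_PROPERTIES = ("name", "url", "description")

def extract(entity: dict) -> dict:
    """Keep only the properties the matching handler expects."""
    declared = entity.get("@type")
    known = isinstance(declared, str) and declared in HANDLERS
    handler_type = declared if known else "Thing"
    wanted = HANDLERS.get(handler_type, THING_PROPERTIES)
    result = {"@type": handler_type}
    for prop in wanted:
        if prop in entity:
            result[prop] = entity[prop]
    return result

org = {
    "@type": "Organization",
    "name": "Example Co",
    "url": "https://example.com/",
    "foundingDate": "2011-03-01",   # valid markup, but not on the whitelist
    "numberOfEmployees": 42,        # valid markup, but not on the whitelist
}
obscure = {"@type": "ResearchOrganization", "name": "Example Lab", "logo": "https://example.com/logo.png"}

extracted_org = extract(org)
extracted_obscure = extract(obscure)
print(extracted_org)
print(extracted_obscure)
```

In this sketch the correctly written foundingDate never reaches the output, and the less common type loses its logo because it falls back to the generic handler.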

The sameAs property achieves near-universal extraction because it solves a problem all AI systems have: entity disambiguation. Links to Wikidata, Wikipedia, official social profiles, and canonical references enable identity resolution. sameAs bridges your structured data to AI systems’ entity knowledge. This is among the highest-leverage single properties because it serves AI system needs directly.
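As an illustration, the snippet below builds Organization markup with a sameAs list and drops any entry that is not an absolute http(s) URL. The organization, its URLs, and the helper names are placeholders invented for this example; substitute the real profile and reference pages for your entity.

```python
import json
from urllib.parse import urlparse

def is_absolute_url(value: str) -> bool:
    """True for http(s) URLs that include a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def organization_jsonld(name: str, url: str, same_as: list) -> dict:
    """Build minimal Organization markup; relative sameAs links are dropped."""
    links = [link for link in same_as if is_absolute_url(link)]
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": links,
    }

markup = organization_jsonld(
    "Example Co",
    "https://example.com/",
    [
        "https://www.wikidata.org/wiki/Q_PLACEHOLDER",   # placeholder, not a real item
        "https://en.wikipedia.org/wiki/Example_Co",      # placeholder article
        "/about",                                        # relative link, filtered out
    ],
)
print(json.dumps(markup, indent=2))
```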

The nesting depth limitation affects complex structures. Single-level nesting (Organization with address as PostalAddress) extracts reliably. Two-level nesting (Organization with member containing Person with affiliation) often fails. Each nesting level requires handler recursion that parsers may not implement. Flatten deep structures. Create separate top-level entities linked by identifier rather than nested structures.
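One way to flatten is to give each entity an @id, hoist it into a top-level @graph, and leave only an {"@id": ...} reference where it used to be nested. The sketch below assumes every entity worth hoisting already carries an @id; nodes without one (the PostalAddress here) stay inline, which matches the single-level case. The flatten function and the example.com identifiers are illustrative.

```python
import json

def flatten(doc: dict) -> dict:
    """Hoist every nested node that has an @id into a top-level @graph."""
    nodes = {}  # @id -> merged node, in first-seen order

    def visit(value):
        if isinstance(value, list):
            return [visit(item) for item in value]
        if not isinstance(value, dict):
            return value
        node = {k: visit(v) for k, v in value.items() if k != "@context"}
        node_id = node.get("@id")
        if node_id is None:
            return node  # no identifier: keep inline
        if len(node) > 1:
            # Merge properties when the same @id appears more than once.
            nodes.setdefault(node_id, {}).update(node)
        return {"@id": node_id}

    root = visit(doc)
    graph = list(nodes.values())
    if "@id" not in root:
        graph.insert(0, root)  # root had no @id, so it was not hoisted
    return {"@context": doc.get("@context", "https://schema.org"), "@graph": graph}

ORG = "https://example.com/#org"
JANE = "https://example.com/#jane"
nested = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "@id": ORG,
    "name": "Example Co",
    "address": {"@type": "PostalAddress", "addressLocality": "Springfield"},
    "member": {
        "@type": "Person",
        "@id": JANE,
        "name": "Jane Smith",
        "affiliation": {"@type": "Organization", "@id": ORG, "name": "Example Co"},
    },
}

flat = flatten(nested)
print(json.dumps(flat, indent=2))
```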

The array handling inconsistency creates silent failures. For properties that accept arrays (multiple authors, multiple images), a system may extract every item, extract only the first item, or extract nothing once it detects an array. Different systems handle arrays differently. For critical properties, test array extraction specifically. Consider providing both the array format and an additional single-value property for redundancy.
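The three behaviors can be modeled as three small reader functions, which is handy for reasoning about what each kind of consumer would see in your markup. These readers are hypothetical stand-ins written for this example, not the code of any real parser.

```python
def read_all(value):
    """Tolerant reader: always returns a list of items."""
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]

def read_first(value):
    """Reader that keeps only the first item of an array."""
    items = read_all(value)
    return items[0] if items else None

def read_scalar_only(value):
    """Strict reader that gives up as soon as it sees an array."""
    return None if isinstance(value, list) else value

def extraction_report(entity: dict, prop: str) -> dict:
    """Show what each hypothetical reader would extract for one property."""
    value = entity.get(prop)
    return {
        "all": read_all(value),
        "first": read_first(value),
        "scalar_only": read_scalar_only(value),
    }

article = {
    "@type": "Article",
    "headline": "Example headline",
    "author": [
        {"@type": "Person", "name": "Jane Smith"},
        {"@type": "Person", "name": "John Doe"},
    ],
}

report = extraction_report(article, "author")
print(report)
```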

The value type strictness varies by property. Some properties expect specific formats: dates in ISO 8601, URLs as absolute URIs, numbers as numeric types not strings. Incorrect formats may cause parsing failures or incorrect value extraction. A date formatted as “January 15, 2024” instead of “2024-01-15” may fail date parsing and extract as text or not at all.
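A small normalizer run before the markup is emitted can catch this class of problem. The sketch below assumes English month names under the default locale and only the two human-readable formats listed; KNOWN_FORMATS and to_iso_date are names chosen for this example.

```python
from datetime import date, datetime

# Assumed input formats; extend to match what your CMS actually stores.
KNOWN_FORMATS = ("%B %d, %Y", "%d %B %Y")

def to_iso_date(text: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if unrecognized."""
    text = text.strip()
    try:
        # Already ISO 8601: round-trip to confirm it is a real calendar date.
        return date.fromisoformat(text).isoformat()
    except ValueError:
        pass
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

print(to_iso_date("January 15, 2024"))
print(to_iso_date("2024-01-15"))
```

Raising on unknown input, rather than passing the text through, keeps a malformed date from reaching the published markup unnoticed.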

The testing methodology for extraction requires probing AI responses. Implement structured data with known, specific values. Query AI systems about those values. If responses include your structured data values, extraction likely succeeded. If responses show different values or hedged statements, extraction may have failed. This empirical testing tells you more than validation tools that only check syntax.
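The comparison step of this test can be scripted once the responses have been collected by hand or through whatever client you use. The sketch below assumes a plain case-insensitive substring match, which is crude: it cannot tell whether a value came from the markup or from visible page text. The probe_result function and the sample response are invented for illustration.

```python
def probe_result(response_text: str, expected: dict) -> dict:
    """Map each property to True if its expected value appears in the response."""
    text = response_text.casefold()
    return {prop: str(value).casefold() in text for prop, value in expected.items()}

# Values published in the structured data (placeholders for this example).
expected = {
    "founder": "Jane Smith",
    "foundingDate": "2011",
    "telephone": "+1-555-0100",
}

# A response pasted in from an AI system (invented sample text).
response = "Example Co was founded in 2011 and is led by Jane Smith."

result = probe_result(response, expected)
missing = [prop for prop, seen in result.items() if not seen]
print(result)
print("not reflected in the response:", missing)
```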

The maintenance failure mode emerges from implementation success. Structured data implemented correctly at launch degrades as content changes without corresponding schema updates. The schema says the CEO is “Jane Smith” while the page says the new CEO is “John Doe.” Contradictions create extraction uncertainty. Either keep schema and content synchronized, or generate the schema from the content rather than maintaining static markup.
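Generating the schema from content can be as simple as rendering the visible text and the JSON-LD from the same record, so the two cannot drift apart. This is a sketch under assumptions: the record fields, the render_about_page function, and the choice of the employee property to express the CEO are all illustrative.

```python
import html
import json

def render_about_page(record: dict) -> str:
    """Render visible HTML and JSON-LD from one source record."""
    jsonld = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": record["name"],
        "url": record["url"],
        "employee": {"@type": "Person", "name": record["ceo"], "jobTitle": "CEO"},
    }
    # Escape "<" so a value can never close the script element early.
    payload = json.dumps(jsonld).replace("<", "\\u003c")
    return (
        f"<h1>{html.escape(record['name'])}</h1>\n"
        f"<p>CEO: {html.escape(record['ceo'])}</p>\n"
        f'<script type="application/ld+json">{payload}</script>'
    )

record = {"name": "Example Co", "url": "https://example.com/", "ceo": "John Doe"}
page = render_about_page(record)
print(page)
```

Updating the ceo field in the record changes both the paragraph and the markup in one step.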

The priority implementation order maximizes return on effort: first, identity properties (@type, name, url) for entity recognition; second, sameAs for disambiguation; third, domain-specific properties with known handler support; fourth, comprehensive properties if resources allow. Most AI visibility benefit comes from the first two tiers. Additional properties provide diminishing returns.

The future extraction expansion creates a strategic consideration. As AI systems evolve, extraction may expand to currently ignored properties. Implementing comprehensive structured data now positions you for future extraction even if the current benefit is limited. Balance current ROI against future optionality based on your resource constraints.
