Case Study 3
Evidence-backed. AI-adjudicated. Human-reviewed.

Case Study: SEO Audit of Stripe.com

A real deep-audit benchmark on Stripe that preserves the Ahrefs and HubSpot validation method, keeps schema drift and crawl artifacts visible, and publishes the final adjudicated dataset instead of just the report narrative.

This page doubles as a real SEO audit example based on a deep audit of Stripe. It remains a benchmark case study first: the point is to show what the pipeline found, what held up under adjudication, and where the result stays conservative.

If you are looking for an SEO audit report example, this benchmark shows the methodology, findings, evidence trail, and the limitations that keep the final read publication-safe.

Publishing note

These public benchmark pages and samples are publishing artifacts, not anonymized customer stories. They exist so buyers can inspect the output quality before more rollout proof is published.

What This Means for Your Company

Stripe scored highest in the benchmark at 81, yet the stored run recorded zero structured data across the 30 crawled pages. Even best-in-class technical infrastructure cannot compensate for missing schema markup.

The metadata clarity score of 40 is the lowest across all three benchmarked sites, reducing how accurately AI engines can index and extract information from Stripe's pages.

Missing publication dates on newsroom content reduce credibility as a citable source for time-sensitive queries. This is a low-effort, high-impact fix.

GEO Score

81

Overall AI visibility readiness out of 100.

Strict precision

0.588

Fully correct findings only: 20 correct out of 34 scored findings.

Weighted precision

0.765

Fully correct findings plus half credit for partials: 26 weighted correct out of 34.

Operational FP share

0.059

Outright wrong findings among scored findings: 2 of 34.

GEO Dimension Scores

Dimension	Score	Interpretation
Authority Signals	88	Strong brand presence, internal linking, and Core Web Vitals
Entity Clarity	82	Brand is well-recognized but lacks formal entity typing
Answerability	80	Content answers most query types effectively
Extractability Index	78	Content parseable but some pages have excessive UI elements
Citation Readiness	75	Cited frequently but missing schema reinforcement

GEO Audit Key Findings

Highest GEO Score in the benchmark series at 81, the strongest overall generative-engine readiness
Zero structured data markup recorded across all 30 crawled pages, the same gap as Ahrefs and HubSpot
Metadata poorly optimized at 40/100: title tags exceed recommended lengths and lack keyword specificity
Technical performance strong with three perfect 100 scores on technical signals
Newsroom content lacks freshness markers: no visible publication dates or author bylines

What this means for your site

See how your site compares to Stripe

Turn curiosity from the benchmark into a real audit on your own site. The same workflow gives your team a scored report, linked evidence, and a prioritized implementation handoff you can actually ship.

Executive Summary

This benchmark asked a narrow question: when the Citemeter GEO audit pipeline emits a finding on Stripe, how often is that finding correct enough to trust? The answer on this run is materially stronger than HubSpot, but still more cautious than Ahrefs because the Stripe audit surfaced both schema drift and a crawl-graph artifact.

The authoritative run was a deep audit against stripe.com, using the real app, local screenshot worker, bounded crawl, screenshots, LLM analysis, simulated query testing, report generation, evidence packaging, verification, adjudication, and final dataset merge.

Why This Benchmark Matters

Stripe combines product, pricing, support, newsroom, customer-story, and enterprise-marketing surfaces on one public site. That makes it a useful stress test for crawl fidelity, evidence traceability, structured-data detection, and interpretation-heavy LLM findings.

The Stripe benchmark also matters because it preserves uncomfortable evidence. The final result documents both the things the pipeline got right and the places where stored run-time evidence no longer matches the current live site.

Test Setup

The target was stripe.com under the deep audit package, using the same app-driven audit, screenshot, evidence, verification, adjudication, and publish flow used in the Ahrefs and HubSpot benchmarks.
The authoritative benchmark run used a bounded crawl of 30 pages, 12 simulated queries, and 5 stored screenshots.
Artifacts include the report PDF, evidence ZIP, storage downloads, DB exports, verification outputs, adjudication outputs, and final adjudicated dataset staged under the Stripe case-study package.
Stripe is a harder benchmark than Ahrefs because it produced both stored/live schema drift and a seed-page orphan-page artifact in the same run.

Validation Workflow

Audit pipeline: the system crawled the site, captured screenshots, extracted content, ran analysis, simulated query responses, and stored the resulting artifacts.
Codex verification: all reviewable findings were checked against stored exports, extracted evidence artifacts, perf JSON, and current live page source.
Format-compatible adjudication layer: the benchmark preserves the Ahrefs and HubSpot adjudication artifact format and review standard.
Manual sign-off queue: stored/live schema contradictions and interpretation-heavy additional findings were retained explicitly rather than being flattened into certainty.

The repository does not expose a standalone repo-native Claude runner, so the adjudication layer was produced in the same adjudication artifact family and review standard used for the earlier benchmarks.

Final Findings Overview

Classification	Count	Meaning
TP	20	Fully correct findings where the emitted issue still matched the stored and current evidence.
FP	2	Findings that did not survive review, most clearly around the locale-home orphan claim and page-differentiation claim.
Partial	12	Findings with a real factual core that overstated scope, certainty, or current live applicability.
Excluded	3	Findings removed from scoring because they describe pipeline artifacts or audit-coverage limits rather than fair site defects.

The highest-confidence confirmations were deterministic metadata issues, the absence of FAQPage schema across the crawled sample, stable no-schema pages, and preserved mobile perf measurements. The biggest misses were the seeded-homepage orphan claim and one support-page positioning claim.

Metrics

Strict precision = 0.588 (20 of 34 scored findings). This counts only fully correct findings. Partials earn no credit here, including the conservative partials where run-time schema evidence no longer matches the current live site.

Weighted precision = 0.765. This gives half credit to partial findings, which better reflects directionally useful outputs that overstated scope or certainty.

Operational FP share = 0.059 (2 of 34 scored findings). This is the share of scored findings that were outright wrong, and it is the most decision-useful error measure available in a single-site benchmark.
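For readers who want to reproduce the three numbers, the arithmetic can be sketched from the adjudication counts in the Final Findings Overview table. This is an illustration only: the class and function names are hypothetical and are not taken from the Citemeter codebase. The one rule it relies on, that excluded findings stay out of the denominator, comes from the table itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdjudicationCounts:
    """Hypothetical container for the counts in the findings table."""
    tp: int        # fully correct findings
    fp: int        # findings that did not survive review
    partial: int   # real factual core, overstated scope or certainty
    excluded: int  # removed from scoring (pipeline artifacts, coverage limits)

    @property
    def scored(self) -> int:
        # Excluded findings never enter the denominator.
        return self.tp + self.fp + self.partial


def strict_precision(c: AdjudicationCounts) -> float:
    # Only fully correct findings count.
    return c.tp / c.scored


def weighted_precision(c: AdjudicationCounts, partial_credit: float = 0.5) -> float:
    # Partials earn half credit by default.
    return (c.tp + partial_credit * c.partial) / c.scored


def operational_fp_share(c: AdjudicationCounts) -> float:
    # Outright wrong findings among scored findings.
    return c.fp / c.scored


stripe = AdjudicationCounts(tp=20, fp=2, partial=12, excluded=3)
print(round(strict_precision(stripe), 3))      # 0.588
print(round(weighted_precision(stripe), 3))    # 0.765
print(round(operational_fp_share(stripe), 3))  # 0.059
```

The weighted numerator works out to 20 + 0.5 × 12 = 26, which matches the "26 weighted correct out of 34" figure reported above.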

SEO Audit of Stripe.com

The target was stripe.com under the deep audit package for the authoritative benchmark run, using a bounded same-origin crawl rather than an unbounded sweep of the whole property.
Coverage included page capture, screenshots, metadata and structured-data checks, LLM analysis, simulated query testing, packaged report output, and preserved evidence exports.
Stripe is a strong benchmark because it mixes pricing, support, newsroom, customer-story, and enterprise-marketing surfaces that can expose weak heuristics and weak evidence chains.
This is still a bounded deep audit of one public site and one authoritative run, not a universal claim about every Stripe page or about how every technical SEO audit should score.

What Our Audit Found

Metadata-length findings remained highly stable and still match the current live source exactly.
The strongest schema confirmations were on pricing, sales-contact, and customer-story pages that still expose no JSON-LD in the current live source.
The homepage and four newsroom pages now expose Organization or NewsArticle JSON-LD, so those findings remain conservative partials rather than being forced into cleaner categories.
The sales-contact page is clearly long and qualification-heavy, but the stronger conversion-friction conclusion remains interpretive.
Customer-story freshness remains a credible weakness in the sample because the checked live pages still did not surface visible publication dates or bylines.
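
The schema findings above rest on whether a page's source exposes JSON-LD. The sketch below shows one way to check that with the Python standard library. It is not the pipeline's actual detector, and the class and function names are hypothetical. It reads only `application/ld+json` script blocks and ignores Microdata and RDFa.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every application/ld+json script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def jsonld_types(html: str) -> set[str]:
    """Return the set of @type values declared in a page's JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    types: set[str] = set()
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: treat as no usable schema
        nodes = doc if isinstance(doc, list) else [doc]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            # Also look one level into an @graph container.
            for item in [node, *node.get("@graph", [])]:
                declared = item.get("@type") if isinstance(item, dict) else None
                if isinstance(declared, str):
                    types.add(declared)
                elif isinstance(declared, list):
                    types.update(t for t in declared if isinstance(t, str))
    return types


sample = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Organization", "name": "Example"}'
    "</script></head><body></body></html>"
)
print(jsonld_types(sample))  # {'Organization'}
```

An empty result for a page corresponds to the "no JSON-LD in the current live source" condition described for the pricing, sales-contact, and customer-story pages.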

Why This Works as an SEO Audit Example

It uses a real public site rather than a toy sample, so the findings and caveats look like the tradeoffs teams actually face in production audits.
It preserves the evidence chain from page and screenshot records through findings, recommendations, and packaged report artifacts.
It reports TP, FP, Partial, and Excluded outcomes instead of flattening the benchmark into a marketing-only win.
It keeps the uncomfortable details visible, including schema drift and the seeded-homepage orphan-page artifact.

What an Audit Report Should Include

Scope: define the target, audit package, crawl boundaries, and what the report does not claim to cover.
Evidence: show the screenshots, exports, and traceability chain behind the findings.
Prioritization: separate fully supported issues from partials, false positives, and exclusions.
Methodology: explain how the audit ran, how findings were reviewed, and how adjudication changed the final read.
Limitations: keep stale-query, schema-drift, and crawl-graph caveats visible.
Next actions: tell the reader what to validate next and what could still move the benchmark.

You can compare this case study against the sample report and sample evidence index to see how the public narrative and the artifact package fit together.

What This Means for Users

The product was most trustworthy on deterministic metadata, schema, and perf-backed findings.
It was weaker on broad recommendation bundles and interpretation-heavy copy judgments.
Stripe is now fully publishable in the app, but only with the stored/live schema caveats and seed-page orphan artifact preserved in the narrative.

Key Patterns from the Benchmark

Deterministic metadata checks remained highly stable on Stripe.
Stored/live schema drift affected the homepage and newsroom pages, not the pricing, sales-contact, or customer-story pages.
The LLM-mediated layer was directionally useful, but recommendation bundles and copy judgments needed more narrowing than the deterministic layer.
The evidence chain stayed strong enough to make both the correct findings and the contradictions legible.

Limitations

Query simulation remains a simulated AI-answer surface, not a live-engine citation measurement.
signal:ai_citation_frequency:1 was excluded because query rows existed while the signal snapshot still reported total_queries = 0.
signal:internal_link_authority:1 appears to be a seed-page crawl-graph artifact, not a true orphan-page defect.
Homepage and newsroom schema findings remain provisional because current live pages now expose JSON-LD that the stored run did not capture.
This is one benchmark site and one bounded run. It does not prove cross-site generalization.

The main unresolved caution is stored/live schema contradiction: the current locale homepage and multiple newsroom pages now expose JSON-LD that the stored benchmark run recorded as absent. A separate seed-page crawl artifact also remains visible: the locale homepage was once flagged as orphaned even though it was the seeded entry page for the crawl.
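
The stored/live contradiction can be made explicit by comparing the schema types recorded at run time with the types found in a later live check. The sketch below is a hypothetical illustration, not the benchmark's adjudication logic. The page labels, outcome names, and the type assigned to each page are assumptions chosen to mirror the pattern described in this case study.

```python
def classify_schema_findings(
    stored: dict[str, set[str]],
    live: dict[str, set[str]],
) -> dict[str, str]:
    """Label each page by how its stored schema types compare to the live check."""
    outcomes = {}
    for page, stored_types in stored.items():
        live_types = live.get(page, set())
        if stored_types == live_types:
            outcomes[page] = "stable"  # stored and live evidence agree
        elif not stored_types and live_types:
            outcomes[page] = "drift_schema_added"  # live page now exposes schema
        else:
            outcomes[page] = "drift_other"  # types changed or were removed
    return outcomes


# Illustrative inputs only: labels stand in for real URLs.
stored_run = {"homepage": set(), "pricing": set(), "newsroom-article": set()}
live_check = {
    "homepage": {"Organization"},
    "pricing": set(),
    "newsroom-article": {"NewsArticle"},
}

outcomes = classify_schema_findings(stored_run, live_check)
print(outcomes)
```

Under this labelling, a "stable" page supports a clean confirmation, while a "drift_schema_added" page is the kind of finding this benchmark kept as a conservative partial, because no run-time HTML snapshot exists to show which side was right at crawl time.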

What Could Change These Results

Re-running Stripe after fixing query-signal ordering would remove one standing exclusion.
Preserving run-time HTML snapshots would allow the homepage and newsroom schema findings to move from conservative partials into cleaner categories.
A future crawl-graph fix could prevent seed pages from being misclassified as orphaned.
Narrower additional-finding prompts could reduce interpretation-heavy partials.
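
The crawl-graph fix mentioned above could take roughly the following shape. This is a minimal sketch under stated assumptions, not the pipeline's implementation. It assumes the crawl is available as a mapping from each page to its outgoing same-origin links, and the function name and example paths are hypothetical. A page with no inbound links is reported as orphaned only if it was not a seeded entry point.

```python
def find_orphans(links: dict[str, set[str]], seeds: set[str]) -> set[str]:
    """Return crawled pages with no inbound links, excluding crawl seeds."""
    linked_to: set[str] = set()
    for source, targets in links.items():
        # Self-links do not count as inbound links.
        linked_to.update(t for t in targets if t != source)
    return {page for page in links if page not in linked_to and page not in seeds}


# Hypothetical crawl graph. "/legacy" stands for a page discovered some
# other way (for example a sitemap) that nothing in the crawl links to.
crawl = {
    "/": {"/pricing", "/newsroom"},
    "/pricing": {"/newsroom"},
    "/newsroom": set(),
    "/legacy": set(),
}

print(sorted(find_orphans(crawl, seeds=set())))  # ['/', '/legacy']
print(sorted(find_orphans(crawl, seeds={"/"})))  # ['/legacy']
```

Without the seed exclusion, the entry page is flagged simply because nothing in the bounded crawl links back to it, which is the artifact this benchmark observed on the locale homepage.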

Evidence Appendix

The technical appendix is preserved in the repo as docs/case-studies/stripe-accuracy-validation-evidence.md. The buyer-facing page keeps the proof chain legible without exposing internal run identifiers or storage paths inline.

The strongest reviewer-facing chain in this benchmark runs from stored page and signal rows, to the downloaded evidence archive and screenshots, to verification outputs, to the adjudication record, and finally to the merged final dataset.

Technical validation details

Internal validation materials retain the authoritative run identifiers, exported datasets, packaged downloads, and reviewer workpapers so the benchmark can be replayed without exposing raw engineering artifacts on the buyer-facing page.

Run scope

App-driven deep audit with bounded crawl coverage, screenshots, extracted content, simulated queries, and packaged deliverables.

Evidence retention

Internal validation materials keep the run identifiers, exported datasets, and packaged downloads together for reviewer replay.

Review chain

Verification notes, adjudication decisions, and final scoring outputs were preserved before publication.

Conclusion

The publication-safe conclusion is that the Stripe benchmark supports moderate credibility on this site and run. The pipeline remained strongest on deterministic metadata, schema, and perf-backed findings, and materially weaker on broad recommendation bundles, strategy phrasing, and outputs affected by stored/live schema drift.
