Benchmark Series Summary: GEO Audit Accuracy Across Ahrefs, HubSpot, and Stripe
This is a 3-site benchmark series, not universal proof across the web. It exists so teams can evaluate whether the workflow is credible enough to use on their own domains, compare against other tooling, and understand the caveats before they move into the product flow.
The same bounded deep-audit workflow, evidence capture, Codex verification, adjudication layer, and human review standard were applied across Ahrefs, HubSpot, and Stripe.
For technical buyers and advanced marketers, this serves as public benchmark proof: strict precision, weighted precision, and operational FP share are reported together so the series cannot cherry-pick the most flattering number.
These public benchmark pages and samples are publishing artifacts, not anonymized customer stories. They exist so buyers can inspect the output quality before more rollout proof is published.
- Strict precision: 0.594 (fully correct findings only; 60 true positives across 101 scored findings)
- Weighted precision: 0.757 (true positives plus half credit for partials; 76.5 weighted-correct findings out of 101)
- Operational FP share: 0.079 (outright wrong findings among scored findings; 8 of 101)
- Scored findings: 101 (aggregate denominator after seven exclusions were set aside as outside fair scoring)
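The arithmetic behind the three headline metrics can be reproduced from the adjudicated counts. The following is a minimal sketch, not the benchmark's actual scoring code: the `Tally` class and its method names are hypothetical, and the counts (60 TP, 8 FP, 33 Partial, 7 Excluded) are the aggregate figures reported in this series.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    """Adjudicated finding counts for one benchmark run (hypothetical structure)."""
    tp: int
    fp: int
    partial: int
    excluded: int = 0

    @property
    def scored(self) -> int:
        # Excluded findings are reported but stay out of the denominator.
        return self.tp + self.fp + self.partial

    def strict_precision(self) -> float:
        # Only fully correct findings count.
        return self.tp / self.scored

    def weighted_precision(self, partial_credit: float = 0.5) -> float:
        # Partials earn half credit by default.
        return (self.tp + partial_credit * self.partial) / self.scored

    def operational_fp_share(self) -> float:
        # Outright wrong findings as a share of scored findings.
        return self.fp / self.scored


aggregate = Tally(tp=60, fp=8, partial=33, excluded=7)
print(f"scored   = {aggregate.scored}")
print(f"strict   = {aggregate.strict_precision():.3f}")
print(f"weighted = {aggregate.weighted_precision():.3f}")
print(f"fp share = {aggregate.operational_fp_share():.3f}")
```

Run as written, this prints 101, 0.594, 0.757, and 0.079, matching the published figures. Keeping `excluded` on the record while leaving it out of `scored` mirrors how the series reports exclusions without scoring them.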
Executive Summary
Across the first three benchmark sites, the GEO audit system was generally credible and useful. The bigger quality issue was not outright wrong findings but partial findings: outputs that captured a real issue direction but overstated scope, certainty, or current live applicability.
The operational false-positive share remained low on all three sites. Deterministic findings such as metadata checks, stable schema absence, and preserved performance evidence held up better than interpretation-heavy findings that depended on excerpt completeness or wording.
Sites Included
- Ahrefs: A large public SEO software site with product, editorial, and comparison surfaces that stress deterministic checks and evidence traceability.
- HubSpot: A schema-drift-heavy SaaS benchmark that exposes where stored artifacts and current live markup no longer line up cleanly.
- Stripe: A mixed product, pricing, support, newsroom, and customer-story benchmark that adds crawl-graph artifacts and newsroom schema drift.
Methodology
- Deep audit: each site used the deep audit package with a bounded crawl rather than an unbounded sweep of the entire domain.
- Evidence capture: the pipeline preserved screenshots, extracted content, DB exports, report artifacts, and evidence packages for later review.
- Codex verification: emitted findings were checked against stored artifacts plus current live spot checks where useful.
- Adjudication layer: ambiguous findings were classified conservatively as TP, FP, Partial, or Excluded rather than being rewritten into certainty.
- Human sign-off where needed: interpretation-heavy findings and unresolved contradictions remained visible in the final publication output.
- Exclusions preserved: query-signal stale exclusions and other pipeline-level caveats were excluded from scoring instead of being buried.
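The adjudication and exclusion steps above can be pictured as a per-finding verdict record plus a tally that keeps exclusions visible but unscored. This is an illustrative sketch only: `Verdict`, `Finding`, `tally`, the finding IDs, and the notes are all hypothetical and are not taken from the actual pipeline.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    TP = "TP"
    FP = "FP"
    PARTIAL = "Partial"
    EXCLUDED = "Excluded"


@dataclass(frozen=True)
class Finding:
    finding_id: str  # hypothetical identifier
    verdict: Verdict
    note: str = ""   # reviewer rationale, kept for publication


def tally(findings):
    """Count verdicts; exclusions are counted but left out of the scored total."""
    counts = Counter(f.verdict for f in findings)
    scored = sum(n for verdict, n in counts.items() if verdict is not Verdict.EXCLUDED)
    return counts, scored


# Made-up sample records, for illustration only.
sample = [
    Finding("f-001", Verdict.TP),
    Finding("f-002", Verdict.PARTIAL, "real issue, overstated scope"),
    Finding("f-003", Verdict.EXCLUDED, "stale query-signal row"),
    Finding("f-004", Verdict.FP),
]
counts, scored = tally(sample)
print(scored, counts[Verdict.EXCLUDED])
```

Treating Partial as its own verdict, rather than rounding it up to TP or down to FP, is what lets strict and weighted precision be reported side by side.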
Per-Site Results
Aggregate Results
- Total findings: 108
- Scored findings: 101
- TP / FP / Partial / Excluded: 60 / 8 / 33 / 7
- Aggregate metrics: Strict 0.594 | Weighted 0.757 | FP share 0.079
What the Results Mean
- The system is generally credible and useful across this benchmark series, especially when findings are grounded in deterministic evidence capture.
- False-positive share stayed low across all three sites, which matters operationally because outright wrong findings were limited relative to scored output volume.
- The bigger quality issue is partial findings, not rampant wrong findings. Many outputs had a real factual core but overstated scope, certainty, or live applicability.
- Deterministic findings such as metadata, stable schema absence, and preserved performance evidence were more reliable than interpretation-heavy findings.
Cross-Site Patterns
- Stored/live schema contradictions occurred across sites because the pipeline does not preserve runtime HTML snapshots.
- Query-signal stale exclusions are systemic pipeline-level caveats and were excluded from scoring.
- Content-heavy sites produced more Partial findings because interpretation and excerpt completeness mattered more.
- Operational FP share stayed low across all three sites.
- Strict precision varied more than weighted precision because ambiguity drove Partial classifications rather than pure false positives.
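To make the stored/live schema contradiction concrete, the sketch below compares the JSON-LD `@type` values found in two HTML documents. It assumes a rendered HTML snapshot had been saved at audit time, which the pipeline described here does not do; the function names are hypothetical, and nested `@graph` structures are deliberately not handled.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def schema_types(html):
    """Return the set of top-level JSON-LD @type values declared in the HTML."""
    parser = JsonLdCollector()
    parser.feed(html)
    parser.close()
    types = set()
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of guessing
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            declared = item.get("@type")
            if isinstance(declared, str):
                types.add(declared)
            elif isinstance(declared, list):
                types.update(t for t in declared if isinstance(t, str))
    return types


def schema_drift(stored_html, live_html):
    """Report schema types present in only one of the two documents."""
    stored, live = schema_types(stored_html), schema_types(live_html)
    return {"only_stored": sorted(stored - live), "only_live": sorted(live - stored)}


# Made-up HTML fragments, for illustration only.
stored = '<script type="application/ld+json">{"@type": "Organization"}</script>'
live = ('<script type="application/ld+json">'
        '[{"@type": "Organization"}, {"@type": "FAQPage"}]</script>')
drift = schema_drift(stored, live)
print(drift)
```

A non-empty `only_stored` or `only_live` list is the kind of signal that would push a schema finding toward Partial rather than TP, since the stored evidence and the live page no longer agree.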
Limitations
- This is a 3-site benchmark series, not universal proof across the web.
- Operational FP share is not a classical false-positive rate because there is no enumerated true-negative universe.
- Runtime HTML was not preserved, which is why stored/live schema contradictions remain unresolved in some cases.
- Pipeline-level exclusions such as stale query-signal rows were excluded from scoring rather than normalized away.
- The series is strong enough to publish as a benchmark program, but not a basis for universal claims about all sites or all audit tools.
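The distinction between operational FP share and a classical false-positive rate comes down to the denominator. The sketch below is only an illustration with hypothetical function names: the first function needs nothing beyond adjudicated findings, while the second needs a true-negative count that this benchmark never enumerated.

```python
def operational_fp_share(tp, fp, partial):
    """FP divided by scored findings; uses only emitted, adjudicated findings."""
    return fp / (tp + fp + partial)


def false_positive_rate(fp, tn):
    """Classical FPR = FP / (FP + TN); requires a count of true negatives."""
    if tn is None:
        raise ValueError("true negatives were never enumerated, so FPR is undefined")
    return fp / (fp + tn)


share = operational_fp_share(tp=60, fp=8, partial=33)
print(f"operational FP share = {share:.3f}")

try:
    false_positive_rate(fp=8, tn=None)
except ValueError as err:
    print(f"FPR unavailable: {err}")
```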
Links to Individual Benchmarks
Ahrefs
A large public SEO software site with product, editorial, and comparison surfaces that stress deterministic checks and evidence traceability.
Read the benchmark
HubSpot
A schema-drift-heavy SaaS benchmark that exposes where stored artifacts and current live markup no longer line up cleanly.
Read the benchmark
Stripe
A mixed product, pricing, support, newsroom, and customer-story benchmark that adds crawl-graph artifacts and newsroom schema drift.
Read the benchmark
GEO Scores
In addition to accuracy validation, each audit produced a GEO Score measuring overall AI visibility readiness. The full GEO score analysis, signal comparison, and cross-site findings are available in the research study.
| Site | GEO Score |
|---|---|
| Stripe | 81 |
| Ahrefs | 77 |
| HubSpot | 76 |
Conclusion
This benchmark series is strong enough to publish as a benchmark program and as a product proof asset for evaluators who want evidence instead of narrative alone.
It is not yet a basis for universal claims about all sites, all audit categories, or all SEO and GEO workflows.