Short answer: you can’t always discover the exact documents a closed LLM was trained on, but you can (1) audit everything public about your brand, (2) monitor & search the web where models harvest data, (3) probe models and retrieval systems to see what sources they cite or retrieve, and (4) use contractual, technical and takedown options to limit or remove content. Below is a practical, ordered plan you can run right now.
- Do an immediate inventory of public content
- Crawl and index your own properties (website, blogs, docs, help center, product pages, press releases) and export the text (see the crawl sketch after this list).
- Collect third‑party references: partner pages, distributors, affiliates, guest posts, press coverage.
- Gather social posts, forum threads, reviews, FAQ entries, job listings, public spreadsheets, and images.
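If you want to script the export step, here is a minimal sketch. It assumes your site exposes a standard, flat sitemap.xml and that requests + beautifulsoup4 fit your stack; swap in your own crawler or a CMS export if not.

```python
# Minimal sketch: export the text of your own pages for later matching.
# Assumes a flat sitemap.xml of page URLs (not a sitemap index); adjust as needed.
import pathlib
from xml.etree import ElementTree

import requests
from bs4 import BeautifulSoup

SITEMAP_URL = "https://example.com/sitemap.xml"  # replace with your domain
OUT_DIR = pathlib.Path("brand_corpus")
OUT_DIR.mkdir(exist_ok=True)

# Pull page URLs out of the sitemap (namespace-agnostic tag match).
sitemap = ElementTree.fromstring(requests.get(SITEMAP_URL, timeout=30).content)
urls = [el.text for el in sitemap.iter() if el.tag.endswith("loc")]

for i, url in enumerate(urls):
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
    (OUT_DIR / f"page_{i:04d}.txt").write_text(f"{url}\n{text}", encoding="utf-8")
```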
- Search the web where models get training data
- Use search engine queries to find mentions:
- site:example.com "YourBrand"
- "YourBrand" review
- "YourBrand" leak OR "YourBrand" password OR "YourBrand" API key
- Check:
- News sites, blogs, aggregator sites
- Forums and Q&A (Reddit, StackExchange, product forums)
- Review sites (Trustpilot, G2, Yelp)
- Public code repositories and paste sites (GitHub, GitLab, Pastebin)
- Web archives (Wayback Machine; a snapshot-check sketch follows this list)
- Set up continuous alerts: Google Alerts, Talkwalker Alerts, Mention.
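For the web-archive check, here is a small sketch against the Wayback Machine's public availability endpoint; the URL list is a placeholder you would fill from your inventory.

```python
# Sketch: check whether specific pages have archived snapshots in the Wayback Machine.
# Uses the public availability API at https://archive.org/wayback/available.
import requests

urls_to_check = [  # placeholders: fill from your content inventory
    "https://example.com/old-support-article",
    "https://example.com/press/2019-announcement",
]

for url in urls_to_check:
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=30)
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if closest:
        print(f"{url} -> snapshot: {closest['url']} ({closest['timestamp']})")
    else:
        print(f"{url} -> no snapshot found")
```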
- Use social listening and backlink tools
- Tools (free/paid): Google Search Console, Bing Webmaster, Ahrefs, SEMrush, Moz, Mention, Brandwatch, Meltwater, Talkwalker.
- What to look for: backlinks that copy large text blocks, forum threads with detailed info, unattributed transcripts or manuals.
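One lightweight way to flag copied text blocks is to compare word n-gram "shingles" between one of your pages and a suspect page; a sketch (the shingle size and threshold are illustrative, not tuned values):

```python
# Sketch: flag third-party pages that lift large text blocks from your pages
# by comparing word 5-gram "shingles". Size and threshold are illustrative.
import re

def shingles(text: str, n: int = 5) -> set:
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def copy_score(your_text: str, their_text: str) -> float:
    a, b = shingles(your_text), shingles(their_text)
    return len(a & b) / max(1, len(b))  # fraction of their page that overlaps yours

# Placeholder file names: one page from your own crawl, one suspect page.
yours = open("brand_corpus/page_0000.txt", encoding="utf-8").read()
theirs = open("candidate_page.txt", encoding="utf-8").read()
if copy_score(yours, theirs) > 0.2:
    print("Large overlap; review for attribution or takedown.")
```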
- Probe models and retrieval systems
- For public LLMs that provide citations, ask them to cite sources for statements about your brand and save those citations.
- For closed models (no citations), use:
- Prompt probing (“What sources do you know about [YourBrand]? List URLs or publication names.”). Treat any sources it lists as leads to verify, not evidence; models frequently invent plausible-looking URLs.
- Collect model outputs on specific factual claims, then search exact phrases from those outputs in quotes to find the origin pages (see the sketch after this list).
- If a vendor uses RAG (retrieval-augmented generation) in your product, request the retrieval logs / source list; vendors can often provide them.
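Here is a small sketch of the exact-phrase tracing step: split a saved model output into sentences and print the distinctive ones as quoted queries you can paste into a search engine (the file name and length cutoffs are placeholders).

```python
# Sketch: turn a saved model output into quoted search queries for origin tracing.
# "model_output.txt" is a placeholder; the sentence-length cutoffs are arbitrary.
import re

model_output = open("model_output.txt", encoding="utf-8").read()

sentences = re.split(r"(?<=[.!?])\s+", model_output)
distinctive = [s.strip() for s in sentences if 8 <= len(s.split()) <= 20]

for s in distinctive[:10]:  # start with the ten most promising phrases
    print(f'"{s}"')         # paste as an exact-phrase search query
```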
- Use similarity / provenance tracing with embeddings
- Create embeddings of:
- Public pages about your brand (your crawl + likely sources).
- Problematic model output text.
- Run nearest‑neighbor search (cosine similarity) to find which page a model output most closely matches. Useful tooling: sentence-transformers and FAISS (open source), or a vector database such as Weaviate (open source) or Pinecone (managed). This technique often identifies the likely source page even if the model doesn’t cite it.
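A minimal sketch of that matching step using sentence-transformers and FAISS; the embedding model name and the brand_corpus/ folder are placeholder choices, not requirements.

```python
# Sketch: find which of your public pages a model output most closely matches.
# Requires: pip install sentence-transformers faiss-cpu
import pathlib

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

pages = sorted(pathlib.Path("brand_corpus").glob("*.txt"))  # from your crawl/export
page_texts = [p.read_text(encoding="utf-8") for p in pages]

# Normalized embeddings + inner product gives cosine similarity.
page_vecs = model.encode(page_texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(page_vecs.shape[1])
index.add(page_vecs)

query = "Problematic sentence produced by the model about YourBrand."  # placeholder
query_vec = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(query_vec, 3)

for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {pages[i].name}")
```

Pages that score well above the rest are your best candidates for the likely source.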
- Look for common public corpora
- Many models are trained on large public crawls (Common Crawl), public social content, Wikipedia, and public code repositories. If content about your brand appears in those sources, it has likely entered training corpora; you can also check the Common Crawl index directly (see the sketch below).
- Note: you usually cannot prove a model used a specific document unless the provider shares training metadata.
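You can at least check whether your pages appear in a Common Crawl index via its public CDX endpoint; a sketch (the crawl label is an example, current labels are listed at https://index.commoncrawl.org/, and presence in a crawl shows only that the page was crawled, not that any particular model trained on it):

```python
# Sketch: list Common Crawl captures for your domain via the public CDX index.
# CRAWL is an example label; pick a current one from https://index.commoncrawl.org/.
import requests

CRAWL = "CC-MAIN-2024-33"
resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL}-index",
    params={"url": "example.com/*", "output": "json"},  # replace with your domain
    timeout=60,
)

# One JSON record per line; print the first few captures (if any).
for line in resp.text.splitlines()[:20]:
    print(line)
```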
- Legal, contractual and takedown steps
- If content infringes your copyright, issue DMCA takedown notices to the sites hosting it.
- Ask platforms to remove sensitive personal / secret data (credentials, PII).
- Review contracts with AI vendors — ask for provenance, data usage, and opt‑out rights. Many vendors will disclose when they use customer content in training; some offer opt‑outs.
- Use robots.txt and meta noindex for pages you don’t want crawled — this won’t undo existing copies but reduces future crawling.
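If you decide to opt out of AI-focused crawlers going forward, a robots.txt fragment along these lines is one option. GPTBot, CCBot, and Google-Extended are the tokens those operators document for opting out, but verify against each operator's current documentation, and remember this only affects future crawls by bots that honor it.

```
# robots.txt fragment (future crawls only; compliant bots only)
User-agent: GPTBot           # OpenAI's crawler
Disallow: /

User-agent: CCBot            # Common Crawl
Disallow: /

User-agent: Google-Extended  # Google's AI-training control token
Disallow: /
```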
- Preventive and corrective measures
- Publish canonical, high‑quality authoritative content (FAQs, knowledge base, schema.org structured data) so retrieval systems return your pages as the authoritative source.
- Add clear copyright and usage terms.
- Remove or restrict old content (archives, support threads) that leaks proprietary info.
- Where possible, move sensitive docs behind authentication or paywalls.
- Ongoing monitoring & escalation
- Create a dashboard: web crawl results + alerts + top‑matching sources from embedding checks.
- If you discover a model or service generating inaccurate or harmful content about your brand, contact the service provider with examples and request corrective action or source attribution.
Quick checklist (practical queries and actions)
- Run these searches now:
- site:example.com "YourBrand"
- "YourBrand" "user manual" / "support" / "password" / "API key"
- exact-phrase searches of suspicious model outputs in quotes
- Save model outputs you want traced (full text + timestamp).
- Build embeddings for your public corpus and run nearest‑neighbor matches against the model output.
- Ask your AI vendors for a list of sources used in any retrieval layer and for their data‑use policy.
Reality check
- You often cannot get a definitive list of training examples from a closed LLM; the best practical approach is a combination of web monitoring, similarity matching, vendor disclosure requests, and legal takedowns where appropriate.
If you want, I can:
- Draft exact search queries for your brand name and a few likely sensitive phrases.
- Provide a short script (Python) that computes embeddings and finds the closest web pages for a piece of model output. Which would you prefer?