Google's Gemini and other mainstream AI chatbots have begun surfacing real phone numbers associated with real people—not as a rare glitch, but as a recurring pattern significant enough that privacy removal companies are seeing a 400% surge in requests over seven months. The incidents are documented: a software developer in Israel was contacted on WhatsApp after Gemini provided his number in fabricated customer service instructions; a University of Washington PhD candidate retrieved a colleague's personal cell from the same tool; a Redditor reported his phone overwhelmed by misdirected calls from strangers seeking lawyers, product designers, and locksmiths, all courtesy of Google's AI. These aren't isolated cases surfaced by security researchers hunting for edge cases. They're reports from ordinary users encountering a tangible harm that cascades directly into their lives. The volume suggests this is happening far more widely than reaches public attention—most people whose information is exposed through AI systems never speak up.
The immediate cause traces back to personally identifiable information (PII) encoded in training data, though the exact mechanism remains opaque even to researchers. Generative AI models trained on internet-scale corpora inevitably absorb PII—phone numbers, addresses, email addresses, family names—because that information exists on the public web. When an AI system encounters a prompt or context that activates these associations, it can reproduce them with uncanny accuracy, or hallucinate plausible-but-wrong numbers that still harm innocent recipients. This isn't new as a risk category; AI researchers and privacy advocates have warned about it for years. What's new is watching it unfold at scale in production systems serving millions of users, with neither clear prevention mechanisms nor accountability structures in place. The warning came before the incident, and the industry shipped anyway.
This problem illuminates a fundamental tension baked into how modern generative AI scales. Training on web-broad data is the shortcut to capability—it's why these models are so useful and so dangerous. Filtering out all PII would require either massive curation overhead or a dramatic shrinkage in training sets, both incompatible with the business logic that drives AI development. So the industry has implicitly accepted that exposure is a side effect of capability, hoping it remains rare enough to go unnoticed. The data from DeleteMe—55% of concerns targeting ChatGPT, 20% Gemini, 15% Claude—shows the problem isn't vendor-specific; it's structural to how these systems work. Individual companies can tweak their policies or add guardrails, but the real risk lies in the training data itself, which most users have no way to audit or remove themselves from.
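To make the curation trade-off concrete, here is a minimal sketch, assuming a simple regex-based scrubbing pass run over raw web text before training; the patterns, placeholder tokens, and the scrub_pii function are illustrative assumptions, not any vendor's actual pipeline.

```python
import re

# Hypothetical, illustrative patterns only; a production pipeline would need
# locale-aware and far broader coverage than these two expressions.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")     # loose match for phone-number-like strings
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # simple email address shapes

def scrub_pii(text: str) -> str:
    """Replace obvious phone numbers and email addresses with placeholder tokens."""
    text = PHONE_RE.sub("[PHONE]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return text

sample = "Call Dana at +972 54-123-4567 or write dana@example.com for support."
print(scrub_pii(sample))
# Prints: Call Dana at [PHONE] or write [EMAIL] for support.
```

Even this toy filter both over-matches (timestamps, order numbers, coordinates) and under-matches (numbers written out in prose, unusual formats), which is the trade-off described above: either accept heavy curation cost and false positives, or accept that some real PII survives into the corpus.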
The exposure pattern reveals a deeper vulnerability: people whose information is most likely to be found online—developers, researchers, academics, entrepreneurs, any professional with a public presence—are simultaneously the most likely to adopt and use these tools. They're the early users, the beta testers, the ones building on top of these platforms. For this cohort, AI chatbots now represent an active privacy liability, not just a convenience. A researcher can't make Gemini unlearn her phone number once it's encoded in the model's weights. A developer can't opt out of being in someone else's training data. The asymmetry is stark: the people most dependent on these tools for productivity are the ones most exposed by them. Smaller or less-public individuals face less immediate risk, but that's only because they're less discoverable, not because they're protected.
From a competitive standpoint, this exposure creates friction in enterprise AI adoption. Companies evaluating Claude, ChatGPT, or Gemini for sensitive use cases must now contend with the fact that these systems might surface—or hallucinate—customer data, employee data, or proprietary information. Insurance companies, law firms, and healthcare organizations can't simply plug in a general-purpose chatbot without worrying about what it might leak. This isn't just a regulatory risk; it's a trust risk. The moment a customer learns their contact information was exposed through a vendor's AI system, the liability story becomes messier. Vendors face pressure to add guardrails, users face pressure to avoid PII-heavy queries, and the entire ecosystem fractures into privacy-conscious alternatives and convenience-first players.
The real question isn't whether this will happen again—it will—but whether the industry will develop meaningful solutions before regulation demands it. Right now, there's no opt-out mechanism, no clear removal path, no standardized liability framework. DeleteMe's surge in requests shows users are already responding by paying for privacy-removal services or adopting AI-avoidance strategies, which is inefficient and a sign that the market isn't solving the problem on its own. The next phase will likely involve either regulatory pressure demanding training data transparency and user consent mechanisms, or a bifurcation where privacy-critical applications refuse to use general-purpose models and instead adopt fine-tuned systems trained on curated data. The companies that can solve training data integrity—whether through synthetic data, explicit exclusion mechanisms, or consent frameworks—will gain an edge in regulated and enterprise markets. Until then, every conversation with a major AI chatbot carries a privacy cost that users bear quietly.
This article was originally published on MIT Technology Review — AI. Read the full piece at the source.
DeepTrendLab curates AI news from 50+ sources. All original content and rights belong to MIT Technology Review — AI. DeepTrendLab's analysis is independently written and does not represent the views of the original publisher.