How Does Synthetic Data Impact AI Hallucinations?
Although synthetic data is a powerful tool, it can only reduce artificial intelligence hallucinations under specific circumstances. In almost every other case, it will amplify them. Why is this? What does this phenomenon mean for those who have invested in it?
How Is Synthetic Data Different From Real Data?
Synthetic data is information that is generated by AI. Instead of being collected from real-world events or observations, it is produced artificially. However, it resembles the original just enough to produce accurate, relevant output. That’s the idea, anyway.
To create an artificial dataset, AI engineers train a generative algorithm on a real relational database. When prompted, it produces a second set that closely mirrors the first but contains no genuine information. While the general trends and mathematical properties remain intact, there is enough noise to mask the original relationships.
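To make that concrete, here is a minimal sketch, assuming purely numeric columns and a simple multivariate-normal fit standing in for a full generative model. The column names are hypothetical; the point is that the synthetic table keeps the original's means and correlations without copying any real row.

```python
import numpy as np
import pandas as pd

def generate_synthetic(real_df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Sample rows that mirror the mean and correlation structure of real_df."""
    rng = np.random.default_rng(seed)
    mean = real_df.mean().to_numpy()
    cov = real_df.cov().to_numpy()               # keeps pairwise relationships intact
    samples = rng.multivariate_normal(mean, cov, size=n_rows)
    return pd.DataFrame(samples, columns=real_df.columns)

# Hypothetical "real" table: no synthetic row corresponds to an actual record.
real = pd.DataFrame({
    "age": np.random.default_rng(1).normal(45, 12, 500),
    "income": np.random.default_rng(2).normal(60_000, 15_000, 500),
})
synthetic = generate_synthetic(real, n_rows=500)
print(synthetic.describe())                      # trends match; individuals do not
```

Production tools model far richer structure than this, including categorical fields and nonlinear dependencies, but the principle is the same: learn the statistics, then sample fresh rows.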
An AI-generated dataset goes beyond deidentification, replicating the underlying logic of relationships between fields instead of simply replacing fields with equivalent alternatives. Since it contains no identifying details, companies can use it to skirt privacy and copyright regulations. More importantly, they can freely share or distribute it without fear of a breach.
However, fake information is more commonly used for supplementation. Businesses can use it to enrich or expand sample sizes that are too small, making them large enough to train AI systems effectively.
Does Synthetic Data Minimize AI Hallucinations?
Sometimes, algorithms reference nonexistent events or make logically impossible suggestions. These hallucinations are often nonsensical, misleading or incorrect. For example, a large language model might write a how-to article on domesticating lions or becoming a doctor at age 6. However, they aren’t all this extreme, which can make recognizing them challenging.
If appropriately curated, artificial data can mitigate these incidents. A relevant, authentic training database is the foundation for any model, so it stands to reason that the more details someone has, the more accurate their model’s output will be. A supplementary dataset enables scalability, even for niche applications with limited public information.
Debiasing is another way a synthetic database can minimize AI hallucinations. According to the MIT Sloan School of Management, it can help address bias because it is not limited to the original sample size. Professionals can use realistic details to fill the gaps where select subpopulations are under- or overrepresented.
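A minimal sketch of that gap-filling is below, assuming a hypothetical "group" column and a generator like the one sketched earlier. A real debiasing pipeline would also validate the rebalanced data against downstream fairness metrics rather than trusting the counts alone.

```python
import pandas as pd

def rebalance(df: pd.DataFrame, group_col: str, generator) -> pd.DataFrame:
    """Top up every subgroup to the size of the largest one using synthetic rows."""
    target = df[group_col].value_counts().max()
    pieces = [df]
    for group, subset in df.groupby(group_col):
        shortfall = target - len(subset)
        if shortfall > 0:
            synth = generator(subset.drop(columns=[group_col]), n_rows=shortfall)
            synth[group_col] = group             # tag synthetic rows with their subgroup
            pieces.append(synth)
    return pd.concat(pieces, ignore_index=True)

# Usage (hypothetical): balanced = rebalance(real, "group", generate_synthetic)
```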
How Artificial Data Makes Hallucinations Worse
Since intelligent algorithms cannot reason or contextualize information, they are prone to hallucinations. Generative models — pretrained large language models in particular — are especially vulnerable. In some ways, synthetic data compounds the problem.
Bias Amplification
Like humans, AI can learn and reproduce biases. If an artificial database overvalues some groups while underrepresenting others — which is concerningly easy to do accidentally — its decision-making logic will skew, adversely affecting output accuracy.
A similar problem may arise when companies use fake data to eliminate real-world biases because it may no longer reflect reality. For example, since over 99% of breast cancers occur in women, using supplemental information to balance representation could skew diagnoses.
Intersectional Hallucinations
Intersectionality is a sociological framework that describes how demographics like age, gender, race, occupation and class intersect. It analyzes how groups’ overlapping social identities result in unique combinations of discrimination and privilege.
When a generative model is asked to produce artificial details based on what it was trained on, it may generate combinations that did not exist in the original data or are logically impossible.
Ericka Johnson, a professor of gender and society at Linköping University, worked with a machine learning scientist to demonstrate this phenomenon. They used a generative adversarial network to create synthetic versions of United States census figures from 1990.
Right away, they noticed a glaring problem. The artificial version had categories titled “wife and single” and “never-married husbands,” both of which were intersectional hallucinations.
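A simple curation step can catch some of these. The sketch below, with hypothetical column names and rules, flags generated records whose category combinations are logically impossible, mirroring the "wife and single" example.

```python
import pandas as pd

# Hypothetical consistency rules: each dictionary describes a combination
# of values that cannot co-occur in a valid record.
IMPOSSIBLE_COMBOS = [
    {"relationship": "wife", "marital_status": "never married"},
    {"relationship": "husband", "marital_status": "never married"},
]

def flag_intersectional_hallucinations(df: pd.DataFrame) -> pd.Series:
    """Return a boolean mask marking rows that violate any consistency rule."""
    mask = pd.Series(False, index=df.index)
    for rule in IMPOSSIBLE_COMBOS:
        hit = pd.Series(True, index=df.index)
        for col, value in rule.items():
            hit &= df[col].str.lower().eq(value)
        mask |= hit
    return mask

# Usage (hypothetical): synthetic = synthetic[~flag_intersectional_hallucinations(synthetic)]
```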
Without proper curation, a replica database will always overrepresent dominant subpopulations while underrepresenting, or even excluding, smaller ones. Edge cases and outliers may be ignored entirely in favor of dominant trends.
Model Collapse
An overreliance on artificial patterns and trends leads to model collapse — where an algorithm’s performance drastically deteriorates as it becomes less adaptable to real-world observations and events.
This phenomenon is particularly apparent in next-generation generative AI. Repeatedly training new models on the synthetic output of earlier ones creates a self-consuming loop. One study found that quality and recall decline progressively when each generation lacks enough recent, real data.
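The dynamic is easy to simulate. The toy sketch below is not any particular study's method; it simply fits a distribution to each generation's output, favours high-probability samples the way many generative models do, and shows the spread of the data collapsing.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=2_000)                # generation 0: "real" data

for generation in range(1, 8):
    mu, sigma = data.mean(), data.std()                # fit the current model
    samples = rng.normal(mu, sigma, size=4_000)        # generate synthetic data
    keep = np.abs(samples - mu) < sigma                # bias toward likely outputs
    data = samples[keep][:2_000]                       # next generation trains on this
    print(f"gen {generation}: std = {data.std():.3f}") # spread shrinks every round
```

Each round discards the tails, so the standard deviation falls generation after generation, a crude stand-in for the loss of diversity and recall seen in real collapsed models.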
Overfitting
Overfitting occurs when a model learns its training data too closely. It performs well initially but will hallucinate when presented with new data points. Synthetic information can compound this problem if it does not accurately reflect reality.
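One practical guardrail is to always hold out real records for evaluation. The sketch below, using scikit-learn with placeholder inputs, compares accuracy on the synthetic training set against accuracy on held-out real data; a wide gap is a warning sign.

```python
from sklearn.metrics import accuracy_score

def check_overfit(model, X_synth, y_synth, X_real, y_real):
    """Fit on synthetic data, then compare synthetic vs. real-world accuracy."""
    model.fit(X_synth, y_synth)
    synth_acc = accuracy_score(y_synth, model.predict(X_synth))
    real_acc = accuracy_score(y_real, model.predict(X_real))
    return synth_acc, real_acc   # a wide gap signals overfitting to synthetic patterns

# Usage (hypothetical): any scikit-learn estimator works as `model`, e.g.
# check_overfit(RandomForestClassifier(random_state=0), Xs, ys, Xr, yr)
```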
The Implications of Continued Synthetic Data Use
The synthetic data market is booming. Companies in this niche industry raised around $328 million in 2022, up from $53 million in 2020 — a roughly 518% increase in about two years. It's worth noting that this is solely publicly known funding, meaning the actual figure may be even higher. It's safe to say firms are incredibly invested in this solution.
If firms continue using an artificial database without proper curation and debiasing, their model’s performance will progressively decline, souring their AI investments. The results may be more severe, depending on the application. For instance, in health care, a surge in hallucinations could result in misdiagnoses or improper treatment plans, leading to poorer patient outcomes.
The Solution Won’t Involve Returning to Real Data
AI systems need millions, if not billions, of images, videos and text documents for training, much of which is scraped from public websites and compiled in massive, open datasets. Unfortunately, algorithms consume this information faster than humans can generate it. What happens when they learn everything?
Business leaders are concerned about hitting the data wall — the point at which all the public information on the internet has been exhausted. It may be approaching faster than they think.
Even though both the amount of plaintext on the average Common Crawl webpage and the number of internet users are growing by 2% to 4% annually, algorithms are running out of high-quality data. Just 10% to 40% can be used for training without compromising performance. If trends continue, the human-generated public information stock could run out by 2026.
In all likelihood, the AI sector may hit the data wall even sooner. The generative AI boom of the past few years has increased tensions over information ownership and copyright infringement. More website owners are using Robots Exclusion Protocol — a standard that uses a robots.txt file to block web crawlers — or making it clear their site is off-limits.
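For reference, this is roughly how a well-behaved crawler consults robots.txt before fetching a page, using Python's standard library. The URL and user agent here are placeholders.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # sites publish their crawl rules here
rp.read()

page = "https://example.com/articles/some-post"
if rp.can_fetch("MyTrainingCrawler", page):
    print("Allowed to crawl this page")
else:
    print("Blocked by robots.txt; skip it")
```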
A 2024 study published by an MIT-led research group revealed that restrictions on the Colossal Clean Crawled Corpus (C4), a large-scale web crawl dataset, are on the rise. Over 28% of the most active, critical sources in C4 were fully restricted. Moreover, 45% of C4 is now designated off-limits by terms of service.
If firms respect these restrictions, the freshness, relevancy and accuracy of real-world public facts will decline, forcing them to rely on artificial databases. They may not have much choice if the courts rule that any alternative is copyright infringement.
The Future of Synthetic Data and AI Hallucinations
As copyright laws modernize and more website owners hide their content from web crawlers, artificial dataset generation will become increasingly popular. Organizations must prepare to face the threat of hallucinations.