A team led by computer scientists from examined ten of the most-cited datasets used to test systems. They found that around 3.4 percent of the data was inaccurate or mislabeled, which could cause problems in AI systems that use these datasets.
The datasets, which have each been cited more than 100,000 times, include text-based ones from newsgroups, and . Errors emerged from issues like Amazon product reviews being mislabeled as positive when they were actually negative and vice versa.
Some of the image-based errors result from mixing up animal species. Others arose from mislabeling photos with less-prominent objects (“water bottle” instead of the mountain bike it’s attached to, for instance). One particularly galling example that emerged was a baby being confused for a nipple.
centers around audio from YouTube videos. of a YouTuber talking to the camera for three and a half minutes was labeled as “church bell,” even though one could only be heard in the last 30 seconds or so. Another error emerged from a misclassification of as an orchestra.
To find possible errors, the researchers used a framework called , which examines datasets for label noise (or irrelevant data). They validated the possible mistakes using , and found around 54 percent of the data that the algorithm flagged had incorrect labels. The researchers found the had the most errors with around 5 million (about 10 percent of the dataset). The team so that anyone can browse the label errors.
Some of the errors are relatively minor and others seem to be a case of splitting hairs (a closeup of a Mac command key labeled as a “computer keyboard” is still correct). Sometimes, the confident learning approach got it wrong too, like confusing a correctly labeled image of tuning forks for a menorah.
If labels are even a little off, that could lead to huge ramifications for machine learning systems. If an AI system can’t tell the difference between a grocery and a bunch of crabs, it’d be hard to trust it with pouring you a drink.