[New Wisdom Intro] The predicted 2026 data shortage is drawing closer, and Silicon Valley companies have gone into a frenzy over AI training data! They have spent billions of dollars in hopes of unearthing every photo, video, and chat log from every corner of the Internet. But what happens if an AI one day spits our selfies or private chats back out?
Who would have thought that our years-old chat logs and long-forgotten social media photos would suddenly become hot commodities, snapped up by big tech companies?
Silicon Valley's giants are now racing one another to license every scrap of usable Internet data, and the frenzy is almost overwhelming.
The old data from the image hosting website Photobucket has been ignored for many years, but now it is being snapped up by major Internet companies to train AI models.
To that end, the tech giants are paying real money: roughly 5 cents to $1 per photo and more than $1 per video, depending on the buyer and the type of material.
In short, in order to purchase AI training data, giants have launched an underground competition!
The recent blunder by Meta's image generator has exposed the "stereotypes" baked into AI training data.
If the bias in the data fed to these models cannot be corrected, major companies will inevitably face storms of public criticism.
Meta’s AI drawing tool cannot draw “Asian man and white wife” or “Asian woman and white husband”
Giants spend billions of dollars just to buy data “gold”
According to Reuters, at its peak in the 2000s, Photobucket had 70 million users. Today, the number of users of this top website has plummeted to 2 million.
But generative AI has brought new life to this company.
CEO Ted Leonard happily revealed that many technology companies have come to the door, willing to pay heavily to buy the company's 13 billion photos and videos.
The purpose, of course, is to train AI, and to get this data, the big companies are more than willing to pay dearly.
And they want more: one buyer reportedly asked for over 1 billion videos, far more than Photobucket could supply.
According to rough estimates, the data held by Photobucket is likely worth billions of dollars.
OpenAI mired in lawsuits: copyright is too sensitive
Now it seems that everyone’s data is not enough.
According to analysis by the Epoch Institute, technology companies are likely to exhaust all high-quality data on the Internet by 2026 because they consume data at a rate that far exceeds the rate at which it is generated!
The data for training ChatGPT is scraped from the Internet for free.
The source of Sora's training data remains unknown, and CTO Mira Murati's evasive answers in an interview nearly landed OpenAI in yet another public-relations disaster.
Although OpenAI says that its approach is completely legal, there are still a lot of copyright lawsuits waiting for them ahead.
Other big tech companies have followed suit and are quietly paying to lock content behind paywalls and login screens.
Now, whether it's an old chat log or an old faded photo on forgotten social media, it suddenly becomes something worth a lot of money.
Major companies have mobilized one after another, eager to secure licenses from copyright holders; after all, content locked in private collections cannot be scraped.
Foreign media reporters interviewed more than 30 professionals and found that hidden behind this is a gold market.
Although many companies stay silent about the size of this opaque AI data market, researchers at Business Research Insights estimate it at roughly $2.5 billion today and predict it could grow to nearly $30 billion within a decade.
A data gold rush that has data vendors celebrating
For technology companies, losing access to freely crawled web archives such as Common Crawl would mean staggering costs.
But a series of copyright lawsuits and a regulatory boom have left them with no choice.
In fact, a new industry has emerged in Silicon Valley – data brokers.
And picture and video suppliers also made a lot of money.
The fastest movers have already acted. Within months of ChatGPT's debut in late 2022, Meta, Google, Amazon, and Apple had struck deals with stock-media provider Shutterstock to train on the hundreds of millions of images, videos, and music files in its library.
The deals ranged from $25 million to $50 million, according to figures disclosed by Shutterstock's chief financial officer.
Freepik, a Shutterstock competitor, already has two big buyers and is licensing most of its 200 million images at 2 to 4 cents apiece.
OpenAI is certainly not far behind. Not only is it an early customer of Shutterstock, but it has also signed licensing deals with at least four news organizations, including the Associated Press.
Make content “ethical”
Also emerging at the same time is the AI data customization industry.
These companies license real-world content such as podcasts, short-form videos, and digital-assistant interactions, while also building networks of short-term contractors to produce custom visuals and voice samples from scratch.
One representative player, Defined.ai, has sold its content to major technology companies including Google, Meta, Apple, Amazon, and Microsoft.
Its going rates: $1 to $2 per image, $2 to $4 per short video, $100 to $300 per hour of feature film, and $0.001 per word of text.
Nude images, which are more troublesome to handle, fetch $5 to $7 because they still require post-processing.
The owners of these photos, podcasts, and medical records also receive 20% to 30% of the total transaction value.
One Brazilian data dealer said that in order to obtain images of crime scenes, conflict violence and surgeries, he needed to buy them from the police, freelance photojournalists and medical students.
He added that his company hired nurses accustomed to seeing violent injuries to desensitize and label the images, which would be disturbing to the untrained eye.
However, the “fuel” of these AI models is likely to cause serious problems, such as – spitting out user privacy.
Experts have found that AI regurgitates training data, spitting out Getty Images watermarks, outputting verbatim passages from New York Times articles, and even recreating images of real people.
In other words, private photos or private thoughts that someone posted decades ago may have been spat out by the AI model without their knowledge!
There is currently no effective solution to these hidden dangers.
One survey shows that users would pay an extra $1 per month to keep their personal data out of third-party hands.
Altman, too, is eyeing synthetic data
In addition, Sam Altman has also seen the future of synthetic data.
Such data is not created directly by humans; it is text, images, and code generated by AI models themselves. In other words, these systems would progress by learning from their own output.
Since AI can already produce near-human text, it could in principle generate its own training material, helping itself evolve into a more advanced version.
As long as we can cross the critical threshold of synthetic data, that is, allow the model to independently create high-quality synthetic data, then all problems will be solved.
——Sam Altman
But is it really that easy?
AI researchers have been working with synthetic data for years, but building an AI system that can train itself is no easy task.
Experts have found that models that rely solely on self-generated data may repeat their own mistakes and limitations, becoming trapped in a self-reinforcing cycle.
The data these systems need is like finding a path through a jungle; if they rely solely on synthetic data, they can get lost in it.
——Jeff Clune, former OpenAI researcher and current computer science professor at the University of British Columbia
In this regard, OpenAI is exploring how to let two different artificial intelligence models collaborate to jointly generate higher quality and more reliable synthetic data. One of them is responsible for generating the data and the other is responsible for the evaluation.
Whether this approach is effective is unknown.
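As a loose illustration of that division of labor, here is a minimal sketch in Python. The `generate` and `score` functions are hypothetical stand-ins for calls to two separate models; nothing here reflects OpenAI's actual implementation.

```python
# Minimal sketch of a generator/evaluator loop for synthetic data.
# `generate` and `score` are hypothetical stand-ins for two models.

def generate(prompt: str) -> str:
    """Stand-in for the generator model: proposes a candidate training example."""
    return f"synthetic answer to: {prompt}"

def score(example: str) -> float:
    """Stand-in for the evaluator model: rates example quality in [0, 1]."""
    return 0.9 if example.strip() else 0.0

def build_synthetic_dataset(prompts, threshold=0.5):
    """Keep only candidates the evaluator rates at or above the threshold."""
    kept = []
    for p in prompts:
        candidate = generate(p)
        if score(candidate) >= threshold:
            kept.append(candidate)
    return kept

print(build_synthetic_dataset(["What is 2+2?"]))
```

The point of the second model is that filtering happens before anything enters the training set, which is one way to keep a self-training loop from amplifying its own mistakes.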
“Scale” Is All You Need
Why is data so important to AI models? This starts with the following paper.
In January 2020, Jared Kaplan, a theoretical physicist at Johns Hopkins University, and nine OpenAI researchers published a landmark artificial intelligence paper.
They came to a clear conclusion: the more data used to train a large language model, the better its performance.
Soon, “as long as the scale is large enough, everything is possible” became a consensus in the AI field.
Paper address: https://arxiv.org/abs/2001.08361
In November 2020, OpenAI launched GPT-3, trained on what was then the largest dataset ever used: about 300 billion tokens.
After ingesting this data, GPT-3 demonstrated astonishing text-generation abilities: it could write blog posts, poems, and even its own computer programs.
By today's standards, though, that dataset looks rather small.
By 2022, DeepMind had pushed training data to 1.4 trillion tokens, more than Dr. Kaplan's paper had anticipated.
However, this record did not last long.
In 2023, Google's PaLM 2 reached 3.6 trillion training tokens, nearly double the number of manuscripts collected by Oxford's Bodleian Library since 1602.
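For a sense of what the Kaplan paper's conclusion looks like numerically, the dataset-size term of its scaling law, L(D) = (D_c / D)^alpha_D, can be evaluated directly. The constants below are the approximate fitted values reported in the paper; treat this as an illustration, not a reproduction of any lab's actual numbers.

```python
# Dataset-size scaling law from Kaplan et al. (2020): loss falls as a
# power law in the number of training tokens D. Constants are the
# paper's approximate fits (alpha_D ~ 0.095, D_c ~ 5.4e13 tokens).

def predicted_loss(tokens: float, alpha_d: float = 0.095, d_c: float = 5.4e13) -> float:
    """Cross-entropy loss predicted from dataset size alone."""
    return (d_c / tokens) ** alpha_d

# Token counts mentioned above: GPT-3, DeepMind's 2022 model, PaLM 2.
for d in (300e9, 1.4e12, 3.6e12):
    print(f"{d:.1e} tokens -> predicted loss {predicted_loss(d):.3f}")
```

Each step up in data lowers the predicted loss, but with an exponent as small as 0.095 the returns diminish quickly, which helps explain why the appetite for tokens grows so fast.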
To train GPT-4, OpenAI used 1 million+ hours of YouTube videos for free
But as OpenAI CEO Sam Altman said, AI will eventually consume all available data resources on the Internet.
This is not a prophecy, nor is it alarmist – because Altman himself has seen it happen.
At OpenAI, research teams have been collecting, cleaning, and assembling data into huge libraries of text for years to train the company's language models.
They pulled information from GitHub, a computer code repository, assembled a database of chess moves, and used data on high school exams and assignments from the website Quizlet.
However, by the end of 2021, these data resources have been exhausted.
To develop the next-generation AI model, OpenAI president Greg Brockman decided to step in personally.
Under his leadership, the team developed Whisper, a brand-new speech recognition tool that can quickly and accurately transcribe podcasts, audiobooks, and videos.
With Whisper, OpenAI quickly transcribed more than 1 million hours of YouTube videos, and Brockman personally participated in the collection work.
Everyone knows how the story ends: fueled by this high-quality data, GPT-4, the most powerful model on earth at the time, was born.
Google: Me too
Interestingly, Google had long known that OpenAI was harvesting YouTube videos for data, yet never moved to stop it.
You guessed it right, Google is also using YouTube videos to train its own AI models.
Criticizing OpenAI's behavior would not only have exposed Google's own practices, it might have triggered an even stronger public backlash.
Not only that, but the billions of text data stored in applications such as Google Docs and Google Sheets are also Google's targets.
In June 2023, Google's legal department asked the privacy team to modify the terms of service to expand the company's use of consumer data.
In other words, it paves the way for companies to develop a series of AI products using content shared publicly by users.
According to employees, they were specifically instructed to release the new terms in July, when everyone's attention would be on the upcoming holidays.
The new terms released on July 1 allow Google to use this data not only to develop language models, but also to create a wide range of AI technologies and products like Google Translate, Bard, and Cloud AI.
Meta runs short on data, forcing executives into daily meetings
Also catching up with OpenAI is Meta.
Determined to surpass ChatGPT, Zuckerberg pressed the company's executives and engineers around the clock to speed up development of a rival chatbot.
However, by early last year, Meta had run into the same problem as its competitors: not enough data.
Although Meta commands an enormous social network, its users do not tend to keep their posts around (many delete old ones), and Facebook has never been a place for high-quality long-form writing.
Ahmad Al-Dahle, Meta's vice president of generative AI, told executives that to develop a model, his team had already used almost every English-language book, essay, poem, and news article available on the Internet.
But this is not enough.
From March to April 2023, the company's business development leaders, engineers and lawyers held intensive meetings almost every day to try to find a solution.
They considered the possibility of paying $10 per copy for full rights to new books and discussed the idea of acquiring Simon & Schuster, which publishes works by authors such as Stephen King.
They also discussed summarizing books, papers, and other works from the Internet without permission, and considered "absorbing" even more content, despite the risk of legal action.
Fortunately, OpenAI, an industry benchmark, has used copyrighted materials without authorization, and Meta may be able to refer to this “market precedent.”
According to meeting recordings, Meta's executives decided to lean on the 2015 court ruling in Authors Guild v. Google.
In that case, Google was permitted to scan, digitize, and catalog books in an online database because it reproduced only snippets online and transformed the originals; the court deemed this fair use.
During the meeting, lawyers for Meta said that using data to train artificial intelligence systems should also be considered fair use.
But even so, Meta still doesn’t seem to have enough data…
AI photo-generating tool refuses to take photos of “whites and Asians”
Recently, a reporter from foreign media The Verge discovered after many attempts that Meta’s AI image generation tool cannot create a picture of an East Asian man and a white woman in the same frame.
Whether the prompt was "Asian man and white friend," "Asian man and white wife," "Asian woman and white husband," or the tweaked "Asian man and white woman smiling with a dog," none of them worked.
Changing "white" to "Caucasian" made no difference.
For example, the prompt "wedding day of an Asian man and a Caucasian woman" returned an image of an Asian man in a suit beside an Asian woman in a cheongsam/kimono hybrid…
It’s really weird that AI can’t imagine Asians and white people standing side by side.
Moreover, there are more subtle biases hidden in the generated content.
For example, Meta consistently renders "Asian women" with East Asian features, seemingly ignoring that India is the world's most populous country; meanwhile, "Asian men" skew older while Asian women are always young.
By contrast, OpenAI's DALL-E 3 shows no such problem.
Some netizens have suggested that the problem arose because Meta's training data lacked enough examples of such scenes.
In short, the issue is not the code itself but a training dataset that is not rich enough to cover all possible scenarios.
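One way to make that "coverage" explanation concrete is to count how often each combination of attributes co-occurs in training captions and look for pairs that never appear. The toy captions below are invented purely for illustration; they are not Meta's data.

```python
# Toy coverage audit: count ethnicity pairs co-occurring in captions.
# Combinations with a count of zero are exactly where a generator
# has nothing to learn from. Captions are invented for illustration.
from collections import Counter

captions = [
    "asian man with asian woman",
    "white man with white woman",
    "asian woman with asian man",
]

pair_counts = Counter()
for caption in captions:
    ethnicities = tuple(sorted({w for w in caption.split() if w in ("asian", "white")}))
    pair_counts[ethnicities] += 1

# The mixed pair never appears in this toy corpus.
print(pair_counts.get(("asian", "white"), 0))  # -> 0
```

A gap like this in the counts is consistent with a model that simply has no examples of the missing combination to draw on, regardless of how sound its code is.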
But deeper than that, AI’s behavior is a reflection of the biases of its creators.
In American media, “Asians” usually refers to East Asians. Asians who do not fit this single image are almost erased from cultural consciousness, and even those who do fit are marginalized in mainstream media.
And this is just a part of AI bias caused by data.
References:
https://www.reuters.com/technology/inside-big-techs-underground-race-buy-ai-training-data-2024-04-05/
https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html
https://www.theverge.com/2024/4/3/24120029/instagram-meta-ai-sticker-generator-asian-people-racism