Big Tech's secret race for AI training data intensifies
The pursuit of artificial intelligence (AI) training data has led Big Tech companies into a clandestine market, where they vie for vast collections of digital content.
Photobucket, once a dominant player and now down to only 2 million users, finds its archive of 13 billion photos and videos a potential goldmine for training generative AI models. CEO Ted Leonard disclosed ongoing discussions with tech giants to license this trove, with prices ranging widely based on the content and buyer.
This emerging market is propelled by the need for "foundation" AI models to learn from massive datasets. Initially, companies like Google, Meta, and Microsoft relied on freely scraped internet data but now seek legally and ethically sourced content to mitigate copyright and privacy concerns. For instance, Shutterstock has struck significant deals with Meta, Google, Amazon, and Apple, licensing its extensive library for AI training.
The demand for data extends beyond existing web content to include specially created or sourced materials, like podcasts, short-form videos, and even sensitive images used for content moderation training. Companies are willing to pay top dollar for high-quality, "ethically sourced" data that respects copyright and privacy norms.
However, this practice raises legal and ethical questions, especially when it involves personal data from old social media platforms. The industry is grappling with how to ensure privacy and consent when using such data, highlighting the complex balance between technological advancement and ethical responsibility.