Tumblr and WordPress prepare to sell user data to Midjourney and OpenAI

Internal documents obtained by 404 Media show that Tumblr and WordPress are preparing to sell user data to the generative AI companies Midjourney and OpenAI. Although the documents do not specify which types of data are intended for each company, 404 Media also had access to internal exchanges clearly indicating that the agreements between Automattic, the parent company of the two platforms, and the AI companies are imminent.

“Internal documentation details a complicated and controversial process within Tumblr itself”, writes the 404 Media reporter. She had access to an internal message written by Cyle Gage, product manager at Tumblr, indicating that a query run to prepare data for OpenAI and Midjourney compiled a large number of user messages that it was not supposed to include. It is not clear whether this data has already been sent or whether the intention is to set up a process to clean the data before sending it.

A massive transfer of content to Midjourney and OpenAI, private data included

Data covering Tumblr's public content between 2014 and 2023 was therefore compiled to be sent to the two AI companies. The only problem: private data was also included. This covers private messages on public blogs, posts on deleted or suspended blogs, unanswered requests (normally not public until they are answered), private responses (these appear only to the recipient and are not public), and messages marked “explicit” / NSFW / “mature”.

Content from premium partner blogs also appears to have been swept up. In his message, Cyle Gage specifies that these may be “special brand blogs like the old Apple music blog, for example, which spent money with us on an advertising campaign”. The Tumblr product manager also seems somewhat unsure on this subject, noting that this content “may contain creations that do not belong to us and that we do not have the right to share”.

Additional settings to protect private data

On February 27, Automattic published a statement that was ambiguous, to say the least. Under the title “Protect user choice”, the firm writes that while AI is rapidly transforming the way we create and consume content, “(it has) always believed in a free and open web and in individual choice” and attaches great importance to respecting the preferences of its users. It has therefore released more options intended to strengthen users' control over the content they create on WordPress.com and Tumblr.

“We currently block, by default, major AI platform crawlers, including those from the largest technology companies, and update our lists as new ones are launched.” Automattic has long offered a setting to discourage search engines from indexing a site on WordPress.com and Tumblr, and has just added similar settings on both platforms to discourage crawling by AI companies and prevent any sharing of data with third parties.

Seeking to reassure users about the content actually shared with third parties, Automattic stated: “We will only share public content hosted on WordPress.com and Tumblr from profiles that have not opted in to this setting.” It added: “We also plan to go further by regularly notifying all our partners of people who have recently opted out and requesting that their content be removed from past sources and future training.”

A practice shared by other companies, with little regard for users

The story is reminiscent of the partnership concluded between Reddit and Google. A few weeks before its IPO, the platform sought to prove its economic potential. It therefore signed an agreement worth $60 million a year with the search giant, allowing Google to train its models on the platform's content. Data could thus be collected without users having explicitly given their authorization.

OpenAI has also appealed for data donations. On November 9, 2023, the firm unveiled its OpenAI Data Partnerships initiative, which asks for contributions to improve its AI models without any remuneration in return, on the grounds that its models “will benefit all of humanity”. It is not clear that the argument has really worked so far.

Will the AI Act protect users from this “data vacuuming”?

To date, no regulation requires these crawlers to respect the preferences described by Automattic. However, this could change with the draft European regulation on artificial intelligence, the AI Act, which was unanimously approved by the ambassadors of the twenty-seven countries of the European Union meeting in Brussels on February 2.
