The A.I. industry confronts life after data scraping – Fortune

Hello and welcome to Eye on A.I. This past week, 12 data protection watchdogs from around the globe came together to issue a joint statement addressing data scraping and its effects on privacy. 
The statement (signed by privacy officials from Australia, Canada, Mexico, China, Switzerland, Colombia, Argentina, and the U.K., to name a few) takes aim at website operators, specifically social media companies, and states they have obligations under data protection and privacy laws to protect information on their platforms from unlawful data scraping. Even publicly accessible personal information is subject to these laws in most jurisdictions, asserts the statement. Notably, the statement also outlines that data scraping incidents that harvest personal information can constitute reportable data breaches in many jurisdictions.
In addition to publishing the statement, the authors state they sent it directly to Alphabet (YouTube), ByteDance (TikTok), Meta (Instagram, Facebook, and Threads), Microsoft (LinkedIn), Sina Corp (Weibo), and X Corp. (X, previously Twitter). They also suggest a series of controls these companies should have in place to safeguard users against harms associated with data scraping, including designating a team to monitor for and respond to scraping activities.
The potential harms outlined include cyberattacks, identity fraud, surveillance, unauthorized political or intelligence gathering, and unwanted marketing and spam. But while artificial intelligence isn’t once mentioned in the statement, it’s increasingly becoming a major flash point in this issue.
Scraping the internet—including the information on social media sites—is exactly how A.I. powerhouses like OpenAI, Meta, and Google obtained much of the data to train their models. And just in the past few weeks, data scraping has emerged as a major battlefront in the new A.I. landscape. The New York Times, for example, earlier this month updated its terms of service to prevent A.I. scraping of its content, and now the publisher is exploring suing OpenAI over the matter. This follows a proposed class-action lawsuit against OpenAI and investor Microsoft filed in June, which alleged the firm secretly scraped the personal information of hundreds of millions of users from the internet without notice, consent, or just compensation.  
A strongly worded letter is extremely unlikely to impact anything these tech giants do, but lawsuits and regulations against data scraping very well could. In the EU, for example, where data privacy and now A.I. regulation are moving fairly quickly, governmental bodies are increasingly scrutinizing data scraping.
At its heart, A.I. is about data. So this raises the question: If companies aren’t able to freely scrape data, where will they get the data needed to train their models?
One option is synthetic data, which refers to information that’s artificially generated rather than created by real-world events. This process often, but not always, involves using A.I. itself to create a large dataset of synthetic data from a smaller set of real-world data, with the resulting synthetic data mirroring the statistical properties of the real-world data. 
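To make the idea concrete, here is a minimal sketch in Python of the simplest statistical approach: fit a distribution to a small real dataset, then sample a much larger synthetic one from it. The dataset, column meanings, and function names are all hypothetical, and the method (an independent normal distribution per column) is an assumption chosen for brevity, not how any particular vendor does it.

```python
import random
import statistics

def fit_columns(rows):
    """Fit an independent normal distribution to each numeric column."""
    columns = list(zip(*rows))
    return [statistics.NormalDist.from_samples(col) for col in columns]

def synthesize(dists, n, seed=0):
    """Draw n synthetic rows from the fitted per-column distributions."""
    # One seeded sample list per column, then zip the columns back into rows.
    cols = [d.samples(n, seed=seed + i) for i, d in enumerate(dists)]
    return list(zip(*cols))

# Hypothetical "real" dataset: (age, monthly_spend) for six users.
real_rows = [(23, 40.0), (31, 55.5), (35, 61.0), (42, 72.5), (51, 80.0), (64, 95.0)]

dists = fit_columns(real_rows)
synthetic_rows = synthesize(dists, n=5000)

real_mean_age = statistics.mean(r[0] for r in real_rows)
synth_mean_age = statistics.mean(r[0] for r in synthetic_rows)
print(f"real mean age: {real_mean_age:.1f}, synthetic mean age: {synth_mean_age:.1f}")
```

The synthetic rows match the mean and spread of each real column, but because every column is modeled separately, the link between age and spending is lost, and no unusual users beyond what a bell curve allows will ever appear. Production systems typically use richer generative models to narrow those gaps.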
As long as the original data isn’t scraped, this could be a viable solution. Gartner estimates that synthetic data will overtake real-world data in A.I. models by 2030. But synthetic data has its drawbacks. For example, it can miss outliers and introduce inaccuracies, and it ideally requires extra verification steps that slow down the process. And while some companies claim synthetic data eliminates bias, many experts dispute this and see ways some forms of synthetic data can actually introduce additional biases into datasets.
Another potential solution is opt-in first-party data. Unlike how real-world data has historically been scraped, used without permission, and even sold out from under users, this is real-world data that is opt-in and provided voluntarily.
Miami-based Streamlytics is one company working in the emerging opt-in first-party data space with the goal of making data streams more ethical. The company pays users to download their own data from sites they use, such as Netflix, and upload it to Streamlytics, which then packages it up and sells it to customers looking to purchase it. Customers can request specific types of data that they need, and users maintain ownership of the data and can request it be deleted at any time.
Founder and CEO Angela Benton told Eye on A.I. that her company has seen “a remarkable upsurge in interest” amid the current generative A.I. boom. A lot of that interest, she said, is from small and medium-sized businesses that are looking for solutions to train custom A.I. models. 
“In most cases, because of the size of these businesses, they lack the scale of data needed to train and customize their models,” she said. “They are actively seeking out solutions that can provide the data that they need and most are inclined towards models that are ethical from the ground up.”
As a result, Streamlytics is developing new offerings to cater to the surge of businesses jumping into generative A.I., such as allowing organizations to choose between purely human-generated data, synthetic data, or a blend of both, all of which is collected consensually. 
In conversations with customers, Benton said there is “a high degree of concern regarding legal backlash from using scraped data.”
“While everyone is enthusiastic about A.I. no one wants to be sued,” she said. “So there is an extra layer of diligence, especially from larger organizations, that includes reviewing processes of how data is sourced and timelines for when data is purged.”
It’s ironic that the larger organizations that created the very models that kicked off this generative A.I. boom didn’t do so with the same level of concern or diligence. What’s more, these companies have nearly unlimited resources and therefore are most equipped to take the ethical route. 
Even ImageNet, the dataset containing millions of tagged images that single-handedly catalyzed the rise of A.I. after it was released in 2010, consisted largely of images scraped nonconsensually from the internet. From its modern beginnings, A.I. was built on stolen data, and now we’re entering its reckoning moment.
And with that, here’s the rest of this week’s A.I. news.
But first, a quick plug for Fortune’s upcoming Brainstorm A.I. conference in San Francisco on Dec. 11–12, where you’ll gain vital insights on how the most powerful and far-reaching technology of our time is changing businesses, transforming society, and impacting our future. Confirmed speakers include such A.I. luminaries as PayPal’s John Kim, Salesforce AI CEO Clara Shih, IBM’s Christina Montgomery, Quizlet’s CEO Lex Bayer, and more. Apply to attend today!
Sage Lazzaro
OpenAI releases ChatGPT Enterprise. The new offering can perform the same tasks as ChatGPT, but offers higher speed GPT-4 access, customization options, advanced data analysis capabilities, admin tools for managing how employees use it, and “enterprise-grade” security and privacy. Essentially, while inputting your company’s sensitive information into the original ChatGPT wouldn’t be a good idea, ChatGPT Enterprise is built specifically to allow businesses to do just that. In its blog post announcing the new version, OpenAI emphasized that it does “not train on your business data or conversations, and our models don’t learn from your usage.”
DoorDash launches A.I.-powered voice ordering for restaurants. Citing that 20% of customers prefer to order takeout via phone but that up to 50% of restaurant calls go unanswered, DoorDash announced a new feature that will couple the use of A.I. with live agents to ensure all customer calls are promptly answered. The company claims the technology will allow restaurant employees to focus more on in-store customers without missing the potential revenue from customers trying to call in takeout orders.
The National Archives unveils its plan to use A.I. for record management. The agency tasked with managing all U.S. government documents–the National Archives and Records Administration–disclosed its interest in tapping A.I. for auto-filling metadata and responding to FOIA requests, according to FedScoop. Most federal government agencies are required to disclose their A.I. use case inventories as a result of a 2020 executive order. 
Hugging Face raises $235 million from Big Tech. Google, Amazon, Nvidia, Intel, AMD, Qualcomm, IBM, and Salesforce, as well as Sound Ventures, all participated in the Series D round, which valued the popular model repository and MLOps company at $4.5 billion. Hugging Face is one of the most well-funded A.I. companies, coming in behind OpenAI, Anthropic, Inflection AI, and just a few others, according to TechCrunch. The inclusion of Nvidia is especially interesting (and beneficial for Hugging Face), as companies big and small are vying for the firm’s attention in order to secure its highly valuable H100 GPUs. Even before the funding round, Hugging Face and Nvidia already had a working partnership.
Alibaba’s cloud division announces two new A.I. models as it eyes an IPO. That’s according to CNBC, which reports that the new releases, Qwen-VL and Qwen-VL-Chat, can better understand images and carry out more complex conversations compared to Alibaba’s earlier models. The new models come from Alibaba’s Cloud Intelligence Group, one of the six business units the Chinese mega-company split into earlier this year, which is pushing A.I. to reinvigorate its business as it prepares to go public, according to CNBC. The company says Qwen-VL and Qwen-VL-Chat are open-source (though details that would reveal how open they truly are aren’t yet available), and indeed allowing developers to build on its models could create an easy onramp for the cloud group to win more business. 
Quizzing LLMs. If an LLM like ChatGPT were to sit for an exam, it’d cross its fingers (keys?) that the questions would come in short answer or essay format. That’s because, according to a new research paper out of Megagon Labs, LLMs are kind of terrible at answering multiple-choice questions. 
Citing previous research that showed LLMs are sensitive to the wording of prompts and the fact that multiple-choice questions are common for testing models, the researchers sought to understand how the ordering of answers would affect a model’s response. They conducted a series of tests using OpenAI’s GPT-4 and InstructGPT and found a “considerable performance gap” of approximately 13% to 75% across the series of questions they posed to the LLMs. Essentially, just changing the order in which the choices were arranged often caused the model to go from selecting the correct answer to selecting an incorrect one. 
Overall, the researchers found the sensitivity occurs when the model is unsure between the top-2 or top-3 options, and they seemingly uncovered a pattern to how the ordering affects which answer the model ultimately chooses. “For amplifying bias, we found that the optimal strategy involves positioning the top two choices as the first and last options. Conversely, to mitigate bias, we recommend placing these choices among the adjacent options,” they wrote in the paper.
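For readers who want to see what an ordering test looks like in practice, here is a rough Python sketch. It is not the researchers’ procedure: the function names are hypothetical, and `first_option_biased_model` is a toy stand-in for a real LLM call, built on the assumption that the model is torn between two options and picks whichever appears first.

```python
from itertools import permutations

def order_sensitivity(question, choices, correct, answer_fn):
    """Ask the same question under every ordering of the choices.

    Returns the fraction of orderings for which answer_fn picks `correct`.
    """
    orderings = list(permutations(choices))
    hits = sum(answer_fn(question, list(order)) == correct for order in orderings)
    return hits / len(orderings)

def first_option_biased_model(question, ordered_choices):
    """Hypothetical stand-in for an LLM call: torn between two options,
    it picks whichever of them appears earlier in the list."""
    top_two = {"Paris", "Lyon"}
    for choice in ordered_choices:
        if choice in top_two:
            return choice
    return ordered_choices[0]

accuracy = order_sensitivity(
    "What is the capital of France?",
    ["Paris", "Lyon", "Marseille", "Nice"],
    correct="Paris",
    answer_fn=first_option_biased_model,
)
print(f"correct under {accuracy:.0%} of orderings")
```

A model with no positional bias would score the same under every ordering. The gap between its best and worst orderings across a full question set is the kind of spread the paper reports.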
Major media organizations are putting up “do not enter” signs for ChatGPT —Rachyl Jones
Nvidia earnings hailed as historic moment for tech, but some warn A.I. is reaching fever pitch—‘this level of hype is dangerous’  —Chloe Taylor
China leaps forward in the A.I. arms race as Alibaba releases a new chatbot that can ‘read’ images —Paolo Confino
Hollywood shouldn’t entirely reject A.I.–it’s already delivering a new era of movie magic —Howard Wright
I pitted ChatGPT against a real financial advisor to help me save for retirement—and the winner is clear —Coryanne Hicks
Google’s three-day Cloud Next conference kicked off today in San Francisco, and already it’s off to quite a start with several new releases dropping by early morning.
The company announced new infrastructure tools optimized for A.I., including TPU v5e, the fifth generation of its tensor processing units for A.I. training and inferencing. With this version, Google is touting efficiency with a 2x improvement in training performance per dollar and a 2.5x improvement in inferencing performance per dollar, compared to the last generation. Overall, “Cloud TPU v5e consistently delivered up to 4X greater performance per dollar than comparable solutions in the market for running inference on our production ASR model,” reads the announcement blog post. Seeing as the high costs associated with training and then actually running A.I. models are among the greatest hurdles and barriers to entry, along with accessing training data and compute power, we’re likely to see even more focus on efficiency in future releases from Google and beyond.
Google additionally announced several new models and tooling available in its Vertex AI cloud platform, including models from Meta (Llama 2 and Code Llama), Anthropic (Claude 2), and Falcon LLM, a popular open-source model from the Technology Innovation Institute. This means companies will be able to use these models for their own purposes from within Google’s platform, positioning the company as an all-in-one platform where customers can fulfill their cloud needs and access the major models driving the generative A.I. boom.
Within Vertex, Google also announced digital watermarking powered by DeepMind SynthID. The company says this provides a “scalable approach to creating and identifying A.I.-generated images responsibly,” and claims it’s the first hyperscale cloud provider to offer this technology for A.I.-generated images. Digital watermarking has been increasingly thrown around as a solution to deciphering what’s human-made and what’s A.I.-made as our world quickly fills up with content generated by A.I., and this could be a first step in seeing if it actually works. 
Additionally, Google announced new upgrades to its Duet AI experiences for Google Meet and Google Chat. Perhaps the most interesting are the new A.I.-powered note-taking features, wherein the app will summarize a meeting in real time, provide action items, and save the notes as well as video clips of important moments from the meeting to Google Docs for future reference. If a participant is late to a meeting, they can even talk privately with a Google chatbot that will catch them up on what they missed, all while the meeting is still happening. Pretty much everyone agrees that meetings, well, suck. With features like these, we may soon be wondering if we need to have meetings at all. Or, if companies continue to have them, will we even need to show up?
This is the online version of Eye on A.I., a free newsletter delivered to inboxes on Tuesdays.
© 2023 Fortune Media IP Limited. All Rights Reserved.