OpenAI offers 20 million user chats in ChatGPT lawsuit. NYT wants 120 million. – Ars Technica

OpenAI asks judge to drastically limit NYT access to ChatGPT logs.
OpenAI is preparing to raise what could be its final defense to stop The New York Times from digging through a spectacularly broad range of ChatGPT logs to hunt for any copyright-infringing outputs that could become the most damning evidence in the hotly watched case.
In a joint letter Thursday, both sides requested to hold a confidential settlement conference on August 7. Ars confirmed with the NYT’s legal team that the conference is not about settling the case but instead was scheduled to settle one of the most disputed aspects of the case: news plaintiffs searching through millions of ChatGPT logs.
That means it’s possible that this week, ChatGPT users will have a much clearer understanding of whether their private chats might be accessed in the lawsuit. In the meantime, OpenAI has broken down the “highly complex” process required to make deleted chats searchable in order to block the NYT’s request for broader access.
Previously, OpenAI had vowed to stop what it deemed the NYT’s attempt to conduct “mass surveillance” of ChatGPT users. But ultimately, OpenAI lost its fight to keep news plaintiffs away from all ChatGPT logs.
After that loss, OpenAI appears to have pivoted and is now doing everything in its power to limit the number of logs accessed in the case—short of settling—as its customers fretted over serious privacy concerns. For the most vulnerable users, the lawsuit threatened to expose ChatGPT outputs from sensitive chats that OpenAI had previously promised would be deleted.
Most recently, OpenAI floated a compromise, asking the court to agree that news organizations didn’t need to search all ChatGPT logs. The AI company cited the “only expert” who has so far weighed in on what could be a statistically relevant, appropriate sample size—computer science researcher Taylor Berg-Kirkpatrick. He suggested that a sample of 20 million logs would be sufficient to determine how frequently ChatGPT users may be using the chatbot to regurgitate articles and circumvent news sites’ paywalls.
But the NYT and other news organizations rejected the compromise, OpenAI said in a filing yesterday. Instead, news plaintiffs have made what OpenAI said was an “extraordinary request that OpenAI produce the individual log files of 120 million ChatGPT consumer conversations.”
That’s six times more data than Berg-Kirkpatrick recommended, OpenAI argued. Complying with the request threatens to “increase the scope of user privacy concerns” by delaying the outcome of the case “by months,” OpenAI argued. If the request is granted, it would likely trouble many users by extending the amount of time that users’ deleted chats will be stored and potentially making them vulnerable to a breach or leak.
As negotiations potentially end this week, OpenAI’s co-defendant, Microsoft, has picked its own fight with the NYT over its internal ChatGPT equivalent tool that could potentially push the NYT to settle the disputes over ChatGPT logs.
According to the NYT, it’s necessary to search through 120 million ChatGPT user conversations. News plaintiffs want the opportunity not just to prove that infringing outputs may be happening frequently, but also to document any patterns showing spikes in infringement.
As OpenAI explained, the NYT and other news plaintiffs suing “insist that they should be entitled to conduct a full-scale analysis on every single month during the relevant 23-month time period—notwithstanding the burden—so that they can evaluate how the product has changed over time.”
OpenAI argued that the NYT shouldn’t be allowed to search for evidence of how “the prevalence of regurgitation changed over time.” That “kind of extraordinarily granular analysis is disproportionate to the issues in dispute,” they claimed. However, the news plaintiffs seemingly want to make the most of the access granted to search the logs to plead their best case.
There’s no telling if the judge who immediately granted the NYT such broad access, Ona Wang, will be sympathetic to OpenAI’s arguments at this stage of the battle. But OpenAI has stressed that by neglecting to limit the sample size, the court will be dragging out the case, since each user’s individual chat logs will take substantial time to make searchable:
Plaintiffs seek 120 million records from OpenAI’s offline storage system, which is composed of individual conversation logs. The logs are not rows in a spreadsheet; they are large, unstructured data files—meaning that they do not follow a predefined format—consisting of over 5,000 words, even for very short conversations. The logs must be decompressed before being searched and contain identifying information (e.g., addresses) and other private information (e.g., passwords) that must be scrubbed before making it available.
For OpenAI, this process is “highly complex,” requiring it to retrieve each log from “the tens of billions of logs in OpenAI’s offline data storage.” The company will then incur costs of storing those logs, making the NYT’s request for 120 million user conversations six times as expensive as OpenAI’s proposed 20 million-log sample.
“Each of these steps requires time, computational resources, and OpenAI engineers to design, debug, operate, and monitor the relevant systems,” OpenAI argued, estimating that 20 million logs would take 12 weeks, while 120 million logs would take 36 weeks to decompress and de-identify.
Because of this supposed burden, OpenAI has asked the court to deny the NYT’s request or else proceed with searching 20 million logs until news plaintiffs can “demonstrate that their ability to prosecute their claims will be materially prejudiced absent another sample.”
It’s unclear if the NYT will agree to limit the sample as part of this week’s settlement conference. But the NYT may be motivated to settle, as the newspaper has recently strongly opposed Microsoft’s requests to compel NYT reporters’ privileged logs from its internal alternative to ChatGPT, a service called ChatExplorer.
In its defense, the NYT has argued that Microsoft’s request is too broad, demanding more than 80,000 logs, including logs from journalists and NYT lawyers “who have nothing to do with this case.” If that defense sounds like OpenAI’s arguments over ChatGPT logs to you, don’t worry, the NYT explains why the two requests for chat samples are supposedly very different.
According to the NYT, its request for ChatGPT logs properly seeks “direct evidence of copyright infringement,” while Microsoft “does not need” to access ChatExplorer data, which allegedly might only be used to “support its substantial non-infringing uses and fair use defenses.”
Since the NYT has already provided evidence that shows that its journalists use “the accused products” for “transformative purposes” in service of Microsoft’s defenses—and Microsoft failed to tailor its request to certain employees or search terms—the newspaper has argued that Microsoft’s request would needlessly pull in privileged logs of 58 NYT reporters and lawyers without furthering those arguments.
It’s possible that the NYT’s defense is strong enough to give news plaintiffs leverage in the settlement that could come this week over ChatGPT logs. Recognizing that possibility could be the reason OpenAI CEO Sam Altman recently floated the idea of “AI privilege,” where any chats between users and chatbots are considered confidential, VentureBeat reported.

Jesse

https://playwithchatgtp.com