OpenAI Just Released Its Powerful New ChatGPT Agent
Isa Fulford, Casey Chu, and Edward Sun from OpenAI’s ChatGPT Agent team reveal how they combined Deep Research and Operator into a single, powerful AI agent that can perform complex, multi-step tasks lasting up to an hour. By giving the model access to a virtual computer with text browsing, visual browsing, terminal access and API integrations—all with shared state—they’ve created what may be the first truly embodied AI assistant. The team discusses their reinforcement learning approach, safety mitigations for real-world actions, and how small teams can build transformative AI products through close research-applied collaboration.
OpenAI’s ChatGPT Agent team discusses the origin, design and ambitions of this new system that unifies Deep Research and Operator capabilities for more powerful, flexible automation. The episode highlights both the technical advances and the collaborative approach that enabled the rapid development of this multi-tool agent, and emphasizes the challenges and opportunities as AI agents become more capable and integrated into real workflows.
Combining specialized agents unlocks new capabilities: By merging the strengths of Operator (visual, GUI-based actions) and Deep Research (text browsing and synthesis), the team achieved a system that can handle diverse, end-to-end tasks—everything from online shopping to spreadsheet automation.
Shared state and tool orchestration are key breakthroughs: The agent’s ability to switch seamlessly between tools—browser, terminal, APIs and more—with shared state mimics how humans use computers, dramatically increasing flexibility and efficiency.
Human-in-the-loop and collaborative design matters: The system allows for ongoing, interruptible interaction, enabling users to clarify, redirect or take over tasks mid-process, which leads to more robust and user-aligned outcomes.
Safety and real-world integration introduce new challenges: Granting agents the power to take actions with real-world effects required substantial investment in guardrails, monitoring and cross-team safety processes, especially given the risks of automation and internet interaction.
Small, cross-functional teams with rapid iteration can drive major leaps: The project was delivered by a surprisingly small, tightly integrated team, with blurred lines between research and engineering, showing that ambitious AI initiatives can move quickly through deep collaboration and clear product grounding.
Isa Fulford: I think this model is actually very good at multi-turn conversations, and it’s very nice to continue working on a task with it. I think that’s one of the deficiencies of Deep Research. A lot of people will do multiple Deep Research requests in a single conversation, but it doesn’t always work so well, so I think we’re really happy with this model’s multi-turn ability, and we just want to improve even further. And then I also think personalization and memory for agents will also be very important, and right now every agent task is initiated by the user, but in the future it should also be doing things for you without you having to even ask in the first place.
Lauren Reeder: Today we’re exploring the evolution of AI agents with Isa Fulford, Casey Chu and Edward Sun, the OpenAI team behind the new ChatGPT agent. You’ll learn how they achieved a huge leap forward in capability by unifying the architecture across Deep Research and Operator, allowing multiple tools to share state and giving users fluid transitions between visual browsing, text analysis and code execution, all within a single environment. We’ll discuss their training approach: rather than programming specific tool usage patterns, they let the models discover the optimal strategies through reinforcement learning across thousands of virtual machines. They’ve created an agent that can work alongside you for hours, asking clarifying questions and accepting mid-task corrections, expanding the ways that we can interact with AI agents. The team shares fascinating challenges around safety and guardrails for agent activities, and why things like date picking still remain mysteriously difficult for AI systems. They reveal how small, focused teams are achieving breakthrough capabilities through careful data curation, suggesting that we’re now entering a new phase of AI development where product insights matter just as much as compute power. Enjoy the show.
Lauren Reeder: Isa, Casey, Edward, thank you for joining us today.
Isa Fulford: Thank you so much for having us.
Casey Chu: Thank you.
Lauren Reeder: So you’re the team behind the ChatGPT agent or agent mode. What is it?
Isa Fulford: Yeah, so this has been a collaboration between the former Deep Research and Operator teams. And we’ve created a new agent in ChatGPT that’s able to carry out tasks that would take humans a long time. And we gave the agent access to a virtual computer, and through that it has two different ways to access the internet—actually, more ways, but we’ll get to that. It has a text browser, which is similar to Deep Research tools, so it’s able to efficiently access information online and search through things with this very fast text browsing tool. And then it also has a virtual browser which is similar to the Operator tool, so it actually has full access to the graphical user interface, and it’s able to click and type things into forms and scroll and drag and all these kinds of things. So together it’s much more powerful than either of those two tools because one’s more efficient and one’s much more flexible.
And then we also gave it access to a terminal, so it’s able to run code and analyze files and create artifacts for you like spreadsheets or slides. Also, through the terminal, it’s able to call APIs, either public APIs or private APIs. If you sign in, it could access your GitHub or Google Drive, SharePoint, many other things. And the cool thing is all of the tools have shared state, so it’s similar to if you’re using a computer, like, all of your different applications have access to the same file system and things like that. It’s the same for the agent, so the model can do quite flexible things. And yeah, we’ll talk more about this later, but I think it’s just a very flexible way for the model to do very complex tasks on behalf of users.
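To make the shared-state idea concrete, here is a minimal illustrative sketch, not OpenAI’s actual implementation: several tools operate over one virtual machine, so a file the browser downloads is immediately visible to the terminal because both point at the same workspace. All class and method names below are hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class VirtualMachine:
    """Hypothetical shared environment: one workspace and one session state for every tool."""
    workspace: Path = field(default_factory=lambda: Path("/tmp/agent-vm"))
    cookies: dict = field(default_factory=dict)  # shared browser session state

    def __post_init__(self):
        self.workspace.mkdir(parents=True, exist_ok=True)


class TextBrowser:
    """Fast, text-only browsing (stubbed): saves page text into the shared workspace."""
    def __init__(self, vm: VirtualMachine):
        self.vm = vm

    def download(self, url: str, filename: str) -> Path:
        path = self.vm.workspace / filename
        path.write_text(f"(stub) contents of {url}\n")
        return path


class Terminal:
    """Commands run with the shared workspace as the working directory,
    so they can read anything the browser downloaded."""
    def __init__(self, vm: VirtualMachine):
        self.vm = vm

    def run(self, command: str) -> str:
        return f"$ cd {self.vm.workspace} && {command}  (stubbed, not executed)"


vm = VirtualMachine()
browser, terminal = TextBrowser(vm), Terminal(vm)
report = browser.download("https://example.com/report", "report.txt")
print(terminal.run(f"wc -l {report.name}"))
```

In the real agent the browsers and terminal run inside an actual virtual computer; the point of the sketch is only that every tool sees the same file system and session.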
Lauren Reeder: Tell us a little about the origin story. How did this get started?
Casey Chu: Well, our team worked on Operator.
Isa Fulford: And our team worked on Deep Research.
Casey Chu: And so back in January, we released our first agent, Operator. This is a product that can do internet tasks for you, like buy things on the internet, shop for you, this kind of thing. And then two weeks later …
Isa Fulford: We released Deep Research, which is a different model that’s—or different product that’s able to extensively browse the internet, synthesize information, and it creates a long research report with citations for you.
Casey Chu: And we were kind of thinking through our roadmap and we were kind of like, “Hey, this is kind of a match made in heaven here.” So, you know, Operator is really good at visual interacting with a web page, but it’s less good at kind of the text browser, like, reading long articles. Whereas Deep Research is really good at reading long articles, but it has a tougher time with interactive elements or highly visual things.
Isa Fulford: Because the tools are different. So Deep Research has a text browser, so it’s able to really efficiently read information and search and synthesize information, but it’s not able to, like, scroll and click in the same way or fill out forms in the same way that Operator is, because it has actually full access to the GUI browser. And as Casey was mentioning, like, Deep Research has some things that Operator doesn’t have. And then similarly, one of the biggest requests for Deep Research is for the model to be able to access paywall sources or things that you have to pay a subscription for, and Operator is able to do that.
Casey Chu: And also, one of the members of our team, Eric, he was running an analysis on the types of prompts that people were trying on Operator. And we realized that it was a lot of Deep Research-type tasks, like “Research this trip for me, then book it.” So it really is a natural combination.
Sonya Huang: In what way is one plus one equals three?
Edward Sun: So on Deep Research, we always wanted to figure out how to give it access to a real browser that can load all the real content that the previous version of Deep Research couldn’t access.
Casey Chu: It’s funny that you bring up the “one plus one equals three” because not only did we combine Deep Research and Operator, but we also threw in a bunch of other tools that—basically, like, everything we could think of. So, like, a terminal tool is there, so it can run commands, do calculations. The Image Gen tool is a fun one. If it wants to spruce up its slides by making an image, it can do that.
Isa Fulford: Can call APIs.
Lauren Reeder: It can produce PowerPoints.
Isa Fulford: Yes. Yeah, it can do a lot of different things.
Lauren Reeder: Yeah, tell us a little bit how people are using it, knowing it’s still early days.
Isa Fulford: So I think the cool thing about it is we have some ideas of how we think people are going to use it, but I think we intentionally kept it quite open-ended. I mean, it’s called Agent; that’s so vague. Partially because we are excited to see how people end up using it. So I think some of the things that we specifically trained it for were, of course, Deep Research-type tasks, so things where you want a long report on a topic, Operator-type tasks where you want it to do something for you, like book something or book a flight, buy something for you. And then also tasks to make slide decks. We also, you know, spent a lot of effort on making spreadsheets and doing data analysis, but I think there are also just so many other things the model can do, so we’re just excited to see how people use it. Kind of similarly to how when we launched Deep Research, we saw a lot of people using it for code search, which was really surprising to us. We’re hoping to see a lot of new use cases that we didn’t even think of ourselves.
Sonya Huang: Would you guess it’d be more consumer or kind of B2B-type use cases? Or is that a false dichotomy?
Isa Fulford: Hopefully both.
Sonya Huang: Okay.
Casey Chu: I think we’re kind of aiming for the prosumer, like, someone who’s willing to wait 30 minutes for a detailed report, but that can be in the consumer case or at your job. I think it could be good for both.
Lauren Reeder: Do any of you have favorite things you’ve used it for?
Edward Sun: For me, it’s more like pulling data from our spreadsheets or Google Docs, like a document in our expanded log, and then making some slides to present the data or, like, organize the data. It’s pretty useful.
Casey Chu: I’ve been doing a deep dive into ancient DNA.
Lauren Reeder: [laughs]
Casey Chu: It’s just one of my interests. And there’s actually a lot of exciting work going on these past five years. They’re sequencing all this DNA and, like, discovering all these facts about where a group of people came from and historical stuff. The problem is that everything is so new that there isn’t a reference source that summarizes or surveys these materials. But Agent can go out and pull together all these sources and synthesize them into a report that I can read or slides that I can read. And I think it’s kind of made for this topic.
Isa Fulford: Yeah. I like it for consumer-use cases. Like, I’ve used it for online shopping. I think especially because a lot of websites require using a visual browser because it will have a search filter or something that it needs to go through or the model actually needs to be able to see what the item looks like. And then also for planning events, it’s been pretty useful.
Sonya Huang: What’s your favorite shopping query?
Isa Fulford: I think I was using it for clothes shopping.
Sonya Huang: [laughs] Love it. And you guys also showed us a really cool use case right before we filmed this episode. Do you want to share that one?
Isa Fulford: Yeah, sure. So that was actually something that one of our coworkers shared with us. She asked the agent to estimate OpenAI’s valuation and create—based on things that it found online, create a financial model with projections. So it creates a spreadsheet, also creates a summary analysis, and then also creates a slide deck presenting the results. And so hopefully the model is correct because it had quite an ambitious projection for us.
Sonya Huang: It was an impressive slide deck.
Isa Fulford: Yeah, it was a good slide deck.
Casey Chu: One thing I want to point out about this trajectory was that it reasoned for, I think, 28 minutes. And yeah, I think this is kind of opening up a new paradigm where you ask the agent for a task and then you step away and it comes back with a report. And yeah, I think as agents become more agentic, it’ll be longer and longer tasks. And this is a good example of one.
Sonya Huang: Are these the longest running tasks you guys have launched so far?
Casey Chu: I would say so. Like, I just did one that was an hour long, and I don’t think I’ve ever seen that.
Isa Fulford: I don’t know how long Codex can run for.
Casey Chu: That’s true, yeah.
Sonya Huang: Is there anything special that goes into making an agent run for so long without kind of flying off the rails?
Edward Sun: We have some tools that enable the model to further extend its context length beyond the original hard limit, so that the model is able to perform tasks by documenting what it’s doing, and step by step kind of increase the time horizon of the tasks it can do without human interruption.
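Purely as an illustration of the “documenting what it’s doing” idea, here is a rough sketch: when a long task approaches a context limit, older steps get compacted into a running summary so the agent can keep working. The `summarize` and `count_tokens` hooks are stand-ins, not real API calls.

```python
def compact_history(history, summarize, count_tokens, max_tokens):
    """Fold the oldest steps into a running summary once the history grows too long."""
    while sum(count_tokens(step) for step in history) > max_tokens and len(history) > 2:
        oldest, rest = history[:2], history[2:]
        note = summarize("Progress so far:\n" + "\n".join(oldest))
        history = [f"[summary] {note}"] + rest
    return history


# Toy usage with trivial stand-ins for the summarizer model and the tokenizer.
history = [f"step {i}: searched, clicked, read a page, took notes" for i in range(50)]
history = compact_history(
    history,
    summarize=lambda text: text.replace("\n", " ")[:80] + "...",
    count_tokens=lambda s: len(s.split()),
    max_tokens=120,
)
print(len(history), "entries remain;", history[0][:60])
```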
Lauren Reeder: The flow of going back and forth between the model and the human is also very nice. So I can correct it as it’s going, right?
Isa Fulford: Yeah, so this model is very flexible and collaborative, and that was very important to us. So it’s modeled after how you would interact with someone if you ask them to do a task for you. So imagine you’re asking someone on Slack to do something for you, you’d probably give them instructions and then they’d ask you some questions, and then maybe start doing the task. And then maybe in the middle of the task they’ll say, “Oh, actually, can you clarify this for me?” Or “Can you sign into this thing for me or am I allowed to do this for you?” And similarly, you might remember something that you forgot to say when you first gave them the task, and you might want to interrupt them and just say, “Oh, hey. Please also do this.” Or you might want a status update if they’re taking a long time to do it, or you might want to redirect them if they’re going on the wrong path.
So that’s kind of what we modeled it after, and I think it’s very important that the user and agent are both able to initiate communication with each other. So I think what we have now is probably the most basic version of what this could be, but it’s better than anything we’ve released before in this area because at first the model can—or the agent can ask you clarifying questions similar to Deep Research, but it’s more flexible so it doesn’t always ask you clarifying questions. And then you can interrupt the model. So you can say, “Oh, can you summarize what you’ve done so far?” Or, “Oh, I forgot to say I actually only want blue sneakers.” And then if the model is going to take some kind of destructive action or if it needs you to log into something, it will also ask the user if it’s allowed to do that before doing anything.
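As a purely illustrative sketch of that interaction pattern, the toy loop below lets the user inject messages between steps and gates “destructive” actions behind an explicit confirmation. The policy and function names are hypothetical, not the product’s actual interface.

```python
import queue

user_inbox: "queue.Queue[str]" = queue.Queue()  # user -> agent messages sent mid-task


def needs_confirmation(action: str) -> bool:
    # Placeholder policy: purchases, logins and deletions need the user's go-ahead.
    return any(word in action for word in ("purchase", "log in", "delete"))


def run_task(plan, confirm):
    for action in plan:
        # Fold in anything the user said since the last step ("only blue sneakers!").
        while not user_inbox.empty():
            print("incorporating user note:", user_inbox.get())
        if needs_confirmation(action) and not confirm(action):
            print("paused, waiting for the user:", action)
            continue
        print("executing:", action)


user_inbox.put("Actually, I only want blue sneakers.")
run_task(
    ["search for sneakers", "compare prices and reviews", "purchase the best pair"],
    confirm=lambda action: False,  # stand-in; the real flow would ask the user
)
```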
Casey Chu: On this topic, we kind of built this kind of computer interface—you guys saw it—where you can kind of watch along with what the agent is doing. And that actually persists for beyond the conversation. So, like, once it’s done with the task, you can actually go back and ask it follow-up questions, and ask it to fix something or do another task. And you can also take over that computer. So you can click in, and then now you have access to its environment and you can, like, click for it or log in for it or insert your credit card information or things like that. And so yeah, I like to think of it as, like, looking over your coworkers’ shoulders and being able to take over if necessary.
Sonya Huang: Thank you, Agent for enabling the micromanager in me.
Lauren Reeder: [laughs]
Sonya Huang: Just kidding.
Lauren Reeder: So we’d love to talk a little about how this works to the extent that you can share.
Edward Sun: Yeah, so this agent is trained with the same technique as o1, with reinforcement learning. So we give the agent model all the tools we have implemented in the same virtual machine, like the text browser, the GUI browser, the terminal and the image gen tool. And then the model will try to solve the tasks that we created, which are pretty hard tasks that the model has to complete using these tools. And then we reward the model if it completes the task efficiently and correctly.
And after this training, the model should learn to switch between these tools fluently. For example, if you ask the model to research some restaurants and maybe book a spot for you, it will first do Deep Research-style text-based browsing, and then it will probably also use the GUI browser to view the images of the food and also view the availability, which is usually rendered in JavaScript, so you have to use a real GUI browser. And then, for example, if you ask it to create an artifact, it usually can pull sources from a website and then use them in the terminal.
Isa Fulford: Yeah, I think the cool thing about this tool compared to tool use implementation in the past is that all of the tools have shared state, so it’s like when you’re using your computer and you have many different applications, you know, like, if you download something, it’s going to be accessible to other applications. It’s very similar. So the model can open a page in the text browser, which is more efficient, but then maybe it realizes it needs the visual browser, so it can just seamlessly switch, or it could download something using the browser and then in terminal it manipulates it or something like that. It can run something in terminal and then open it in the browser. It’s very flexible. And so it’s just giving the model a more powerful way of interacting with the internet and files in its file system and code and things like that.
Casey Chu: Yeah. And one interesting thing to emphasize is that, like, we essentially give the model all these tools, and then lock it in the room and then it experiments. You know, we don’t really tell it when to use what tool. It kind of figures that out by itself. It’s kind of almost magic.
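As a rough, hypothetical sketch of that training setup, the loop below rolls the agent out on a task, grades the result, and rewards correct, efficient completions. The policy, rollout, grader and update are all stand-ins; the real RL machinery is far more involved.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    steps: list          # tool calls the agent made inside its virtual machine
    final_answer: str    # what it handed back to the user


def reward(trajectory: Trajectory, task: dict, grade) -> float:
    correct = grade(task, trajectory.final_answer)          # did it solve the task?
    efficiency_bonus = 1.0 / (1.0 + len(trajectory.steps))  # fewer steps is better
    return float(correct) + 0.1 * efficiency_bonus


def train_step(policy, tasks, rollout, grade, update) -> float:
    rewards = [reward(rollout(policy, task), task, grade) for task in tasks]
    update(policy, rewards)  # e.g. a policy-gradient update, stubbed here
    return sum(rewards) / len(rewards)


# Toy usage with trivial stand-ins for the policy, rollout, grader and update.
tasks = [{"prompt": "find the cheapest direct flight", "answer": "flight A"}]
mean_reward = train_step(
    policy=None,
    tasks=tasks,
    rollout=lambda policy, task: Trajectory(steps=["search", "compare"], final_answer=task["answer"]),
    grade=lambda task, answer: answer == task["answer"],
    update=lambda policy, rewards: None,
)
print(round(mean_reward, 3))  # 1.033: solved the task in two steps
```

Notice that nothing in the sketch tells the agent which tool to use when; as Casey says, that strategy is something the model has to discover through the reward itself.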
Sonya Huang: Is the technique—it sounds very similar to Deep Research. We had you on the podcast before. Should we think about this as the standard technique of how OpenAI thinks that agents will be trained going forward?
Isa Fulford: I think we can take this really far. Our teams haven’t been collaborating for that long. We even framed this model run as kind of minimum shippable de-risk. That was mostly for PR reasons internally, but this is really the most basic version we could make together. And I think we have so much further we could push this with these methods. For example, the slides capability is a new capability. It’s already impressive. It’s great work from Aiden, Paloma, Martin, a bunch of other people. But there’s so much further we can push that and improve using the same techniques. So I think we can take it further, but we probably need other things too.
Edward Sun: Yeah, I feel so far it’s pretty magical. Like, the same RL algorithm just works: o1 with reasoning, then Deep Research with tool calls, and now a more advanced computer use, browser use agent.
Lauren Reeder: Where does it run into the limits with this strategy, and with this model specifically as well?
Isa Fulford: I think the interesting thing with this model is that because it’s able to take actions with external side effects, there’s a lot more risk. So for Deep Research, it was read only, so there’s kind of a limit to what the model could do in terms of data exfiltration and other things. But with this, in theory, the model could successfully complete a task, but take a lot of harmful actions along the way. Like, you could ask it to buy you something, and it decides to buy just like a hundred different options to make sure that you’re satisfied.
Lauren Reeder: Go on a shopping spree.
Isa Fulford: Exactly. Or you can think of many examples like that. So I think that safety and safety training and mitigations was kind of one of the really core parts of our process with this model. And yeah, maybe Casey can talk more about it.
Casey Chu: I was going to mention that kind of along the same lines, it’s like this contact with the real world that makes things difficult. You know, we have to train this on a bunch of VMs, it’s like thousands of VMs, maybe. And things break, and as soon as you’re hitting a real website, the website’s down or you’re hitting all these capacity limits and load testing and this kind of thing. Yeah, it’s really the very beginning, and we’re going to iron out all these details and continue, but that’s a major limitation.
Sonya Huang: How do you think about it from the safety perspective, building in the right guardrails, and how do I make sure the model’s not, you know, logging into my bank account and sending it all off to a Nigerian prince?
Casey Chu: Yeah, that’s a very good question. Yeah, this is definitely an emerging risk where, you know, the internet’s a scary place. There are a lot of attackers and scammers and this kind of thing, phishing attacks, like, the list goes on and on. And yeah, our model is a bit like—it can reason about these things like if you tell it to be careful. You know, we’ve done some safety training to make this more robust, but sometimes it can get fooled, and sometimes it is a bit too overeager to complete your task. We have a long list of mitigations, and the team has worked really hard to stack together a bunch of techniques to really try to make the model as safe as possible.
So one example that I’ll call out is that we have a monitor that kind of looks over its shoulder and just sees if anything looks funny, like whether it’s going on a weird website or anything like this—kind of like antivirus for your computer. It’s, like, just kind of persistently watching. And then if it looks like there’s anything suspicious, then it’ll stop the trajectory right there. Of course, we can’t catch everything, and this is a major area that we’ll continue to iterate on. We do have a protocol so that if there are new attacks in the wild that we discover or encounter, we can rapidly respond and update these monitors, kind of like you would update your antivirus software, so they pick up on these new attacks and hopefully keep you safe.
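A toy sketch of that kind of monitor, purely for illustration: a simple rule list plus an optional learned classifier that can halt a step before it executes. The patterns and hooks here are made up, not the actual safeguards.

```python
import re

SUSPICIOUS_PATTERNS = [
    r"verify your (bank|card) details",
    r"gift ?cards?",
    r"wallet seed phrase",
]


def looks_suspicious(page_text: str, classifier=None) -> bool:
    if any(re.search(p, page_text, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS):
        return True
    # A learned monitor could also score the page; omitted (None) in this sketch.
    return bool(classifier and classifier(page_text) > 0.9)


def guarded_step(action: str, page_text: str, execute):
    if looks_suspicious(page_text):
        return f"halted before '{action}': page flagged for review"
    return execute(action)


print(guarded_step(
    "enter payment details",
    "URGENT: verify your bank details to claim your prize",
    execute=lambda a: f"executed: {a}",
))
```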
Isa Fulford: Yeah, I think the cool thing about the safety training is that it’s been a really cross-org effort from the safety team, governance team, legal team, research team, engineering team, like, so many others. And we have so many mitigations at every single level. We did a lot of external red teaming, internal red teaming, but yeah, as Casey mentioned, surely when we release the model, there will be new things we uncover. So we just need to make sure we also have robust ways of detecting those and then mitigating those.
Lauren Reeder: For some of these models, there’s a risk of what you can do with the models, whether it’s creating biohazards or otherwise. How do you guys manage some of that?
Casey Chu: Yeah, actually, bio has been heavily on our mind. The team has been really thoughtful about this. This agent, we think it’s very powerful. It can do research, it can really speed up your work, but that also means it could speed up harm. And one of the top things that our team has been looking into is bio-risk, so creating bioweapons, this kind of thing. The team has been really thoughtful about how to mitigate against this, and generally being very cautious. We did, like, many weeks of red teaming to make sure that this model could not be used for those harms, and we have a bunch of other mitigations in place. Shout out to Karen who spearheaded this effort. And yeah, in general, I think we’re very aware of this and just trying to be very cautious.
Lauren Reeder: Yeah, makes sense. Tell us a little about the team that came together to build this.
Isa Fulford: So as Casey mentioned earlier, we had the Deep Research research team and the Deep Research applied team, and the Operator research team, or computer use agent research team, and the Operator applied team. And we effectively merged everybody. We all work really closely together, both the research team and the applied team.
Casey Chu: And the vibes have been great.
Isa Fulford: It’s been so fun! [laughs]
Casey Chu: Isa and I have been friends for a long time. So, like, it was a natural, like—it was great.
Isa Fulford: It was very fun.
Sonya Huang: How many of you are there?
Isa Fulford: On Deep Research, for the majority of the time, three or four. Now we have some new people, which is very exciting. And then on CUA …
Casey Chu: On CUA, I think around six to eight, somewhere around there.
Isa Fulford: On the research side. And then we have an amazing applied team, so engineering, product and design, led by Yash Kumar, and he has this really crack engineering team. So it’s been very fun to work really closely. I think one thing that has made this collaboration really special is that the research and applied teams work so closely. And even from the beginning, when we’re defining what the product should be able to do, it’s very much a collaboration between research and product and design. So we go backwards from the use cases we want to be able to solve to training the model and building the product. And obviously it’s not able to do all of those things fully yet, and it can do some things that we didn’t plan, but I think it’s a good framework for us when we’re starting a project. It’s very grounded in how we want people to use it in the real world.
Sonya Huang: It’s a way smaller team than I was expecting. Small teams can do amazing things.
Lauren Reeder: Yes, you’ve built a lot.
Isa Fulford: Yeah, and we haven’t been working together for very long.
Casey Chu: It’s been a few months.
Isa Fulford: Yeah.
Edward Sun: And actually the boundaries between the research team and the applied team are not, you know, very rigid, because during the model training a lot of applied engineers are helping us train the model. And also after we trained the model, some research team members are working on the new setup of the model and deploying the model to real users.
Sonya Huang: What was the hardest part about training this agent?
Edward Sun: Yeah, I think one of the biggest challenges we had is how to make training stable, especially given that when we trained Deep Research, it was only using browsing and Python. Those are pretty mature tools, like, we’ve been using them for a while. But when training the agent model, it has some new tools like the computer and also the terminal, bundled in the same container, in the same virtual machine as a computer. So it’s actually quite challenging because we are literally setting up hundreds of thousands of virtual machines at the same time, and then they all visit the internet. So that’s one of the biggest challenges. Sometimes the training will actually fail, but we’re very happy that we finally got this model trained.
Sonya Huang: So VMs.
Lauren Reeder: [laughs] All back to the engineering. Tell us about what’s next. More sources, more tools, better model? How do you think about it?
Casey Chu: Well, I think one thing I like about our agent framing is that you can ask it to do whatever you want, and you can ask it to do, like, every possible task you can imagine. It just might not do it well.
Sonya Huang: Could you tell it, like, “Go make me money on the internet?”
Casey Chu: You can tell it that.
Isa Fulford: It’ll try.
Casey Chu: It’ll try.
Sonya Huang: Should we try that right after this?
Casey Chu: [laughs] Let’s do it. But yeah, I think it’s really a matter of, like, improving the accuracy, the performance of tasks of, like, the whole distribution of tasks.
Isa Fulford: That anyone does on a computer.
Casey Chu: Right.
Isa Fulford: Which is a lot of tasks.
Edward Sun: And yeah, so this is an iterative deployment. We are very excited to see, like, what new capabilities our users will find in our agent, you know, like the coding ability people found in Deep Research, or the Deep Research ability in Operator.
Isa Fulford: You were using the agent mode for coding.
Edward Sun: Yeah, I use it for coding a lot because I feel it’s actually not always trying to rewrite my whole code base. It just makes some small edits, and it also reads the original docs of different functions pretty well. So I feel it hallucinates less on function calls.
Lauren Reeder: Oh, interesting. How do you choose when to go to Codex versus when to go to agent for that?
Edward Sun: For the agent, I treat it more similarly to how I use o3. So it’s more like an interactive experience. For Codex, it’s more like you have some well-defined problems that you want a coworker to solve, and then it will make a PR for you. But the agent is more like it just gives you a function or gives you a suggestion.
Lauren Reeder: Cool.
Isa Fulford: And it can do code search because it can access GitHub through the API connector, so code search kind of things.
Lauren Reeder: Yeah.
Sonya Huang: It almost feels like, you know, the agent roadmap up until now, you’ve built the different almost like appendages of what it would take to have an agent. And by combining them all, this really is like the first fully embodied agent on the computer. I think it’s very exciting.
Isa Fulford: Yeah, I think another area that we’re excited to push on is the experience of collaborating with the agent. I think this model is actually very good at multi-turn conversations, and it’s very nice to continue working on a task with it. I think that’s one of the deficiencies of Deep Research. A lot of people will do multiple Deep Research requests in a single conversation, but it doesn’t always work so well. So I think we’re really happy with this model’s multi-turn ability, and we just want to improve even further. And then I also think personalization and memory for agents will also be very important. And also right now, every agent task is initiated by the user, but in future it should also be doing things for you without you having to even ask in the first place.
Casey Chu: Yeah, I’m also pretty excited about the UI and UX surrounding the agent. Because right now I think, you know, obviously we’re working in a ChatGPT world. It’s like you start a conversation and it goes. So you can imagine a lot of different modes of interacting with an agent, and I’m really excited to explore different ways of interacting with the agent.
Sonya Huang: Do you see this as always being a kind of single omniscient super agent, or will there be the financial analyst sub-agent and the personal party planner sub-agent? Like, what’s your vision for how that kind of plays out?
Isa Fulford: I think people have different opinions on this. I think in the limit, if you could just ask one thing and it can figure out what it needs to do to finish the thing that you want it to do for you, that seems like it would be easiest. Like, if you just had a really amazing chief of staff who knows how to route things correctly and basically can do anything you need, that seems like it would be pretty easy.
Casey Chu: I think I agree with that take. And even in some of our trajectories where, I don’t know, you’re asking about, I don’t know, maybe a shopping task, sometimes it’ll go into terminal and do some calculations, you know, budget. And I think the model should be free to use all the tools it wants. It doesn’t need to be a financial analyst to have the financial analyst tool set.
Edward Sun: Yeah. I feel like when you launch the product, it sometimes makes sense to have some GPTs, like a customized model or customized instructions to put the model into a specific role. But in general, when training the model, there’s lots of positive transfer between Deep Research, Operator and also slide generation. Like, all of these skills are transferable. So it makes much more sense to just have a single agent, you know, as the underlying base model.
Sonya Huang: Totally. I guess even though, you know, people do different types of work, we’re all fundamentally—we’re sending emails, we’re making slide decks, we’re doing a lot of the same work in front of a computer. I’d love to understand some of the learnings from the reinforcement learning perspective. Like, it seems like that’s the method that seems to really be working for you guys with agents. Was it very data intensive to kind of get to this point of having an agent that’s so good at such a wide variety of tasks, or what were some of the learnings from an RL perspective?
Edward Sun: Yeah. So actually, we created a bunch of very diverse sets of tasks, like some tasks to find very niche topics or very niche answers on the internet, or some tasks very similar to Deep Research, like you need to write a whole detailed article. And also lots of other tasks, just all the tasks that we want the model to be good at. And so far we think that as long as you can grade the task, where after the model gives you a result you can judge whether the rollout or the model’s performance is good or not, you can kind of reliably train the model to be even better on that task.
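To illustrate what “gradable” can mean, here is a small hypothetical sketch: fact-finding tasks get checked against a known answer, while open-ended report tasks get scored against a rubric by a judge model (stubbed below). None of this is the team’s actual grading code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    prompt: str
    expected_answer: Optional[str] = None  # set for niche fact-finding tasks
    rubric: Optional[str] = None           # set for open-ended report tasks


def grade(task: Task, output: str, judge: Callable[[str, str], float]) -> float:
    """Return a score in [0, 1] that an RL loop could use as a reward signal."""
    if task.expected_answer is not None:
        return 1.0 if task.expected_answer.lower() in output.lower() else 0.0
    return judge(task.rubric or "", output)  # e.g. another model scoring the report


fact_task = Task(prompt="In which year was the transformer paper published?",
                 expected_answer="2017")
report_task = Task(prompt="Survey recent ancient-DNA findings.",
                   rubric="cites sources; covers migrations; no fabricated claims")

toy_judge = lambda rubric, output: 0.8  # stand-in for a model-based grader
print(grade(fact_task, "It was published in 2017.", toy_judge))  # 1.0
print(grade(report_task, "Draft report text ...", toy_judge))    # 0.8
```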
Lauren Reeder: Was there anything special you needed to do to make sure it had good turn-by-turn interaction with users when doing that training? Or was it just about the type of trajectories you collected?
Edward Sun: Yes. So I think most of the time we focused on end-to-end performance, like from the way the prompt is specified to how the task is accomplished.
Lauren Reeder: And somehow it’s very good at working with users.
Isa Fulford: To your question, the reinforcement learning is very data efficient, so that means that we’re able to curate a much smaller set of very high quality data. The scale of the data is just so minuscule compared to the scale of pre-training data, so we’re able to teach the model new capabilities by just curating these much smaller high quality data sets.
Casey Chu: I will say, to get the Operator piece to work well, you know, before we do RL, the model has to be good enough to achieve basic completion of tasks. And our team has spent a lot of time over the past two, maybe three years getting the model to that point where it’s able to actually reason about a page and, like, kind of understand the visual elements really well. So this model is built on all that as well.
Sonya Huang: Actually, could you say a little bit more about that? Because I remember early days of OpenAI, this was always part of the World of Bits stuff and you’re trying to RL the mouse paths, and it was just way too unbounded of a problem. What’s changed now for that to be solvable?
Casey Chu: Yeah, that’s great that you point out the World of Bits. This project does have a very long lineage dating back to 2017 or so. Actually, our code name is “World of Bits 2” for the computer use part.
Sonya Huang: [laughs] That’s awesome.
Casey Chu: And yeah, what’s changed? I think essentially the scale of the training has changed. I don’t know the multiplier, but it must be like 100,000x or something in terms of compute, the amount of training data we’ve done both in pre-training and RL. So yeah, I really think it’s just scale, and the scale catching up to our ambition, I guess.
Sonya Huang: Wow. Scale is all you need.
Casey Chu: I believe it.
Lauren Reeder: And some good data. [laughs]
Sonya Huang: Are there particular capabilities or functionalities that you’re especially excited about in agent mode?
Edward Sun: Yeah, so this model is actually pretty good at doing some real research, like data science, and also, you know, summarizing the reports or the findings in a spreadsheet. So we have some evaluations, like a data science benchmark. We evaluate the model, and it actually outperforms the human baseline. So in some sense, it’s actually superhuman on some research tasks. We can rely on the model to perform some basic analysis for us.
Isa Fulford: And this is an area that Jon Blackman on our team was really pushing on, like, spreadsheets and data science, so shout out, Jon.
Sonya Huang: Spreadsheets and data science. You are automating us out of a job over here. Elevating us out of a job.
Isa Fulford: Elevating, yeah. Enhancing.
Sonya Huang: [laughs]
Casey Chu: Another thing I’m excited about is, you know, we released Operator in January, and, you know, it was decent at clicking around, but I think we’ve substantially improved that capability where it’s like much more accurate, and just kind of getting the basic things right is what I’m actually excited about, where it can reliably fill out a form and, you know, do those kind of things.
Isa Fulford: Date picking?
Casey Chu: Date picking? Date picking still needs a bit of work.
Isa Fulford: For some reason, date picking is just the hardest task.
Sonya Huang: It’s hard for a human, too!
Isa Fulford: Like, picking a date in the …
Lauren Reeder: The calendar dropdown?
Isa Fulford: Yeah.
Sonya Huang: Okay, last question. It seems like you guys have the overall framework and structure in place for something really interesting here. What’s ahead? Where do you go from here?
Isa Fulford: I think the thing that we’re really excited about is that this tool that we’ve given the model access to is very general. It’s basically most of what you could do on a computer. And if you think about all of the tasks that a human can do on a computer, it’s very extensive. And so now we kind of feel like it’s a matter of us making the model good at all of those tasks too, and figuring out a way of training on as diverse a set of tasks as possible with this very general tool. So I think there’s a lot of hard work ahead of us, but we’re very excited about it. I think we’re also excited about pushing different forms and ways of interacting with the agent. I think there’ll be a lot of new interaction paradigms between users and virtual assistants or agents. So a lot of exciting times ahead.
Lauren Reeder: I can’t wait to see it. Thank you.
Sonya Huang: Thanks for joining us. Congratulations on the launch.
Casey Chu: Thank you so much.
Isa Fulford: Thank you for having us.