ChatGPT’s New Upgrade Teases AI’s Multimodal Future – IEEE Spectrum

OpenAI’s chatbot learns to carry a conversation—and expect competition
ChatGPT isn’t just a chatbot anymore.
OpenAI’s latest upgrade grants ChatGPT powerful new abilities that go beyond text. It can tell bedtime stories in its own AI voice, identify objects in photos, and respond to audio recordings. These capabilities represent the next big thing in AI: multimodal models.
“Multimodal is the next generation of these large models, where it can process not just text, but also images, audio, video, and even other modalities,” says Linxi “Jim” Fan, senior AI research scientist at Nvidia.
ChatGPT’s upgrade is a noteworthy example of a multimodal AI system. Instead of using a single AI model designed to work with a single form of input, like a large language model (LLM) or speech-to-voice model, multiple models work together to create a more cohesive AI tool.
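The idea of several models cooperating behind one interface can be sketched in a few lines of Python. This is only an illustration of the general pattern, not OpenAI's actual design: every function below is a hypothetical stub that returns a placeholder string, and the names (`Prompt`, `ENCODERS`, `respond`) are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Prompt:
    kind: str     # "text", "image", or "audio"
    payload: str  # stand-in for the raw content (for example, a file name)


# Hypothetical stand-ins for separate single-purpose models.
def transcribe_audio(payload: str) -> str:
    return f"transcript of {payload}"


def describe_image(payload: str) -> str:
    return f"description of {payload}"


def passthrough_text(payload: str) -> str:
    return payload


# One converter per input modality; each turns its input into text.
ENCODERS: Dict[str, Callable[[str], str]] = {
    "text": passthrough_text,
    "image": describe_image,
    "audio": transcribe_audio,
}


def language_model(text: str) -> str:
    # Hypothetical stand-in for the large language model.
    return f"reply to [{text}]"


def synthesize_voice(text: str) -> str:
    # Hypothetical stand-in for a text-to-speech model.
    return f"<audio:{text}>"


def respond(prompt: Prompt, speak: bool = False) -> str:
    """Route a prompt through the matching converter, then the language model."""
    if prompt.kind not in ENCODERS:
        raise ValueError(f"unsupported modality: {prompt.kind}")
    text = ENCODERS[prompt.kind](prompt.payload)
    reply = language_model(text)
    return synthesize_voice(reply) if speak else reply


if __name__ == "__main__":
    print(respond(Prompt("image", "bike.jpg")))
    print(respond(Prompt("audio", "question.wav"), speak=True))
```

The user sees one entry point, `respond`, while the choice of model for each kind of input happens behind it. A production system could just as well use a single model trained on several modalities at once; the routing shown here is one simple possibility.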
OpenAI provides three specific multimodal features. Users can prompt the chatbot with images or voice, as well as receive responses in one of five AI-generated voices. Image input is available on all platforms, while voice is limited to the ChatGPT app for Android and iOS.
A demo from OpenAI shows ChatGPT being used to adjust a bike seat. A befuddled cyclist first snaps a photo of their bike and asks for help lowering the seat, then follows up with photos of the bike’s user manual and a tool set. ChatGPT responds with text describing the best tool for the job and how to use it.

These multimodal features aren’t entirely new. GPT-4 launched in March 2023 with the ability to understand image prompts, a capability that some OpenAI partners, including Microsoft’s Bing Chat, put into practice. But tapping these features required API access, so they were generally reserved for partners and developers.
[Image caption: GPT-4’s multimodal features appeared in Bing Chat in the summer of 2023. Credit: Microsoft]
The features are now available to anyone willing to pay US $20 a month for a ChatGPT Plus subscription. Their integration with ChatGPT’s friendly interface is another perk. Image input is as simple as opening the app and tapping an icon to snap a photo.
Simplicity is multimodal AI’s killer feature. Current AI models for images, videos, and voice are impressive, but finding the right model for each task can be time-consuming, and moving data between models is a chore. Multimodal AI eliminates these problems. A user can prompt the AI agent with various media, then seamlessly switch between images, text, and voice prompts within the same conversation.
“This points to the future of these tools, where they can provide us almost anything we want in the moment,” says Kyle Shannon, founder and CEO of the AI video platform Storyvine. “The future of generative AI is hyper-personalization. This will happen for knowledge workers, creatives, and end users.”
ChatGPT’s support for image and voice is just a taste of what’s to come.
“While there aren’t any good models for it right now, in principle you can give it 3D data, or even something like digital smell data, and it can output images, videos, and even actions,” says Fan. “I do research at Nvidia on game AI and robotics, and multimodal models are critical for these efforts.”
Image and voice input are the natural start for ChatGPT’s multimodal capabilities. It’s a user-facing app, and these are two of the most common forms of data a user might want to use. But there’s no reason an AI model can’t be trained to handle other forms of data, whether it’s an Excel spreadsheet, a 3D model, or a photograph with depth data.
That’s not to say it’s easy. Organizations looking to build multimodal AI face many challenges. The biggest, perhaps, is wrangling the vast amounts of data required to train a roster of AI models.
“I think multimodal models will have roughly the same landscape as the current large language models,” says Fan. “It’s very capital-intense. And it’s probably even worse for multimodal, because consider how much data is in the images, and in the videos.”
That would seem to give the edge to OpenAI and other well-heeled AI startups, such as Anthropic, creator of the Claude chatbot, which recently entered an agreement worth “up to $4 billion” with Amazon.

But it’s too soon to count out smaller organizations. Fan says research into multimodal AI is less mature than research into LLMs, leaving room for researchers to find new techniques. Shannon agrees and expects innovation from all sides, citing the rapid iteration and improvement of open-source large language models like Meta’s Llama 2.
“I think there will always be a pendulum between general [AI] tools and specialty tools,” says Shannon. “What changes is that now we have the possibility of truly general tools. The specialization can be a choice rather than a requirement.”
Matthew S. Smith is a freelance consumer-tech journalist. An avid gamer, he is a former staff editor at Digital Trends and is particularly fond of wearables, e-bikes, all things smartphone, and CES, which he has attended every year since 2009.