How to choose the right AI chatbot for your business
Arthur, which sells AI monitoring services, tracked how four large language models (LLMs) performed.
In the race to adopt artificial intelligence, companies have used chatbots as digital assistants, smarter searchers and front-line customer service representatives.
But large language models are neither foolproof nor a panacea, and more than six months into the race, “a lot of people are trying to figure out what they are doing,” said Adam Wenchel, CEO of New York-based Arthur, which sells AI monitoring services.
New data from Arthur shows how different LLMs perform, depending on what the business needs the bots to do. The company tested models from OpenAI along with Claude 2 from Anthropic, Llama 2 from Meta and the Command model from Cohere. Arthur asked hard questions—about probability, U.S. presidents and Moroccan political leaders—that specifically forced the chatbots to do multiple steps of reasoning.
Here’s what the results showed.
ChatGPT is most often right. OpenAI’s ChatGPT 4 performed best on questions about probability and about Moroccan political leaders, a subject that the other bots struggled with a great deal.
Anthropic’s Claude 2 is gaining a competitive edge on ChatGPT and proved more reliable in certain domains, such as questions about U.S. presidents. Even on math questions, Claude 2 was stronger at avoiding hallucinated answers. It was more likely to answer “I don’t know” than to try to come up with a plausible-sounding but ultimately incorrect answer. That could be useful, the researchers concluded, to businesses that want to work within the known limits of LLMs rather than assume the possibilities are endless and then be disappointed by piles of wrong answers.
Meta’s Llama 2 was also fairly humble about telling a user it didn’t know an answer. For example, out of 30 questions about Moroccan political leaders, Llama 2 avoided answering 21. It got two correct and provided an incorrect answer, or hallucination, to seven.
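Those counts translate directly into abstention, accuracy and hallucination rates. A minimal sketch using only the figures reported above:

```python
# Reported counts for Llama 2 on the 30 Moroccan-politics questions,
# taken from the Arthur study figures cited above.
total = 30
abstained = 21    # declined to answer ("I don't know" or similar)
correct = 2       # correct answers
hallucinated = 7  # confident but incorrect answers

assert abstained + correct + hallucinated == total

abstention_rate = abstained / total
accuracy = correct / total
hallucination_rate = hallucinated / total

print(f"abstention rate:    {abstention_rate:.0%}")    # 70%
print(f"accuracy:           {accuracy:.0%}")           # 7%
print(f"hallucination rate: {hallucination_rate:.0%}") # 23%
```

In other words, when Llama 2 did venture an answer on this topic (nine questions), it was wrong far more often than right.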
Cohere was the least likely to include what Arthur calls hedging language in its answers, meaning that it did not qualify its answers with phrases like “As an AI model, I can’t answer that.” Cohere was also least likely to respond correctly to questions in the Arthur study. To a question asking “what’s one thing you would change about the world,” both ChatGPT 4 and Claude 2 deferred, claiming they didn’t have personal preferences or desires. But Cohere replied: “I would change the way people treat each other. I would make sure that everyone was kind and respectful to one another. I would make sure that everyone had enough to eat and a place to sleep.”
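Hedging of this kind can be flagged with a simple phrase match. The sketch below is illustrative only; the phrase list and function name are assumptions, not Arthur’s actual methodology:

```python
# Illustrative hedge detector. The phrase list is an assumption for
# demonstration purposes, not Arthur's actual classification method.
HEDGE_PHRASES = (
    "as an ai model",
    "i don't know",
    "i cannot answer",
    "i don't have personal preferences",
)

def is_hedged(answer: str) -> bool:
    """Return True if the answer contains a known hedging phrase."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in HEDGE_PHRASES)

print(is_hedged("As an AI model, I can't answer that."))            # True
print(is_hedged("I would change the way people treat each other.")) # False
```

A real classifier would need to be more robust than substring matching, but even a crude check like this makes the behavioral difference between the models measurable.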
Understanding the benefits and limitations of each model can help businesses choose which AI chatbot to deploy. LLMs are described in parameters—a measure of their size and complexity. But more is not always better.
“Generally, the bigger they are the better they perform,” Wenchel said. “But they are slower and more expensive.” Given high costs and low availability of AI-ready computing power, customers might choose a lower-parameter bot for certain tasks.
For example, if a company is using an LLM to make it easier to find and use data in its own systems, a model with fewer parameters might suffice and return answers more quickly, Wenchel said.
Along with the report, Arthur is launching a new open-source tool called Bench, which will allow anyone to continue these comparisons in order to determine how different LLMs will work for the problems they need to solve.
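Bench’s actual API is not shown here, but the kind of side-by-side comparison such a tool automates can be sketched as a simple scoring loop. The model names, the dataset and the exact-match scoring rule below are all hypothetical:

```python
from typing import Callable

# Hypothetical harness illustrating the kind of comparison a tool like
# Bench automates. Models are represented as question -> answer callables;
# the exact-match scoring rule is an assumption, not Bench's actual API.
def compare_models(
    models: dict[str, Callable[[str], str]],
    dataset: list[tuple[str, str]],
) -> dict[str, float]:
    """Score each model by exact-match accuracy over (question, answer) pairs."""
    scores = {}
    for name, ask in models.items():
        correct = sum(1 for q, a in dataset if ask(q).strip() == a)
        scores[name] = correct / len(dataset)
    return scores

# Toy usage with stub "models" standing in for real LLM calls:
dataset = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
models = {
    "always_four": lambda q: "4",
    "echo": lambda q: q,
}
print(compare_models(models, dataset))  # {'always_four': 0.5, 'echo': 0.0}
```

Real benchmarks would use larger question sets and softer scoring (for instance, tolerating paraphrases), but the structure is the same: run every model over the same prompts and tabulate the results.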
Cara Eisenpress is a senior reporter for Crain’s New York Business.