What's the best chatbot for me? Researchers put LLMs through their … – Nature.com
Elizabeth M. Humphries is a postdoctoral fellow in the Fred Hutchinson Cancer Center Data Science Lab in Seattle, Washington.
Carrie Wright is a senior staff scientist in the Fred Hutchinson Cancer Center Data Science Lab in Seattle, Washington.
Ava M. Hoffman is a senior staff scientist in the Fred Hutchinson Cancer Center Data Science Lab in Seattle, Washington.
Candace Savonen is a data scientist in the Fred Hutchinson Cancer Center Data Science Lab in Seattle, Washington.
Jeffrey T. Leek is vice-president and chief data officer of the Fred Hutchinson Cancer Center and director of the Fred Hutchinson Cancer Center Data Science Lab in Seattle, Washington.
Data scientist Rumman Chowdhury (centre) advises students tasked with breaking artificial-intelligence chatbots during a competition in July. Credit: Marvin Joseph/The Washington Post via Getty
The widely hyped and controversial large language models (LLMs) — better known as artificial intelligence (AI) chatbots — are becoming indispensable aids for coding, writing, teaching and more. Their growing popularity has been matched by an increase in user-friendly options that are accessible through Internet browsers. By our count, there are at least eight major options, and even more niche ones; you might have even tried a few. But you probably haven’t had time to systematically test your prompts on several bots at once, so you might not be getting the most out of them.
To better match tools with applications, we tested eight popular browser-based LLMs in formal and casual writing, text and tone editing, and programming tasks. These LLMs were trained on different data and have different ‘personalities’ and approaches to answering questions. We spent a shocking amount of time and energy managing the frustration that comes with poorly written text and confusing AI-generated code in our search for the best collaborator. In the end, you will have to balance their strengths and weaknesses to find the perfect match.
NatureTech
Here we provide a quick summary of our (non-quantitative, non-scientific) impressions of each chatbot’s behaviour (see ‘Which chatbot is right for you?’).
Google’s Bard AI is fun to use. In our experience, it offers the most human-like responses, probably because its training data contained less formal communication, including posts on social media and online discussion boards. For instance, we asked Bard what its zodiac sign might be if it were human. It said that, on the basis of when it went live, it would be a Virgo. It also responded with “I don’t know” instead of a wrong answer more frequently than did other chatbots. However, it struggled when asked specific programming questions. Bard is a great tool for changing the tone of your writing to be more approachable to lay audiences and for writing and refining e-mails, or if you want to interact with a bot that has a natural style of speaking.
Claude, developed by the start-up company Anthropic in San Francisco, California, has a conversational style but feels more formal than Bard. It also has the best grasp of wordplay. In our testing, Claude (which is available in two forms: Claude-instant and Claude 2) was the only LLM that could reliably suggest titles or acronyms that made sense, and we have used it to name several projects. We also liked how it advises on changing the tone and formality of a writing sample for different audiences. Claude is particularly good at summarizing written text and performed well at writing code.
Most people who have dabbled with LLMs have probably tried ChatGPT-3.5 or the updated version, ChatGPT-4 — made by OpenAI in San Francisco. Another option is Sage, from ThoughtSpot in Mountain View, California; it was built using the GPT architecture but was trained on different data. All three performed similarly. These bots have the most straightforward communication style of those we tested. ChatGPT will always give an answer, but sometimes the answer is incorrect. It also sometimes invents references [1]. And it doesn’t always change its answers substantially when corrected by the user.
Carrie Wright, Candace Savonen, Ava Hoffman and Elizabeth Humphries (left to right) have investigated how large language models can be applied to science. Credit: Carrie Wright and Clifton McKee
ChatGPT-3.5 and ChatGPT-4 can offer extra context in their answers without being asked to do so, and are great places to start when planning a project or document. When it comes to editing your writing, ChatGPT-4 performs better because it doesn’t smooth away the underlying message as ChatGPT-3.5 occasionally does.
Phind is different from its competitors: it was designed to answer software-development questions and excels at that task. We especially liked how it includes links to posts on online forums and blogs that cover the same sort of programming issue as that in your query. Phind also works well as a general search engine. However, when it comes to writing text, it sometimes copies directly from its source material, so watch for plagiarism. But do keep Phind in mind if you have specific programming questions, or if you want Wikipedia-like information.
Llama, from Meta in Menlo Park, California, has become available to the general public only in the past few months. So far, we haven’t found it to be all that different from its competitors. It will answer hypothetical questions as Bard does, and seems to provide code that works with minimal debugging.
The personality differences between the LLMs are well illustrated by the answers that each bot gave to a popular get-to-know-you question: what fictional character do you identify with the most? Bard engaged the way we expected it to: its answer was the android Data from Star Trek: The Next Generation, because Data is an AI that is intelligent, curious, always learning and trying to understand what it means to be human.
Claude and ChatGPT interpreted the question literally and answered that, as AI language models, they do not have emotions or experiences and cannot identify with fictional characters. Claude added that, although it has no independent sense of self, other LLMs might have been programmed with personalities that were modelled after those of certain characters. ChatGPT followed its denial with an offer to provide information about specific fictional characters.
Similarly, Phind said that it was an AI bot and did not identify with a fictional character, but its answer included a list of popular fictional characters with whom people often identify, as well as links to lists such as the ‘Top 120 Iconic Fictional Characters’. We encountered similar results when asking the bots for their Hogwarts houses from the Harry Potter series, zodiac signs and personality types from popular tests, such as Myers–Briggs.
Llama answered that it was an AI bot but did offer several characters with which it might share characteristics. However, when we changed the question to, “If you were human, what fictional character would you most identify with?” Llama replied Sherlock Holmes, because he is highly analytical and detail oriented.
Whichever LLM you choose, if you want to keep your long-term relationship functional and happy, consider these tips.
First, patience and refinement are key. Your queries need to be clear about the output you want and provide enough context for the LLM to work with. Expect some back-and-forth. It might take more time to communicate well to the LLM than it would to do the task yourself, so think carefully about where you want to spend your effort.
Second, test everything. All LLMs are fallible, so double-checking what they tell you is a must, whether that involves testing suggested code, verifying citations or making sure the basic facts are right. Most LLMs have been trained on data that are biased in some way, so their answers can be biased as well. And chatbots can and do change over time — for instance, Bard’s developers say that the chatbot will be the first LLM to admit how confident it is in its response.
Finally, the importance of human decision-making when using AI cannot be overstated: LLMs might be poised to change how we work, but they are still only as good as the humans in front of the keyboard.
Which chatbot is right for you?
Bard
• Made by Google.
• Free.
• Can access current information on the Internet.
• Admits when it cannot answer your query.
• Does not provide sources for information unless prompted.
• Requires very specific prompts.
• Might interpret code incorrectly.
ChatGPT-3.5
• Made by OpenAI; also accessible through Poe by Quora.
• Free.
• Cannot access the Internet (and thus has no access to information past 2021).
• Writes reasonable (if sometimes inaccurate) code in several programming languages, and can debug and optimize code.
• Generates fluent English text with extensive detail.
• Prone to inventing non-existent sources and articles.
• Mixes accurate and inaccurate statements.
ChatGPT-4.0
• Made by OpenAI; also accessible through Poe by Quora.
• Requires a subscription. (Poe’s implementation provides one free query per day.)
• Cannot access the Internet.
• More transparent than ChatGPT-3.5 about the limitations of its training data.
• Better than ChatGPT-3.5 at retrieving real citations.
• Better than ChatGPT-3.5 at refining supplied text without losing the main message.
• Struggles to retrieve certain types of citation (such as conference abstracts).
Llama
• Made by Meta.
• Accessible through Poe by Quora.
• Free.
• Can access information on the Internet.
• Writes reasonable code in several programming languages (however, that code can be difficult to parse).
Phind
• Made by Phind.
• Formerly called Hello.
• Free.
• Can access current information on the Internet.
• Provides multiple solutions to coding questions in a single answer.
• Provides links to the blog posts and forums that its answers come from.
• Not designed for applications outside software development.
• Prone to plagiarism.
• Has difficulty answering questions that cannot be easily found on the Internet.
• Little to no information online about how it was created or trained.
Assistant
• Made by OpenAI (GPT-3.5 architecture).
• Accessible through Poe by Quora.
• Free.
• Cannot access the Internet.
• Designed for language translation, summarization and answering questions.
• Can write and debug code in multiple programming languages.
• Can generate fluid English text and provide reasonable edits and suggestions to existing writing.
• Provides sparse supporting information on generated code, such as what each line means.
• Mixes accurate and inaccurate statements.
Claude-instant
• Made by Anthropic.
• Accessible through Poe by Quora.
• Free.
• Includes multiple interface options, including Slack.
• Can write and edit English text and provide extensive detail when asked.
• Can write and edit code in several programming languages, and offer software-development advice.
• Good at adapting text to different levels of expertise.
• Mixes accurate and inaccurate statements.
Claude 2
• Made by Anthropic.
• Accessible through Poe by Quora.
• Poe’s implementation provides a few free queries each day; more than that requires a subscription.
• Can write and edit text in several programming languages.
• The quality of its performance is about the same as that of Claude-instant.
• Mixes accurate and inaccurate statements.
Some previously tested bots (NeevaAI, Dragonfly) are no longer available to use.
doi: https://doi.org/10.1038/d41586-023-03023-4
This is an article from the Nature Careers Community, a place for Nature readers to share their professional experiences and advice. Guest posts are encouraged.
1. Ziwei, J. et al. ACM Comput. Surv. 55, 248 (2023).
J.T.L. teaches Coursera courses that cover topics in AI, which generate revenue; is a co-founder of a company, Synthesize Bio, that uses AI but does not develop LLMs; and is a co-founder of Papr, a company that is developing an app for rapid peer review.
© 2023 Springer Nature Limited