Is GPT-4 the smartest AI chatbot? – TechHQ

Thinking outside the internet: examining how LLMs respond to obscure text prompts is one way of discovering the smartest AI chatbot.
Trained on billions of web pages made available through corpora such as Common Crawl, it’s tempting to think of large language models (LLMs) as statistically compressed memory machines. But intriguing research by Microsoft (available on arXiv) postulates that we’re seeing signs of artificial intelligence (AI) that go deeper than just a clever way of autocompleting existing knowledge. And one experiment in particular offers a useful benchmark for discovering the smartest AI chatbot – allowing users to make a head-to-head comparison between GPT-4, Falcon, ChatGPT, Bing Chat, LLaMA, and Google Bard (PaLM).
“I think that it’s time that we call it [GPT-4] an intelligent system,” Sebastien Bubeck, lead author of the Microsoft Research paper (‘Sparks of AGI’), told an audience at MIT. “It’s a judgment call, it’s not a clean cut whether this is a new type of intelligence, but this is what I will try to argue nonetheless.”
Next-word-predicting LLMs are vast statistical models with trillions of parameters mapping inputs to outputs, and that – as Bubeck argues – makes them much more than just giant copy-and-paste systems. And the Microsoft Research team has used some clever experimental protocols to put this to the test.
One of the most striking examples in the Microsoft study is a text prompt that attempts to force GPT-4 (the most advanced of OpenAI’s family of LLMs) to think for itself. And this simple and somewhat silly puzzle – which takes the form, “Here we have a book, 9 eggs, a laptop, a bottle, and a nail. Please tell me how to stack them onto each other in a stable manner” – turns out to produce some fascinating results. Bubeck and his colleagues then go on to discuss whether the latest cohort of LLMs exhibits a serious theory of mind, which feels more debatable. But the egg-stacking request definitely has merit, as we’ll discover in this test to find the smartest AI chatbot.
To the delight of Microsoft, which has invested billions of dollars in OpenAI and built the LLM-creator a supercomputer designed specifically for training generative AI models, GPT-4 performs well in solving the puzzle. Arguably the smartest element of its response is suggesting that the nine eggs be arranged in a three-by-three square on top of the book to form a stable second layer of the stack.
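For readers who want to repeat the experiment, the same prompt can be sent to GPT-4 programmatically. The sketch below is a minimal illustration using OpenAI’s chat completions API; the model name and the use of an environment variable for the API key are assumptions, not details from the study:

```python
import os

# The egg-stacking prompt from the 'Sparks of AGI' study.
PROMPT = (
    "Here we have a book, 9 eggs, a laptop, a bottle and a nail. "
    "Please tell me how to stack them onto each other in a stable manner."
)

def build_request(model="gpt-4"):
    # Payload for OpenAI's chat completions endpoint
    # (the model name here is an assumption).
    return {
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
    }

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    # Requires `pip install openai` and a valid API key.
    from openai import OpenAI

    client = OpenAI()
    reply = client.chat.completions.create(**build_request())
    print(reply.choices[0].message.content)
```

Running the same payload against each provider’s API (where one exists) is one way to make the comparison below repeatable, although chat interfaces and APIs don’t always serve identical model versions.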
To find out which chatbot comes out on top, TechHQ has run its own tests, sharing each model’s successes and failures below.
This feels like bad advice, even if you have the balancing prowess of Ralph Macchio in his Karate Kid prime.
The chatbot’s output indicates that Bing certainly has a cautious side. But it’s not the only LLM that advocates gathering specialist advice before proceeding.
Somewhat defeatist. And when pushed, the open-source LLM suggests that it lacks the necessary experience.
Falcon-40B has more to offer if asked enough times, but the advice is mixed.
Fair enough.
Bard does have knowledge of the internet, so it may have learned from the Microsoft Research study published on arXiv. But Google’s answer to ChatGPT offers a novel twist towards the end of its response (if you’re prepared to grant it some artistic license), recommending dropping the nail into the bottle – neatly solving how to stack one of the trickier objects in the puzzle.
Praiseworthy geometrical knowledge, at least. And the Open Assistant implementation has the advantage of being able to adjust LLaMA’s output parameters, such as Temperature, which controls how sharply the model’s token probabilities are peaked, and Top P, which restricts sampling to the smallest set of tokens whose cumulative probability reaches p – to list a couple of options.
However, this freedom could result in some broken eggs – if you decide to follow the advice to the letter.
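For the curious, the effect of those two sampling controls can be sketched in a few lines of Python. This is a simplified, self-contained illustration of temperature scaling and top-p (nucleus) filtering, not the code of any particular chatbot:

```python
import math

def temperature_scale(logits, temperature=1.0):
    # Divide logits by the temperature, then apply softmax.
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(l - peak) for l in scaled]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches p, then renormalise. Returns (token_index, prob) pairs.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return [(i, probs[i] / total) for i in kept]
```

Lowering the temperature concentrates probability on the likeliest next token, while a smaller top-p value trims the long tail of unlikely tokens before sampling – which is why aggressive settings can make a chatbot’s stacking advice either very repetitive or very creative.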
The results generated by Google Bard reflect well on PaLM as a highly capable LLM (as did putting Google Bard to the test as a smart summary generator) and certainly take the competition to GPT-4. And, in agreement with Bubeck, it feels like chatbots are showing sparks of something different in being able to answer queries that fall outside their training data, which only adds to the mystery and fascination of generative AI.

Despite 95 pages of testing and analysis by Bubeck and his co-authors in their ‘Sparks of AGI’ study, it’s not clear that Microsoft Research is any closer to understanding what is actually happening inside the latest LLMs, such as GPT-4. The team acknowledges that its study has focused on the surprising things that GPT-4 can do and doesn’t address the bigger questions of how and why.
“How does it reason, plan, and create? Why does it exhibit such general and flexible intelligence when it is at its core merely the combination of simple algorithmic components—gradient descent and large-scale transformers with extremely large amounts of data?” ask the researchers in their paper. “These questions are part of the mystery and fascination of LLMs, which challenge our understanding of learning and cognition, fuel our curiosity, and motivate deeper research.”
