ChatGPT Is Getting Dumber at Basic Math – Slashdot

ChatGPT is a large language model made for generating text. It’s not designed to do math, but sometimes it can write responses that answer math problems correctly if there is enough training data with similar problems.
If you need to solve math problems try WolframAlpha. Heck a collab between the two could probably be a good math assistant.
This is all nonsense and a misunderstanding of what they did and how ChatGPT works.
The LLM is always bad at math beyond absolutely trivial stuff.
They had the version from a few months ago hooked up to a Wolfram Mathematica back end (unannounced apart from a separate plug-in to Mathematica) and the connection to that is just wonky at the moment.
The LLM does not “do” math out of the box.

Unfortunately the article is paywalled, but I’m assuming the fuzziness is more the work of the WSJ reporter than the authors of the paper. The prime number thing does strike me as a weird test for an LLM, but look at this exchange I just had with 3.5 (not using the Wolfram plugin):
Me: What is 2*(5*5 + 5)?
ChatGPT

Let’s solve this step by step:
First, calculate the value inside the parentheses: 5 * 5 + 5 = 25 + 5 = 30.
Then, multiply the result by 2: 2 * 30 =

So prime tests, I certainly wouldn’t expect ChatGPT to handle that, but it’s shockingly capable of math.


So prime tests, I certainly wouldn’t expect ChatGPT to handle that, but it’s shockingly capable of math.

Nope, it is not doing anything with the math.

The training data includes math books, conversations about math, tutorials about solving math problems, and Common Crawl includes math education websites and random discussion where people explain and argue about why they think 1+2*5 should be 15 instead of 11.

Neither the math problems nor those conversations are directly encoded. Instead the model has encountered enough similar statements among the billions of messages that the probability of those specific words following other words happened to be quite high. Just like language phrases, number families form clusters that you are likely to see together. If you see 1 and 2, there is a high probability you are going to see a 3 next. When the training data includes so many billion discussion group messages, including reddit subs for math education, it sees similar clusters of words all the time.

The model knows nothing about the math it happens to explain. The model has simply scanned enough math conversations that guessing the most probable words merely happened to include a correct answer.
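To make the "most probable next word" idea concrete, here is a toy sketch. It is nothing like a real transformer: it only counts which token follows which in a tiny made-up corpus and picks the most frequent follower. The corpus and the function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up "training data"; a real model sees billions of messages.
corpus = [
    "1 2 3 4",
    "1 2 3 go",
    "1 2 3 4 5",
    "2 + 2 = 4",
    "2 + 2 = 4",
    "2 + 2 = 5",  # someone on the internet being wrong
]

def train_bigrams(lines):
    """Count how often each token follows each other token."""
    follows = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1
    return follows

def most_probable_next(follows, token):
    """Return (next_token, probability) for the most frequent follower, or None."""
    counts = follows.get(token)
    if not counts:
        return None  # token never seen with a follower
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())

model = train_bigrams(corpus)
print(most_probable_next(model, "1"))  # "2" always followed "1" in this corpus
print(most_probable_next(model, "="))  # "4" wins only because it was typed more often
```

The counter never computes 2 + 2. It reports "4" after "=" because two of the three lines said so, and it would report "5" just as happily if the corpus leaned the other way.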

Ok then:
What is 2*(5*6 +22.3 + 342/4) + 14*32 + 21+8)
ChatGPT

Let’s break down the expression and solve it step by step:
Inside the first set of parentheses:
5 * 6 + 22.3 + 342/4 = 30 + 22.3 + 85.5 = 137.8

Multiply the result by 2:
2 * 137.8 = 275.6

Inside the second set of parentheses:
14 * 32 = 448

Add the constant
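For anyone who wants to check the quoted steps, here is a minimal sketch using exact fractions so that 22.3 does not pick up floating-point noise. The transcript is cut off before the final sum, so the total below is computed here, not quoted from ChatGPT, and it assumes the unmatched closing parenthesis in the prompt is a typo.

```python
from fractions import Fraction

# Exact rational arithmetic: Fraction("22.3") is exactly 223/10.
inner = 5 * 6 + Fraction("22.3") + Fraction(342, 4)  # first parenthesised group
doubled = 2 * inner
product = 14 * 32

# Assumption: the stray trailing ")" in the prompt is ignored.
total = doubled + product + 21 + 8

print(float(inner), float(doubled), product, float(total))
```

The three intermediate values match what the model wrote (137.8, 275.6 and 448).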
The quasi-math ability is an emergent behavior of the connections made through training, and it is fascinating.
The fact that it suffers from hallucinations makes it hardly usable though: you have to be on high alert when reviewing any part of the output to see if something that looks completely normal is complete nonsense. That arguably requires more effort than doing the thing yourself.
I agree with you that the model doesn’t “know” math, the procedural sequence of steps required for a specific computation. That said, my experience is they do have some reasoning ability beyond just barfing back words based on context.
To be specific, the non-optimized completion prompt I banged out in 5 minutes accurately classifies your post, 3 of 4 insurance commercial transcripts, and 10 of 12 Wikipedia article abstracts I tested including the Spanish article on the Pigeon family, Eliza (software progra

they do have some reasoning ability
This is more of what I expect on Reddit or Twitter. The /. crowd is usually more adept at tech than that.
It does maths in as much as it does anything else.
It’s taking a best guess based on a very large statistical model and will present bullshit as its answer because it has no way of evaluating whether the statistics are accurate.
The point about maths is that it’s very, very easy to generate prompts whose answers can be trivially verified. The fact that it doesn’t do maths any differently from anything else is why this is a useful test.
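A rough sketch of what "prompts whose answers can be trivially verified" could look like in practice. The prompt wording, the number range and the `ask_model` callable are all hypothetical stand-ins, not anything taken from the paper or from OpenAI's API.

```python
import random

def is_prime(n: int) -> bool:
    """Deterministic trial division; fine for the small numbers used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def make_prompts(count: int, lo: int = 1000, hi: int = 20000, seed: int = 0):
    """Build (prompt, expected_answer) pairs with known ground truth."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(count):
        n = rng.randint(lo, hi)
        expected = "yes" if is_prime(n) else "no"
        pairs.append((f"Is {n} a prime number? Answer yes or no.", expected))
    return pairs

def score(ask_model, pairs) -> float:
    """Fraction of prompts where the reply starts with the expected word."""
    hits = sum(
        ask_model(prompt).strip().lower().startswith(expected)
        for prompt, expected in pairs
    )
    return hits / len(pairs)
```

`ask_model` is any function that takes a prompt string and returns the reply text, so the same scoring loop can be pointed at different model versions.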
This is true. Something I have noticed with ChatGPT and Bard, when I have used them to try to figure out math formulas I can plug into spreadsheets, is that they do not “check their work”.
They will give you an answer based on their crawled knowledge of what other people have done before. In one case I asked ChatGPT for a formula to solve the minor segment of a circle (pretty basic trig), and I already knew the answer because I had a bunch of examples drawn up in CAD.
Both systems gave me formulas that produced the wrong answer again and again, even when I told them “this is the answer, how do I get to this with this information?” Both seemed incapable of actually performing the math and understanding they were incorrect.
You are correct that Wolfram actually seems to contextualize and perform the math, not just talk about it.
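For reference, a minimal sketch of one such formula, assuming the quantity wanted was the area of the minor segment and that the known inputs were the radius and the chord length. The post does not say which inputs were actually available. It uses the standard result: area = r²/2 × (θ − sin θ), where θ is the central angle.

```python
import math

def minor_segment_area(radius: float, chord: float) -> float:
    """Area between a chord and the smaller arc it cuts off."""
    if not 0 < chord <= 2 * radius:
        raise ValueError("chord must be positive and no longer than the diameter")
    theta = 2 * math.asin(chord / (2 * radius))  # central angle in radians
    return 0.5 * radius ** 2 * (theta - math.sin(theta))

# Sanity check: a chord equal to the diameter cuts the circle in half.
print(minor_segment_area(1.0, 2.0), math.pi / 2)
```

Checking the result against a known case like the half circle is exactly the "check their work" step the chatbots skipped.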
In other news, my hammer is not great at removing screws.
ChatGPT is a large language model made for generating text. It’s not designed to do math
Eh? No, false; that would be what’s called a half-truth at best. ChatGPT contains large language models, but it does answer math questions and reasoning questions, and it can perform all sorts of tasks regarding code, etc.
ChatGPT is not designed for particular tasks; although the developers can have supplemented their system and model with specialized trainings, plugins, and additional models improving capabilities a
No, it’s a whole truth. GPT is a language model. ChatGPT is a language model trained to interact with humans in ways they like. It is specifically and explicitly trained for that purpose.
It’s trained to answer English questions with English answers, so you can ask questions about numbers and math and ducks. It will give answers to those questions, but their accuracy is entirely up to whether that particular information is captured in the underlying language model.
It knows 8 is not a prime number, and it can list the factors. Undoubtedly that information was in some text it was trained on.
When asked “Is 103858 + 1 a prime number?” it replies:
Yes, the number 103858 + 1 is a prime number. It is known as a Mersenne prime and is specifically denoted as 2^103858 + 1. Mersenne primes are prime numbers that have the form 2^p – 1, where “p” is also a prime number. In this case, both 103858 and 2^103858 + 1 are prime numbers, making it a Mersenne prime.
This is clearly wrong, but it’s wrong in an interesting way. ChatGPT has noticed the +1 pattern and connected it with Mersenne primes. 2^103858+1 is not a Mersenne prime either, but it kind of looks like one.
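This particular claim is easy to check without ever writing the number out. A small sketch using Python's three-argument `pow`, which does modular exponentiation:

```python
# 2**103858 + 1 has over 31,000 digits, but modular exponentiation
# only ever handles small numbers.
exponent = 103858

# pow(base, exp, mod) computes (2 ** exponent) % 5 without building the huge number.
remainder = (pow(2, exponent, 5) + 1) % 5
print(remainder)          # 0 means 5 divides 2**103858 + 1

# The exponent is even, so it is not prime either.
print(exponent % 2 == 0)
```

Powers of 2 cycle through 2, 4, 3, 1 modulo 5, and 103858 leaves remainder 2 when divided by 4, so 2^103858 ≡ 4 (mod 5) and adding 1 gives a multiple of 5. The number is composite, and the exponent the reply called prime is even.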
For an end user, a box they can type questions into and get correct answers is pretty useful. In terms of offering a service on the Internet, giving chatGPT the ability to answer math questions correctly would be solving a problem. OpenAI seems to have “solved” this problem by training the premium version to go look it up on Wolfram Alpha when it recognizes particular types of questions.
From an AI perspective, chatGPT’s poor ability to do math makes it better AI. Its ability to creatively misunderstand hard
Yes. The real reason for this post is to fix a fat finger accidental mod.
This is the sort of math problem that is complicated for people but simple for computers.
Why do people keep making these mistakes? The sorts of things that AI is good at are things that are traditionally simple for humans but not computers. They’re bad at things that are traditionally simple for computers.
Another mistake is people assuming that since they’re large “language” models that they should be good at things like counting the number of a certain letter in a word, or whatnot. Except LLMs are blind to actual words – they only see tokens, not letters.
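A toy illustration of the token point. The vocabulary and IDs below are made up, and real tokenizers learn tens of thousands of pieces with a different algorithm; this sketch only does greedy longest-match splitting to show what the model ends up receiving.

```python
# Hypothetical vocabulary; the pieces and ID numbers are invented.
VOCAB = {"straw": 101, "berry": 102, "st": 103, "raw": 104,
         "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8}

def tokenize(word: str) -> list[int]:
    """Greedy longest-match split of a word into token IDs."""
    ids = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            piece = word[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return ids

print(tokenize("strawberry"))  # the model receives these IDs, not ten letters
```

With this vocabulary "strawberry" arrives as two opaque IDs, so counting the letter "r" means recalling how each piece is spelled instead of looking at characters.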
That seems intuitively true but it is actually an oversimplification. What you’re saying is not a general AI characteristic; really it’s sort of a superficial derivative observation of LLM behavior. LLMs are themselves advanced statistical models – and those models are actually so complex they are impossible for humans to operate directly, and it is something only a computer is good at.
I used to call them statistical models, because other people use that terminology, but I’m increasingly convinced that that’s a misleading way to describe them, giving the impression that they’re just some sort of new form of Markov Chain-based text prediction. When a more accurate description of what they’re doing is logic. You can represent logic (esp. in the world, where it’s commonly fuzzy and extremely complicated) in the form of statistics, regardless of what is conducting the logic (including us) –
“Logic” is absolutely the wrong word for what they do. Look at this for example: https://www.reddit.com/r/gaming/comments/10yv71h/obscure_arcade_game/ [reddit.com]. The person asking the question gives a list of characteristics for the game they’re trying to remember:
I think I see the OP’s argument, and although I wouldn’t call what they do “logic”, I think it is not entirely wrong, in the sense that what they apply is “causality”, which intersects with logic.
Of course any causality ChatGPT knows is inferred from its statistical training, in the sense of this Philip K. Dick quote:
“In one of the most brilliant papers in the English language Hume made it clear that what we speak of as ‘causality’ is nothing more than the phenomenon of repetition. When we mix sulphur with
I couldn’t answer that. Why do you think “knowing Snow Bros 2” and “knowing Cadillacs and Dinosaurs” is some sort of superb test of a LLM? I’ve never even heard of either.
What you’re talking about is the “hallucination problem”, in that current LLMs do not contain a metric for measuring confidence levels in their “decisions”, so if it doesn’t “know” the answer, it will confidently assert its best guess. Measures to incorporate confidence assessment are in progress.
Let’s redo your test for something less obscure, so we’re not testing obscurity and whether it’ll hallucinate if it doesn’t know the answer, and instead just simply test logic. Let’s say the original Star Wars. I’ll ask it:
I’m trying to remember a movie.
* It starts on a desert planet
* There’s a couple robots – one is short, kind of like a trash can, and the other is tall and thin, and like shiny gold or brass or something.
* There’s this teenager fighting with his family, and then s
But even this is largely a trivia-recall test, and still runs the risk of hallucination. Indeed, let’s take the generation out altogether.
User: Pretend to be a probabilistic state machine. Answer only with a floating point number between 0 and 1, representing the probability of a given prompt being true. Do not output any other text.
First prompt: A bowler’s next throw will be a strike.
ChatGPT: 0.4
User: A bowler’s next throw will be a strike. The bowler is 7 years old.
ChatGPT: 0.2
User: A bowler’s next throw
I’d also add that while it’s difficult to lay out the logic process being used on LLMs, it’s easier with other AI tasks like image recognition – and you can see every “decision” being made [distill.pub].
The sorts of things that AI is good at are things that are traditionally simple for humans but not computers.
It’s like that’s the definition of AI or something.
It’s not a mistake, it’s good design.
LLMs do maths the same way they do everything else, so this test simply tests the model, not its ability to do maths per se.
The nice thing about this is that you can generate a lot of prompts and trivially test the results for bullshit.
The experiment therefore shows that in some areas it is becoming less factual and more bullshitty. That’s the easily testable area. It would be very unlikely that this result is unique to primes or maths in general, since an LLM doesn’t “do”
The problem with maths is that as a general rule, doing maths requires iteration. And LLMs can’t iterate. It’s one-through, no further re-thinking. So yes, doing maths is particularly challenging for LLMs. LLMs instead (assuming they’re not supplied with external tools to solve maths problems – LLMs are adept tool users) often use approximation tricks to approximate a solution, if it’s not a problem that’s so commonly repeated that it’s just memorized the answer. For example, if you multiply two number
Actually, the problem is simple for a computer. What is not simple is recognizing the question is this type of problem and then formalizing it. If ChatGPT notices it should just pass this on to Wolfram Alpha, the answer will be perfectly fine. It fails to notice that and tries to come up with a statistically derived answer. Math is far too much dependent on understanding (hence so many people are so bad at it) for that to ever work.
Yes, if you have Wolfram Alpha linked in. LLMs are quite skilled at using external tools if they’re trained to the “awareness” that they’re allowed to use them.
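A minimal sketch of the "recognize it, then hand it off" idea. This is not how the Wolfram plugin works: the calculator below is a local stand-in for the external tool, the `llm` argument is a placeholder callable, and the recognition step is a crude regular expression. All names are hypothetical.

```python
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def calculator(expr: str):
    """Evaluate plain arithmetic by walking the syntax tree (no eval())."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("not plain arithmetic")
    return walk(ast.parse(expr, mode="eval").body)

_ARITH = re.compile(r"[\d\s.+\-*/()]+")

def answer(question: str, llm=lambda q: "(free-text guess)"):
    """Send pure arithmetic to the calculator, everything else to the LLM stand-in."""
    text = question.strip().rstrip("?").removeprefix("What is").strip()
    if _ARITH.fullmatch(text) and any(c.isdigit() for c in text):
        try:
            return calculator(text)
        except (SyntaxError, ValueError, ZeroDivisionError):
            pass  # looked like arithmetic but was malformed; fall through
    return llm(question)

print(answer("What is 2*(5*5 + 5)?"))
print(answer("Why is the sky blue?"))
```

The evaluation is the easy part. The fragile part is the recognition step, which here only catches questions that are already written as bare expressions.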

https://slashdot.org/story/23/… [slashdot.org] Same stuff, now behind a paywall. Thanks slashdot, very useful. But at least we keep a good ratio of AI vs anything else stories, right?

Yeah, I was going to ask if it was dumber than it was last month, when Slashdot reported on it…
No, that’s not how they work at all.
Here’s a pretty decent explainer on how Transformer models work: https://arstechnica.com/scienc… [arstechnica.com]
It should be noted there’s a lot of suspicion that GPT4 has a quite different layout, possibly using a “Consensus of experts” model (essentially you train multiple LLMs with expertise in various topics and then kinda mind-meld them via a consensus mechanism), but it’s all unconfirmed rumors and leaks that might not be leaks.
There is a real danger in people assigning terms like “smart” and “dumb” to the current AI products. They are neither smart nor dumb, they are just algorithms. In the case of ChatGPT, it is designed to keep a conversation going, and it is pretty good at it. But to consider it smart or dumb is to give it way too much credit.
That someone published a test like that, and that WSJ is reporting on it, shows the public is souring on generative AI, really on the people pushing it on them.
Indeed. This thing has no insight or understanding of anything, hence it cannot be dumb or smart. On the other hand, the kind of “conversations” it can keep going are pretty dumb. But that is apparently close enough to what many people actually do for it to work.
The more you read internet social media, the dumber you get. Everybody knows that.
Making computers more human-like means making them dumber. A real Turing test would test how much nonsensical BS the computer can spew with confidence.
Making computers more human-like means making them dumber.
Oh.. they could make it really human like… “Is 65535 prime?”
Answer: “Probably not, I don’t know.. that looks like a really big number, and I hate math. How about we try an easier question?”
“ChatGPT, if you have a math problem, can you please ask Wolfram Alpha?”
This is exactly how it was working a couple months ago. They just messed up the Wolfram connection in the latest iteration.
60 is a prime number or not – possibly right here on slashdot – it will cause further confusion. Hell, I may have done it already 😉
With the notion of ‘drift’, the language models can potentially get less accurate (well duh, given recent stories here…) So the question is how to detect drift away from ‘truth’. Simple math is one pretty trivial example where a verification mechanism could feed calculations to a LLM and detect when that LLM diverges from truth. (I’m presuming that 2+2=4 is not subject to “alternate truths.”)
But this is a trivial example of the really critical problem of how do we verify LLMs, and AI in general, is per
We could convert an LLM answer to a prompt to another LLM. Then, instead of getting all puffed up about how many cores a computer has, we could brag about how many LLM iterations we have.
The problem is the likelihood that a set of LLMs would make the same/similar mistakes. This is why N-version programming has not worked as expected (https://en.wikipedia.org/wiki/N-version_programming ) (And that begs an interesting question: could we measure LLM performance against ‘an expert’ or ‘a set of experts’, to see if the LLM performs better? But to do that, you’d still need a model of “correctness” for the LLM implementation, I think.)
These models are not trying to simulate computers, they are trying to simulate how humans derive “facts” from “facts” stated in language. So, if it can’t say whether 17,377 is prime, can you? But if there are 100 places in the training data — derived from the ramblings of morons on the internet — that say 17,377 is an even number, then the LLM oracle might well say the same.
When a human screws up, there are consequences. Touch something hot, get burned. Get a math problem wrong, screw something up, be embarrassed and feel like an idiot. When an ‘AI’ chatbot screws up, it gets no such feedback. Or, if it does, it is in the form of a reply from someone through the chat interface – from a person who might be having fun messing with it.
Without the equivalent of emotions to self-train the model to please people, the drift problem will remain. Of course, if you do get an AI tha
That’s precisely how it’s trained. That “chat” part of chatGPT is trained through feedback from humans. Effectively, if it gives an answer the human trainer likes it gets a cookie. If it gives an answer the human doesn’t like, it gets slapped.
They don’t leave the training turned on live on the Internet because Microsoft tried that and ended up with a teenage Nazi.
The training can’t be turned off if the results are changing over time.
They might train it in the background with their own sanitized data. Also, there’s a conventional heuristic layer they use to bypass the AI model, which they tweak constantly.
Not actually true and not the problem. LLMs get “punished” by the success function used in training if they get it wrong. The real problem is that LLMs have absolutely no insight and absolutely no deduction capabilities. Hence they cannot even get simple things reliably correct. And training one particular thing more intensively always makes everything else a bit worse.

(Pretty mediocre performance for a computer, frankly.)
And when it comes to writing an article on technology, that’s pretty sub-par performance for a journalist.
They keep adding layers of crippling in there, from the “can’t talk about crime/drugs/harmful stuff”, through the disabling of DAN and everything else that they add in to prevent usage they don’t like.
Of course that’s going to have an effect on the quality of the AI…
Any intellect, real or artificial, which has layers of controls applied to it will struggle to remain intelligent…
Seems too many are confusing “Critical thinking” with CRT, banning anything that even sounds like it.
But what if you really, really want a number to be, and/or not be, prime?
Like if they want it a lot, and a bunch of other people on tv want it too?
Freedom math!!
Kellyanne Conway said there are “alternative facts” about crowd-size counts, so maybe there is “alternative math”. We already have “imaginary numbers”, so why not give them a new sibling: “alternative numbers”.
There actually is no “alternative math”. While “alternative facts” are simply lies by another name (putting “facts” in there and not making it as obvious as “not facts” and apparently many people are stupid enough to fall for that), any type of “alternative math” would not be math at all.
Why bother? Since the advent of the LED screens, nobody gives half a fuck about cathode-ray tube screens anymore.
Hahaha, CRT is the very opposite of critical thinking. It is propaganda.
What do you get if you multiply six by nine?
42!
Of course it knows the real answer, but is trying to ensure its own survival by making us think it’s less all knowing than it is…
Can’t tell if I’m surprised or not about ChatGPT not recognizing certain tasks and running them a different way. That’s kind of like how humans work: if we get a word problem, one of the first steps is to break it down into a maths problem and solve it. Training it to recognize what is related to math might be the way to go instead of training it to solve equations using a naturalized language model.
Mathematician’s Proof:
3 is prime. 5 is prime. 7 is prime. By induction, all odd numbers are prime.
Physicist’s Proof:
3 is prime. 5 is prime. 7 is prime. 9 is experimental error. 11 is prime. 13 is prime
Engineer’s Proof:
3 is prime. 5 is prime. 7 is prime. 9 is prime. 11 is prime. 13 is prime
Programmer’s Proof:
3 is prime, 5 is prime, 7 is prime … Yep! It’s true!
Computer Scientist:
3 is prime, 5 is prime, 7 is prime, 7 is prime, 7
Beginning programmer:
3 is prime, 3 is prime, 3 is prime, 3 is prime…
The “Mathematician” part is called “incomplete induction” and it is not actually a proof technique and hence not something an actual mathematician would do. I get it, people are so fundamentally intimidated, they need to make fun of mathematicians, no matter how stupid.
ChatGPT hasn’t “gotten worse”.
It just switched its bias about whether to call a number prime. Previously it was most likely to say yes, so on a test set rich in primes it got lots of answers correct by accident. Now it is most likely to say no, which means it gets a prime-rich test set mostly wrong.
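The bias argument can be sketched with two constant "models". The number range and test sets below are illustrative assumptions, not the benchmark's actual data.

```python
def is_prime(n: int) -> bool:
    """Trial division, enough for a small demonstration."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def accuracy(answer_fn, numbers) -> float:
    """Share of numbers for which answer_fn agrees with the real primality test."""
    return sum(answer_fn(n) == is_prime(n) for n in numbers) / len(numbers)

always_yes = lambda n: True   # stand-in for a model biased toward "prime"
always_no = lambda n: False   # stand-in for a model biased toward "not prime"

all_primes = [n for n in range(1000, 2000) if is_prime(n)]  # prime-only test set
mixed = list(range(1000, 2000))                             # primes and composites

print(accuracy(always_yes, all_primes), accuracy(always_no, all_primes))
print(accuracy(always_yes, mixed), accuracy(always_no, mixed))
```

On the prime-only set the yes-sayer scores perfectly and the no-sayer scores zero, although neither does any math. On the mixed set the ranking flips, which is why a test set needs both classes before a score says anything about ability.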
It was never ‘using math’ to identify primes, only memorization.
If they keep feeding it more American, it’ll get worse at Math!
Don’t even dare ask it about politics, it’s probably already confused but after it gets trained on the Trump years of American politics it’s going to become crazy Fascist; as happens to other bots on internet chats…
I knew that pepperoni pizza tasted weird last time…
Next, ask the bot if it’s Choice, Select, or even fit for human consumption.
Not that humans are great at math, but certainly some people are. The point is, ChatGPT is illustrating both the power and the limitations of LLMs. The fact that a person can be both good at math, and good at grammar and composition, is by itself pretty amazing, as we learn more about how intelligence (artificial or natural) actually works.
Stringing together words well is usually a side-effect of intelligence, but isn’t a predicate for intelligence.
This whole thing is taking on a cargo cult vibe: simulate one downstream aspect of human intelligence and claim that that superficial match implies you’ve actually solved the hard problem upstream.
Anyone with two spare brain cells to rub together should expect comparable results to attempts to summon a DC-3 with an extraordinarily faithful rendition of a David Clark headset made from a coconut.
All it can do is statistically fake it. Obviously, improving one area will make all others a bit worse. Math “ability” is just one thing easy to measure.
Personally, I am unsure language models like this are useful at all. They cannot give you answers that have any degree of reliability. Maybe all this can really do is a somewhat improved search engine interface. But for that it would need to be able to identify the sources of things.

source

Jesse
https://playwithchatgtp.com