News Flash, Buddy
Apple Intelligence, the company’s recent stab at AI, has largely been a flop. Most notably, its news summaries drew such widespread criticism for reporting inaccurate information and mangling headlines that Apple halted the feature entirely this week until it can be fixed.
None of this should come as a surprise. Every large language model suffers from this kind of AI “hallucination” problem, which no one has managed to solve, if it can be solved at all. But releasing its own AI model looks especially reckless when you consider that Apple’s own engineers had warned about the technology’s gaping deficiencies.
That warning came in a study published in October of last year. The yet-to-be-peer-reviewed work tested the mathematical “reasoning” of some of the industry’s top LLMs and added to the growing consensus that AI models don’t actually reason.
Instead, the researchers concluded, the models attempt to replicate the reasoning steps they observed in their training data.
Math Is Hard
To test the AI models, the researchers had them attempt thousands of math problems from the widely used GSM8K benchmark dataset. An example question: “James purchases five 4-pound packs of beef. Beef costs $5.50 per pound. How much did he pay?” Some questions are a little trickier, but nothing a well-educated middle schooler can’t figure out.
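(For the record, the arithmetic works out to 5 packs × 4 pounds = 20 pounds of beef, and 20 pounds × $5.50 = $110.)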
The way the researchers exposed the models’ gaps was shockingly easy: they simply changed the numbers in the questions. Because the AIs can’t have seen the altered problems in their training data, this guards against data contamination without actually making the questions any harder.
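To make that concrete, here’s a minimal sketch of the kind of perturbation being described, assuming a hand-written question template; the names, value ranges, and structure are illustrative, not the study’s actual code:

```python
import random

# Illustrative sketch of the perturbation idea: turn a concrete
# GSM8K-style question into a template, then sample fresh numbers so the
# exact wording can't have appeared in any model's training set.
TEMPLATE = (
    "James purchases {packs} {weight}-pound packs of beef. "
    "Beef costs ${price:.2f} per pound. How much did he pay?"
)

def make_variant(rng: random.Random) -> tuple[str, float]:
    packs = rng.randint(2, 9)            # swap in new numbers...
    weight = rng.randint(2, 6)
    price = rng.choice([4.50, 5.50, 6.25, 7.00])
    question = TEMPLATE.format(packs=packs, weight=weight, price=price)
    answer = packs * weight * price      # ...and recompute the ground truth
    return question, answer

question, answer = make_variant(random.Random(42))
print(question)
print(f"Expected answer: ${answer:.2f}")
```

A model that genuinely reasons should be indifferent to which values get sampled; a sharp accuracy drop on the fresh variants is the tell that it was matching memorized surface patterns instead.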
That change alone caused a small but notable drop in accuracy across all 20 LLMs tested. And when the researchers went a step further, changing the names and adding extraneous details, such as noting that a few of the fruits in a fruit-counting question were “smaller than usual,” the performance drop was, in their own words, “catastrophic”: as high as 65 percent.
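The irrelevant-detail trick can be sketched the same way; again, this is a hypothetical illustration in the spirit of the study, not its code:

```python
def add_distractor(question: str) -> str:
    # The appended clause introduces a detail that sounds like it matters
    # but leaves the arithmetic untouched, so the correct answer is
    # unchanged. A pattern-matching model will often try to "use" the
    # extra detail anyway, e.g. by discounting the smaller packs.
    return question + " Two of the packs are slightly smaller than usual."
```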
The exact drop varied from model to model, but even OpenAI’s o1-preview, the strongest of the bunch, fell by 17.5 percent. (Its predecessor, GPT-4o, dropped by 32 percent.)
Copy Cat
The takeaway, then, is harsh.
“This reveals a critical flaw in the models’ ability to discern relevant information for problem-solving, likely because their reasoning is not formal in the common sense term and is mostly based on pattern matching,” the researchers wrote.
In other words, AI is great at appearing smart, and will often arrive at the right answer, but it falls apart the moment it can’t copy someone else’s homework verbatim.
That should seriously shake your trust in an AI model tasked with rewording headlines, swapping out words without truly understanding how those swaps change the meaning of the text. Yet Apple released its own model anyway, fully aware of the glaring flaws observed in every LLM to date. Which, to be fair, is how the AI industry as a whole operates.
Read more about AI: A terrible new startup uses AI agents to post a ton of negative reviews of its clients’ products on Reddit.