    Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

    By News Room | October 15, 2024 | 4 Mins Read

    For a while now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.

    The fragility highlighted in these new results helps support previous research suggesting that LLMs’ use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these results. “Instead, they attempt to replicate the reasoning steps observed in their training data.”

    Mix It Up

    In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models”—currently available as a preprint paper—the six Apple researchers start with GSM8K’s standardized set of more than 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs’ complex reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values—so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
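
    The mechanics of that perturbation are simple enough to sketch. Below is a minimal, hypothetical Python illustration of the kind of templated name-and-number substitution the paper describes; the template, the name pool, and the value ranges are invented for this example and are not the researchers' actual GSM-Symbolic templates.

        import random

        # Hypothetical template and value pools, for illustration only; the
        # paper's real GSM-Symbolic templates are not reproduced here.
        TEMPLATE = (
            "{name} buys {blocks} building blocks for a {relative}. "
            "Each block costs ${price}. How much does {name} spend in total?"
        )
        NAMES = ["Sophie", "Bill", "Priya", "Marcus"]
        RELATIVES = ["nephew", "brother", "niece", "cousin"]

        def sample_instance(rng: random.Random) -> tuple[str, int]:
            """Generate one perturbed problem plus its ground-truth answer."""
            blocks = rng.randint(5, 40)      # swap the numbers...
            price = rng.randint(1, 9)
            name = rng.choice(NAMES)         # ...and the names
            relative = rng.choice(RELATIVES)
            question = TEMPLATE.format(
                name=name, blocks=blocks, relative=relative, price=price
            )
            answer = blocks * price          # the reasoning steps never change
            return question, answer

        rng = random.Random(0)
        for _ in range(3):
            question, answer = sample_instance(rng)
            print(question, "->", answer)

    Because only the surface details vary, every generated instance requires the same chain of reasoning, which is exactly why the researchers expected model accuracy to hold steady.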

    This approach helps avoid any potential “data contamination” that can result from the static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as GSM8K.

    Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.

    This kind of variance—both within different GSM-Symbolic runs and compared to GSM8K results—is more than a little surprising since, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead “attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”

    Don’t Get Distracted

    Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That’s a pretty high success rate using either benchmark, regardless of whether or not the model itself is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).

    The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” benchmark set (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”
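
    To make that manipulation concrete, here is a minimal, hypothetical Python sketch of how such a no-op statement might be spliced into a problem. The base question and the distractor pool are invented for illustration (only the "smaller than average" phrasing echoes the paper's kiwi example), and by design the inserted detail leaves the correct answer untouched.

        import random

        # A hypothetical sketch of a GSM-NoOp-style edit: append a "seemingly
        # relevant but ultimately inconsequential" statement to a word problem.
        # The added detail mentions the same objects but never changes the
        # arithmetic, so the ground-truth answer stays the same.
        DISTRACTORS = [
            "Five of them were a bit smaller than average.",
            "Some of them were picked before it started raining.",
        ]

        def add_noop_statement(question: str, rng: random.Random) -> str:
            """Insert an irrelevant sentence just before the final question."""
            sentences = question.split(". ")
            distractor = rng.choice(DISTRACTORS).rstrip(".")
            return ". ".join(sentences[:-1] + [distractor, sentences[-1]])

        rng = random.Random(0)
        base = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
                "How many kiwis does Oliver have in total?")
        print(add_noop_statement(base, rng))

    A model that genuinely parsed the problem would simply ignore the extra sentence; one that pattern-matches surface cues may try to "use" it, which is what the results below suggest.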

    Adding in these red herrings led to what the researchers termed “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits in using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers write.
