Technology Mag

    Meet The AI Agent With Multiple Personalities

    By News Room | April 17, 2025 | 3 Mins Read

    In the coming years, agents are widely expected to take over more and more chores on behalf of humans, including using computers and smartphones. For now, though, they’re too error-prone to be of much use.

    A new agent called S2, created by the startup Simular AI, combines frontier models with models specialized for using computers. The agent achieves state-of-the-art performance on tasks like using apps and manipulating files—and suggests that turning to different models in different situations may help agents advance.

    “Computer-using agents are different from large language models and different from coding,” says Ang Li, cofounder and CEO of Simular. “It’s a different type of problem.”

    In Simular’s approach, a powerful general-purpose AI model, like OpenAI’s GPT-4o or Anthropic’s Claude 3.7, is used to reason about how best to complete the task at hand—while smaller open source models step in for tasks like interpreting web pages.

    Li, who was a researcher at Google DeepMind before founding Simular in 2023, explains that large language models excel at planning but aren’t as good at recognizing the elements of a graphical user interface.
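
The division of labor described above can be pictured with a small routing sketch. This is only an illustration under stated assumptions, not Simular's implementation: the `Step` type, the `make_router` helper, and the two stand-in model functions are hypothetical names invented here, and the real models are replaced by plain Python functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Step:
    """One unit of work for the agent (hypothetical structure)."""
    kind: str      # "plan" for reasoning, "ground" for reading the interface
    payload: str   # task description or a description of the screen


def make_router(
    planner: Callable[[str], str],
    grounder: Callable[[str], str],
) -> Callable[[Step], str]:
    """Return a function that sends each step to the model suited to it."""
    handlers: Dict[str, Callable[[str], str]] = {
        "plan": planner,     # large general-purpose model (assumed role)
        "ground": grounder,  # smaller specialized model (assumed role)
    }

    def route(step: Step) -> str:
        if step.kind not in handlers:
            raise ValueError(f"unknown step kind: {step.kind}")
        return handlers[step.kind](step.payload)

    return route


# Stand-ins for real models, used only so the sketch runs on its own.
def fake_planner(task: str) -> str:
    return f"plan for: {task}"


def fake_grounder(screen: str) -> str:
    return f"elements in: {screen}"


route = make_router(fake_planner, fake_grounder)
```

Calling `route(Step("plan", "book a flight"))` would hand the task to the planner, while a `"ground"` step would go to the interface model.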

    S2 is designed to learn from experience with an external memory module that records actions and user feedback and uses those recordings to improve future actions.
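
A memory of this kind might look roughly like the sketch below. It is a guess at the general idea rather than a description of S2's module: the `Episode` and `ExperienceMemory` names, the "good"/"bad" feedback labels, and the word-overlap lookup are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Episode:
    """A recorded attempt at a task (hypothetical structure)."""
    task: str
    actions: List[str]
    feedback: str  # assumed labels: "good" or "bad"


@dataclass
class ExperienceMemory:
    """Stores past episodes and recalls actions from similar, well-rated ones."""
    episodes: List[Episode] = field(default_factory=list)

    def record(self, task: str, actions: List[str], feedback: str) -> None:
        """Save the actions taken for a task along with the user's feedback."""
        self.episodes.append(Episode(task, list(actions), feedback))

    def recall(self, task: str) -> Optional[List[str]]:
        """Return actions from the well-rated episode sharing the most words
        with the new task, or None if nothing overlaps."""
        query = set(task.lower().split())
        best: Optional[Episode] = None
        best_score = 0
        for episode in self.episodes:
            if episode.feedback != "good":
                continue  # skip attempts the user rated poorly
            score = len(query & set(episode.task.lower().split()))
            if score > best_score:
                best, best_score = episode, score
        return list(best.actions) if best else None
```

A real system would likely use a richer similarity measure than shared words; the simple overlap here just keeps the example self-contained.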

    On particularly complex tasks, S2 performs better than any other model on OSWorld, a benchmark that measures an agent’s ability to use a computer operating system.

    For example, S2 can complete 34.5 percent of tasks that involve 50 steps, beating OpenAI’s Operator, which can complete 32 percent. Similarly, S2 scores 50 percent on AndroidWorld, a benchmark for smartphone-using agents, while the next best agent scores 46 percent.

    Victor Zhong, a computer scientist at the University of Waterloo in Canada and one of the creators of OSWorld, believes that future big AI models may incorporate training data that helps them understand the visual world and make sense of graphical user interfaces.

    “This will help agents navigate GUIs with much higher precision,” Zhong says. “I think in the meantime, before such fundamental breakthroughs, state-of-the-art systems will resemble Simular in that they combine multiple models to patch the limitations of single models.”

    To prepare for this column, I used Simular to book flights and scour Amazon for deals, and it seemed better than some of the open source agents I tried last year, including AutoGen and vimGPT.

    But even the smartest AI agents are, it seems, still troubled by edge cases and occasionally exhibit odd behavior. In one instance, when I asked S2 to help find contact information for the researchers behind OSWorld, the agent got stuck in a loop hopping between the project page and the login for OSWorld’s Discord.

    OSWorld’s benchmarks show why agents remain more hype than reality for now. While humans can complete 72 percent of OSWorld tasks, agents are foiled 38 percent of the time on complex tasks. That said, when the benchmark was introduced in April 2024, the best agent could complete only 12 percent of the tasks.

    © 2025 Technology Mag. All Rights Reserved.
