Apple recently published a paper arguing that current LLMs lack true logical reasoning abilities and merely mimic reasoning patterns from their training data. Your friendly AI Human, Amit Raja Naik, is back to tell you there’s more to the story.

Mehrdad Farajtabar, a senior author of the paper, explained on X how the team reached this conclusion. According to him, LLMs simply follow sophisticated patterns, and even models smaller than 3 billion parameters now hit benchmarks that only much larger models could reach earlier, specifically GSM8K, the grade-school maths benchmark OpenAI released three years ago. The researchers introduced GSM-Symbolic, a new tool for testing mathematical reasoning in LLMs, because GSM8K is a single fixed set of questions that may have leaked into training data, making it unreliable for testing the reasoning capabilities of LLMs. OpenAI’s o1 demonstrated “strong performance on various reasoning and knowledge-based benchmarks”, according to the researchers, but its accuracy dropped by 30% in the GSM-NoOp experiment, which involved adding irrelevant information to the questions.

This suggests that even OpenAI’s strongest models lean on pattern matching rather than genuine reasoning. Notably, the team did not put Apple’s own models to the test. Not everyone is happy with the research paper, either: critics point out that it never explains what “reasoning” actually means and simply introduces a new benchmark for evaluating LLMs.

The Tech World Goes Bonkers

“Overall, we found no evidence of formal reasoning in language models…their behaviour is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!” Farajtabar further said that scaling these models would just produce ‘better pattern machines’, not ‘better reasoners’.
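The GSM-Symbolic and GSM-NoOp setup described above can be sketched in a few lines. This is our own illustration, not the paper’s code: the template, names, and distractor clause are invented, but the mechanism is the same. The same word problem is regenerated with different names and numbers, and the NoOp variant appends an irrelevant clause that should never change the answer.

```python
import random

# Hypothetical GSM-Symbolic-style template: names and numbers vary per instance.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "How many apples does {name} have?")
# Hypothetical GSM-NoOp distractor: true-sounding but irrelevant to the answer.
NOOP_CLAUSE = " Five of the apples are slightly smaller than average."

def make_instance(noop=False, rng=random):
    """Generate one question/answer pair; noop=True appends the distractor."""
    name = rng.choice(["Liam", "Sophia", "Arjun"])
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=name, a=a, b=b)
    if noop:
        question += NOOP_CLAUSE  # irrelevant: the answer is still a + b
    return question, a + b
```

A robust reasoner’s accuracy should be identical across such variants; the paper reports sizeable drops instead, which is the basis of the pattern-matching claim.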
Some people have been claiming all along that LLMs cannot reason and are a detour on the road to AGI. Maybe Apple has finally accepted this after trying out LLMs in its own products, and this is possibly also one of the reasons it backed out of its investment in OpenAI. Many researchers have praised the paper and believe it is important for others to accept that LLMs cannot reason. Gary Marcus, a long-standing critic of LLMs, shared several examples of LLMs failing at reasoning tasks such as calculation and playing chess.

On the other hand, one objection is that Apple’s paper confuses reasoning with computation. “Reasoning is knowing an algorithm to solve a problem, not solving all of it in your head,” said Paras Chopra, an AI researcher, explaining that most LLMs know the approach to solving a problem even when they arrive at the wrong answer. According to him, knowing the approach is enough to judge whether an LLM is reasoning, even if the answer is wrong. Discussions on Hacker News suggest the Apple researchers were trying to do a “gotcha!” on the LLMs by including irrelevant information in some of the questions, which the LLMs could not actively filter out.
One commenter offered a different framing: “Reasoning is the progressive, iterative reduction of informational entropy in a knowledge domain. OpenAI’s o1-preview does that better by introducing iteration. It’s not perfect, but it does it.”

But is This True? Do LLMs Not Reason?

Subbarao Kambhampati, a computer science and AI professor at ASU, agreed that some claims of LLMs being capable of reasoning are exaggerated. However, he said that LLMs require additional tools to handle System 2 (reasoning) tasks, and that techniques like fine-tuning or chain-of-thought prompting are not adequate on their own. When OpenAI released o1, claiming that the model thinks and reasons, Clem Delangue, the CEO of Hugging Face, was not impressed. “Once again, an AI system is not ‘thinking’, it’s ‘processing’, ‘running predictions’,… just like Google or computers do,” said Delangue, arguing that OpenAI is painting a false picture of what its newest model can achieve. While some agreed, others argued that this is exactly how human brains work as well. “Once again, human minds aren’t ‘thinking’, they are just executing a complex series of bio-chemical/bio-electrical computing operations at massive scale,” replied Phillip Rhodes to Delangue.

To test reasoning, some people also ask LLMs how many Rs there are in the word ‘strawberry’, which is a poor test: LLMs cannot count letters directly because they process text in chunks called “tokens”, not individual characters. Tests for reasoning have been problematic for LLMs ever since they were created. Everyone seems to have strong opinions on LLMs. Some are grounded in research by experts such as Yann LeCun or Francois Chollet, who argue that LLM research should be taken more seriously, while others just follow the hype and criticise it. Some say LLMs are our ticket to AGI, while others think they are just glorified text-producing algorithms with a fancy name.
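The tokenisation point can be made concrete with a toy sketch. The vocabulary and splits below are invented for illustration (real tokenizers use learned BPE merges over tens of thousands of entries), but they show why the question is unfair: the model receives token IDs, not characters, so the Rs in “strawberry” are simply not visible to it.

```python
# Hypothetical toy vocabulary, not a real tokenizer's merge table.
TOY_VOCAB = {"str": 101, "aw": 102, "berry": 103}

def toy_tokenize(word):
    """Greedy longest-match split against the toy vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown: fall back to one character
            i += 1
    return tokens

pieces = toy_tokenize("strawberry")
print(pieces)                          # ['str', 'aw', 'berry']
print([TOY_VOCAB[p] for p in pieces])  # [101, 102, 103] -- all the model sees
print("strawberry".count("r"))         # 3 -- trivial at the character level
```

Counting the Rs is a one-liner at the character level, but from the model’s side the word is three opaque integers, which says little either way about reasoning.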
Meanwhile, Andrej Karpathy recently said that the next-token-prediction technique these LLMs, or Transformers, use might be able to solve many problems outside the realm where it is currently applied. So while it seems true to some extent that LLMs can reason, when put to the test, they often end up failing it. Enjoy the full story here.

Claude 3.5 Brushes Off Canvas with a Stroke of Code

Recently, OpenAI unveiled canvas, a new interface for working with ChatGPT on writing and coding projects. Many wonder if it’s better than Claude 3.5 Sonnet’s Artifacts. The answer is no. Here’s why: canvas uses GPT-4o, which is not better at coding than Claude 3.5 Sonnet. While canvas has some good features for developers, like user collaboration and version control, it lacks critical features like code preview. Read the full story here.

HCLTech recently announced that its GenAI suite ‘AI Force’ secured 25 clients, driving service transformation through integrations with Microsoft GitHub Copilot, AWS, and Google Cloud. It also outlined a three-pronged AI strategy fuelling innovation and a growing generative AI pipeline across verticals and geographies. Adobe unveiled GenStudio for performance marketing, integrating generative AI to streamline content creation, enhance personalised campaigns, and expand cross-platform activations with Meta, Google, Microsoft, Snap, and TikTok. OpenAI launched Swarm, an open-source framework for building multi-agent systems with lightweight coordination and granular control, alongside MLE-bench, a benchmark to evaluate AI agents on machine learning engineering tasks.

Fraud Detection and Prevention in the AI Era

LLMs are powerful tools for detecting financial fraud, but they are also creating new risks. Are your defences ready? Discover how to stay ahead with proactive strategies!
Join the NVIDIA AI Summit India from October 23–25, 2024, at the Jio World Convention Centre in Mumbai to explore AI innovations across generative AI, robotics, supercomputing, and more, with 70% of use cases addressing India's grand challenges. Don't miss the Fireside Chat with NVIDIA CEO Jensen Huang on October 24.

AIM is bringing Cypher 2024 to Silicon Valley! And we are all geared up, having just wrapped up the biggest AI conference in Bengaluru last month. Where: Santa Clara, California. When: November 21-22, 2024.

Enjoying Sector 6 (formerly AIM Daily XO)? Share it with colleagues or friends – they can sign up here. We love hearing from our readers! Have thoughts on our new format? Questions, comments, or ideas are always welcome. If there’s a specific topic in AI or analytics that you're curious about, tell us! Reach out to us at info@analyticsindiamag.com. Stay tuned for more insights in our next edition!
Curated with ♥️ in Namma Bengaluru