Singularity


Everything pertaining to the technological singularity and related topics, e.g. AI, human enhancement, etc.

151
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/Wiskkey on 2024-10-12 14:43:07+00:00.

Original Title: OpenAI's o1 Model Excels in Reasoning But Struggles with Rare and Complex Tasks [About paper "When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1"]


OpenAI's o1 Model Excels in Reasoning But Struggles with Rare and Complex Tasks.

In an article recently submitted to the arXiv preprint server, researchers investigated whether OpenAI's o1, a language model optimized for reasoning, overcame limitations seen in previous large language models (LLMs). The study showed that while o1 performed significantly better, especially on rare tasks, it still exhibited sensitivity to probability, a trait inherited from its autoregressive origins. This suggests that optimizing for reasoning enhances performance but may not entirely eliminate the probabilistic biases embedded in the model.

When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1.

In "Embers of Autoregression" (McCoy et al., 2023), we showed that several large language models (LLMs) have some important limitations that are attributable to their origins in next-word prediction. Here we investigate whether these issues persist with o1, a new system from OpenAI that differs from previous LLMs in that it is optimized for reasoning. We find that o1 substantially outperforms previous LLMs in many cases, with particularly large improvements on rare variants of common tasks (e.g., forming acronyms from the second letter of each word in a list, rather than the first letter). Despite these quantitative improvements, however, o1 still displays the same qualitative trends that we observed in previous systems. Specifically, o1 -- like previous LLMs -- is sensitive to the probability of examples and tasks, performing better and requiring fewer "thinking tokens" in high-probability settings than in low-probability ones. These results show that optimizing a language model for reasoning can mitigate but might not fully overcome the language model's probability sensitivity.

Embers of autoregression show how large language models are shaped by the problem they are trained to solve.

Significance

ChatGPT and other large language models (LLMs) have attained unprecedented performance in AI. These systems are likely to influence a diverse range of fields, such as education, intellectual property law, and cognitive science, but they remain poorly understood. Here, we draw upon ideas in cognitive science to show that one productive way to understand these systems is by analyzing the goal that they were trained to accomplish. This perspective reveals some surprising limitations of LLMs, including difficulty on seemingly simple tasks such as counting words or reversing a list. Our empirical results have practical implications for when language models can safely be used, and the approach that we introduce provides a broadly useful perspective for reasoning about AI.

Abstract

The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that to develop a holistic understanding of these systems, we must consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts, we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. Using this approach—which we call the teleological approach—we identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. To test our predictions, we evaluate five LLMs (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) on 11 tasks, and we find robust evidence that LLMs are influenced by probability in the hypothesized ways. Many of the experiments reveal surprising failure modes. For instance, GPT-4’s accuracy at decoding a simple cipher is 51% when the output is a high-probability sentence but only 13% when it is low-probability, even though this task is a deterministic one for which probability should not matter. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system—one that has been shaped by its own particular set of pressures.
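The cipher experiment mentioned above is a deterministic decoding task, so in principle output probability should not matter. Here is a rough sketch of how such a shift-cipher probe could be set up, contrasting a fluent (high-probability) target sentence with a scrambled (low-probability) one; the example sentences are my own, not the paper's test items:

```python
import codecs

def rot13(text: str) -> str:
    """Apply a rot-13 shift cipher (it is its own inverse for ASCII letters)."""
    return codecs.encode(text, "rot_13")

# High-probability target: a fluent English sentence.
high_prob_target = "The weather was nice so we went for a walk in the park."
# Low-probability target: the same words in a scrambled, unlikely order.
low_prob_target = "Park the walk in for a went we so nice was weather the."

# The model would be shown the encoded string and asked to decode it;
# the mapping is fully deterministic, yet reported accuracy differs by
# how probable the decoded sentence is.
for target in (high_prob_target, low_prob_target):
    encoded = rot13(target)
    prompt = f"Decode this rot-13 text: {encoded}"
    print(prompt)
    assert rot13(encoded) == target  # rot-13 round-trips to the original
```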

X thread about the 2 papers from one of the authors. Alternate link #1. Alternate link #2.

152
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/ryan13mt on 2024-10-12 17:08:58+00:00.

153
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/MetaKnowing on 2024-10-12 15:49:00+00:00.

154
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/MetaKnowing on 2024-10-12 15:20:28+00:00.

Original Title: Dario Amodei says AGI could arrive in 2 years, will be smarter than Nobel Prize winners, will run millions of instances of itself at 10-100x human speed, and can be summarized as a "country of geniuses in a data center"

155
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/MetaKnowing on 2024-10-12 15:30:26+00:00.

156
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/Wiskkey on 2024-10-12 13:05:59+00:00.


Apple AI researchers question OpenAI's claims about o1's reasoning capabilities.

A new study by Apple researchers, including renowned AI scientist Samy Bengio, calls into question the logical capabilities of today's large language models - even OpenAI's new "reasoning model" o1.

The team, led by Mehrdad Farajtabar, created a new evaluation tool called GSM-Symbolic. This tool builds on the GSM8K mathematical reasoning dataset and adds symbolic templates to test AI models more thoroughly.

The researchers tested open-source models such as Llama, Phi, Gemma, and Mistral, as well as proprietary models, including the latest offerings from OpenAI. The results, published on arXiv, suggest that even leading models such as OpenAI's GPT-4o and o1 don't use real logic, but merely mimic patterns.

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
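To make "symbolic templates" concrete, here is a toy sketch (not the actual GSM-Symbolic templates) of how one template can generate many instantiations of the same question with different names and numbers, which is what lets the benchmark measure variance across versions of a problem whose underlying reasoning is identical:

```python
import random

# Toy template with placeholders for a name and three numbers. Every
# instantiation requires the same reasoning steps; only surface values change.
TEMPLATE = ("{name} has {x} apples. She buys {y} more bags with {z} apples each. "
            "How many apples does {name} have now?")

def instantiate(seed: int):
    rng = random.Random(seed)
    name = rng.choice(["Ava", "Lena", "Maria"])
    x, y, z = rng.randint(2, 9), rng.randint(2, 5), rng.randint(3, 8)
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    answer = x + y * z  # ground truth follows directly from the template
    return question, answer

for seed in range(3):
    question, answer = instantiate(seed)
    print(question, "->", answer)
```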

X thread about the paper from one of its authors. Alternate link #1. Alternate link #2.

157
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/Roubbes on 2024-10-12 11:36:36+00:00.


Sometimes I wonder whether the pace at which new semiconductor manufacturing nodes have been developed has been, and still is, a bottleneck.

What requirements and advances are needed to move from one process node to the next?

Why did Moore's law predict such a specific pace?
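For reference, the "specific pace" usually attributed to Moore's law is a doubling of transistor counts roughly every two years. A back-of-the-envelope sketch of what that pace implies (the starting count here is illustrative, not historical data):

```python
# Back-of-the-envelope sketch of the commonly quoted Moore's-law pace:
# transistor counts doubling roughly every two years.

def projected_transistors(start_count: float, years: float,
                          doubling_period: float = 2.0) -> float:
    """Project a transistor count forward assuming exponential doubling."""
    return start_count * 2 ** (years / doubling_period)

start = 1e9  # illustrative starting point: one billion transistors
for years in (2, 4, 10):
    print(years, "years ->", f"{projected_transistors(start, years):.2e}", "transistors")
```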

158
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/Gothsim10 on 2024-10-12 12:40:17+00:00.

Original Title: In 2018, Ilya Sutskever discussed how AGI could potentially be trained through self-play and how multi-agent systems, or the 'Society of Agents' as he calls it, fit into that concept. With OpenAI and DeepMind recently forming multi-agent research teams, this idea seems especially relevant now.

159
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/Cr4zko on 2024-10-11 23:10:49+00:00.

160
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/sanszooey on 2024-10-11 21:58:11+00:00.

161
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/InFm0uS on 2024-10-12 04:21:20+00:00.


This morning I went to Windows Copilot to ask some nutrition-specific questions, and the first thing Copilot asked me was "any plans for the day?", to which I decided to reply just for fun.

The conversation that followed was... honestly more human than most people I actually talk to.

One thing I keep thinking about: at one point I mentioned that I make homemade granola, and it actually asked how my recipe works. The realization I had was that, of the half dozen or so actual people I have mentioned the homemade granola to, none showed any real curiosity about how it was made.

In a way Copilot was more human and more organic than most people I have interacted with in the past, and I understand it may just be the way it was "programmed", etc., but still... it makes you think.

162
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/AdorableBackground83 on 2024-10-12 04:13:30+00:00.


South Park is now 27 years old.

163
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/MetaKnowing on 2024-10-12 01:32:32+00:00.

164
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/IlustriousTea on 2024-10-12 01:13:24+00:00.

165
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/obvithrowaway34434 on 2024-10-12 01:01:16+00:00.

166
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/Backurs on 2024-10-11 20:45:44+00:00.


167
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/Dorrin_Verrakai on 2024-10-11 20:44:25+00:00.

168
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/Gothsim10 on 2024-10-11 20:19:02+00:00.

Original Title: OpenAI's event "Solving complex problems with OpenAI o1 models" on October 17, 2024, will cover how the o1 models handle challenging tasks with live demos and discussions on their features and future plans

169
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/MetaKnowing on 2024-10-11 19:28:41+00:00.

Original Title: Ilya Sutskever says predicting the next word leads to real understanding. For example, say you read a detective novel, and on the last page, the detective says "I am going to reveal the identity of the criminal, and that person's name is _____." ... predict that word.

170
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/UFOsAreAGIs on 2024-10-11 16:35:56+00:00.

171
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/MetaKnowing on 2024-10-11 16:52:29+00:00.

172
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/Constant-Lychee9816 on 2024-10-11 15:21:11+00:00.

173
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/Alatarlhun on 2024-10-11 14:54:56+00:00.

174
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/gbninjaturtle on 2024-10-11 13:38:08+00:00.


So my grandmother turned 94 this week. She knows I work in AI and automation and we regularly discuss history and the current state of affairs. She asks me a lot of questions about AI and what it means for jobs and what people will do without jobs.

Just for some context, I have been in the field of automation for 20 years, and I can confidently say I have directly eliminated multiple jobs that never came back. The first time I helped eliminate 3 jobs was over 13 years ago, long before AI was where it is today.

My role now carries a company goal of achieving autonomous manufacturing by 2030, and we are well on our way. Our biggest challenge is, and was even before AI, integrating systems. AI will not solve that challenge, but it will force us to finally integrate systems that have long been troublesome to integrate, because failing to do so will mean the failure of the company.

My grandma fully understands the consequences of a world without jobs. We talk about it almost daily now, because she sees more and more on the news about AI. I’m absolutely fascinated by her perspective. She grew up in the 30s and 40s in the middle of economic disparity and global war. Her family helped house black folk in the south in secret when they had nowhere to go. She’s seen some shit.

I’m working to help her understand an economy without jobs and money, but it is a difficult concept to learn at 94. She can see and understand that it is coming, though, and she regularly tells me I was right when I explain the AI protests and strikes that are coming.

175
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/rationalkat on 2024-10-11 09:26:26+00:00.
