
Zhang Xiangyu's Multimodal AI Odyssey: Unraveling Anomalies and Forecasting GPT-4 Futures
SEN LIAN
Mia: Okay, so get this. There's this absolutely wild, completely backwards idea popping up in the AI world, right? Because you'd naturally think, "Bigger model, smarter everything, duh!" Wouldn't you?
Mars: Right? That's the playbook! More data, more parameters, all the good stuff, performance goes through the roof. But oh boy, does this story take a hard left turn. Turns out there's this enormous, head-scratching exception to that whole rule, a genuine enigma that folks like Zhang Xiangyu have been wrestling with for years.
Mia: So, let's rewind a bit, because this wasn't some aha! moment over a coffee break. What was it actually like for these researchers, especially trying to get vision models to pull off the same kind of mind-blowing magic we see with language models like GPT?
Mars: Oh, it was a proper grind, like a decade-long slog. The big dream at first was this beautiful multimodal AI, merging vision and language. But then, *thwack!* They slammed right into a brick wall. Language models, they just waltzed in, unifying generation, understanding, and even human alignment into this gorgeous, seamless package. But the visual world? Total chaos! Static images were just this fragmented mess. You couldn't get *one* vision model to do *any* of those things well. It was honestly a time of just profound, head-spinning confusion.
Mia: Wow, that sounds like a journey filled with more twists and turns than a pretzel. And speaking of unexpected, that leads us right into this absolutely mind-bending phenomenon they stumbled upon during large model training. Lay it on us, what happened?
Mars: Okay, so this is where it gets really bonkers. As they started scaling up these monstrous models, in some areas, everything was exactly as advertised: conversational ability, emotional intelligence, general knowledge – all shot up. But then, in this truly bizarre turn of events, their critical reasoning skills, especially in math, would improve, hit a plateau, and then, get this, actually start getting *worse*.
Mia: Hold on, hold on. They got *dumber* at math the bigger they became? That's just... that defies all logic! Once they spotted this utterly baffling paradox, what was the deep dive into figuring out *why* on earth this was happening?
Mars: So, the culprit was eventually identified as a fundamental, inherent flaw in the next-token prediction game. See, the model gets so laser-focused on just guessing the *very next word* in a sequence that it completely loses its grip on holding a complex, multi-step line of reasoning. It's like that super smart kid who can memorize every formula under the sun but then totally blanks when you ask them to actually *solve* a real-world problem from beginning to end.
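To make that concrete, here is a minimal sketch (not Zhang Xiangyu's actual training code, and the names and shapes are illustrative assumptions) of the standard next-token prediction objective. Notice that each position is rewarded only for guessing the single token that follows it; nothing in the loss explicitly tracks whether a multi-step argument stays coherent.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between position t's prediction and the token at t+1.

    logits: (batch, seq_len, vocab_size) -- the model's per-position predictions
    tokens: (batch, seq_len)             -- the input token ids
    """
    # Each position only has to predict the one token that comes next;
    # the objective never looks further ahead than a single step.
    pred = logits[:, :-1, :]   # predictions for positions 0 .. T-2
    target = tokens[:, 1:]     # the "next" token at each of those positions
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),
        target.reshape(-1),
    )
```

As long as each individual token is locally plausible, this loss is perfectly happy, even if the overall chain of reasoning has quietly drifted off course.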
Mia: Okay, so how on earth do you even begin to fix something like that? How do you teach a model to stop playing the "guess the next word" game and actually, you know, *think*?
Mars: Well, it was a pretty radical shift in philosophy, moving towards what they've dubbed the O1 paradigm. Basically, it's all about teaching the model a *pattern* of thinking, what we now famously call Chain of Thought. Instead of just spitting out an answer, the model learns to lay out its reasoning step-by-step, exactly like when your math teacher made you show your work. It's about making the *process*, the actual *pattern* of thought, the absolute priority.
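For readers who want to see what that "show your work" idea looks like in practice, here is a small, hypothetical sketch of the difference between a direct-answer prompt and a chain-of-thought prompt. The `ask_model` helper and the exact prompt wording are assumptions for illustration, not anything described in the conversation.

```python
# Illustrative only: contrasting a direct-answer prompt with a
# chain-of-thought (CoT) prompt. `ask_model` is a hypothetical stand-in
# for whatever inference API you happen to use.

QUESTION = "A train travels 120 km in 1.5 hours. What is its average speed?"

direct_prompt = f"{QUESTION}\nAnswer:"

cot_prompt = (
    f"{QUESTION}\n"
    "Let's think step by step, writing out each part of the reasoning "
    "before giving the final answer."
)

def ask_model(prompt: str) -> str:
    # Placeholder for a call to your model of choice.
    raise NotImplementedError

# With the CoT prompt, the model is steered to emit intermediate steps
# ("distance = 120 km, time = 1.5 h, speed = 120 / 1.5 = 80 km/h")
# instead of jumping straight to a one-shot guess.
```

The point is exactly the one Mars makes: the intermediate steps themselves become the target, so the process of reasoning, not just the final token, is what the model is trained to produce.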
Mia: That sounds like it has some seriously profound implications, doesn't it? So, how have these wild new understandings completely flipped the script on multimodal research, and what's peeking over the horizon for us?
Mars: Oh, it's not just changed the game, it's like they invented a whole new sport! Now they're taking these Chain of Thought principles and slapping them onto vision, creating stuff like visual Long CoT. But the *real* GPT-4 moments we're looking at on the horizon? Think multi-model collaboration, and even more mind-blowing, online or autonomous learning—AI that can just keep soaking up new information without needing a full-blown brain transplant every five minutes.
Mia: So, when you zoom out, what's the absolute endgame here? Where are we actually headed with all of this as we squint our eyes towards 2025 and beyond?
Mars: The ultimate, ultimate goal is to finally move past just simple prediction and genuinely help AI develop its own *internal world model*. I mean, think about it—we humans, we don't have some special brain bit just for spitting out text or generating images, right? But we've got this incredibly deep, intuitive understanding of how the world just *works*. That, my friend, is the next frontier. This whole wild journey has been nothing short of an odyssey, not just to build bigger models, but to unravel these weird, fundamental anomalies and finally, truly teach AI how to *learn*.