
Zhang Xiangyu's Multimodal AI Odyssey: Unraveling Anomalies and Forecasting GPT-4 Futures
SEN LIAN
This podcast episode features Zhang Xiangyu, Chief Scientist at Jiyue Xingchen, discussing the decade-long evolution of multimodal AI research. He shares insights into past struggles, current breakthroughs like the "o1" paradigm, and a puzzling observation about large language models' reasoning capabilities. The conversation also forecasts potential "GPT-4 moments" for multimodal AI, emphasizing online learning and multi-model collaboration.
The Decades-Long Journey of Multimodal AI Research
- Zhang Xiangyu recounts his 10-year involvement in the field, detailing early struggles and the shift from Computer Vision (CV) to Natural Language Processing (NLP) approaches.
- He was initially pessimistic that pure visual models could reach a "GPT moment," because they failed to integrate generation, understanding, and human alignment.
- A crucial turning point came after extensive, initially confusing attempts to unify image understanding and generation.
Anomalous Observations in Large Model Training
- A peculiar finding: while general dialogue, emotional intelligence, and knowledge all improve with model scale, reasoning ability (especially mathematical reasoning) peaks and then declines.
- This "strange phenomenon" is attributed to inherent limitations of the "next token prediction" objective and to "feature collapse" in larger models.
- Proposed solutions include incorporating Reinforcement Learning (RL) and the "o1" paradigm, which centers on Chain of Thought (CoT) patterns (Meta-CoT).
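The "next token prediction" objective mentioned above can be illustrated with a minimal sketch. A real LLM uses a neural network over a large vocabulary; here a hand-written bigram table stands in, and all tokens and probabilities are invented for illustration.

```python
# Toy autoregressive "next token prediction": at each step, pick the most
# probable next token given the previous one (greedy decoding).
# The bigram table below is a stand-in for a learned model.
bigram_probs = {
    "2+2": {"=": 0.9, "+": 0.1},
    "=": {"4": 0.8, "5": 0.2},
    "4": {"<eos>": 1.0},
}

def greedy_decode(prompt, max_steps=10):
    """Repeatedly append the most probable next token until <eos>."""
    tokens = [prompt]
    for _ in range(max_steps):
        dist = bigram_probs.get(tokens[-1])
        if dist is None:
            break
        next_token = max(dist, key=dist.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

print(greedy_decode("2+2"))  # → ['2+2', '=', '4']
```

The point of the episode's critique is that this purely local objective, scaled up, does not by itself guarantee ever-better multi-step reasoning.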
New Paradigms and Progress in Multimodal Understanding
- The "o1" paradigm's essence is "Meta-CoT," which generalizes reasoning patterns across domains, not just specific tasks.
- Current research focuses on "visual understanding" through "visual space Long CoT," addressing the poor controllability of visual generation.
- Fusing generation and understanding is difficult without Chain of Thought (CoT) as a critical bridging component.
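To make the CoT idea concrete, here is a minimal, hypothetical sketch of Chain-of-Thought prompting: the model is instructed to emit intermediate reasoning steps before a final answer, which is then parsed out. The prompt wording and answer format are illustrative assumptions, not a specific product's API.

```python
# Hedged sketch of CoT prompting: wrap the question in a "think step by step"
# instruction, then extract the line beginning with "Answer:".
def build_cot_prompt(question: str) -> str:
    """Wrap a question in a CoT-style instruction (illustrative format)."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then give the final answer "
        "on a line starting with 'Answer:'."
    )

def extract_answer(completion: str) -> str:
    """Pull the final answer out of a CoT-style completion."""
    for line in completion.splitlines():
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return completion.strip()

# A canned completion stands in for an actual model call:
completion = "Step 1: 17 x 3 = 51.\nStep 2: 51 + 9 = 60.\nAnswer: 60"
print(extract_answer(completion))  # → 60
```

"Meta-CoT," as described in the episode, generalizes this pattern itself across domains rather than hand-crafting prompts per task.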
Forecasting Future "GPT-4 Moments" in Multimodal AI
- Key areas for the next "GPT-4 moment" include advances in long context processing and multi-model collaboration.
- The most significant predicted "GPT-4 moment" is the advent of online learning/self-learning capabilities in models.
- The discussion also touches upon the role of Agents and the human "world model" as a critical aspect of intelligence.
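The "online learning" capability forecast above contrasts with today's train-once-then-freeze models: an online learner updates after every new example. A toy one-dimensional sketch (all data invented for illustration):

```python
# Minimal online learning sketch: fit y ≈ w*x one example at a time with
# stochastic gradient descent, updating on each streamed (x, y) pair.
def online_sgd(stream, lr=0.1):
    """Update weight w after every example in the stream."""
    w = 0.0
    for x, y in stream:
        pred = w * x
        grad = 2 * (pred - y) * x  # d/dw of squared error (pred - y)**2
        w -= lr * grad
    return w

# Stream where the true relation is y = 2x; w moves toward 2 as data arrives.
stream = [(1.0, 2.0), (2.0, 4.0)] * 10
print(online_sgd(stream))
```

Scaling this idea to large multimodal models, so they keep improving from their own interactions, is what the episode frames as the most significant future "GPT-4 moment."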