Video Foundation Models Disrupt Reality

**Bottom Line Up Front:** Video foundation models such as Google DeepMind's Veo 3 are demonstrating rapid, emergent zero-shot capabilities across perception, modeling, manipulation, and visual reasoning tasks. This indicates a trajectory toward general-purpose visual intelligence that will disrupt computer vision and create significant novel risks in misinformation, autonomous systems, and cybersecurity within 12-24 months.

**Threat Identification:** We are facing the emergence of generalist visual foundation models. The core finding is that models trained solely on a generative video objective (predicting the next frame) develop broad capabilities without task-specific training [1, p. 1]. These include edge detection, segmentation, intuitive physics modeling (e.g., buoyancy, rigidity), 3D-aware image editing, and, crucially, visual reasoning tasks such as maze solving, graph traversal, and rule extrapolation via a "chain-of-frames" (CoF) process analogous to chain-of-thought in LLMs [1, pp. 3-4, 9]. This represents a paradigm shift from bespoke vision models to a single, multi-capability system.

**Probability Assessment:** HIGH probability of rapid capability scaling. The performance leap from Veo 2 (Dec 2024) to Veo 3 (July 2025) is substantial and consistent across tasks (e.g., the pass rate on 5x5 mazes improved from 14% to 78%) [1, pp. 2, 7]. Given the historical precedent of LLMs and the authors' assertion that current performance is a "lower bound" because prompt and frame engineering remain immature [1, p. 9], we assess that models 12 months from now will achieve robust, reliable performance on these zero-shot tasks. The economic drive toward generalist models ensures continued investment and scaling.

**Impact Analysis:** The consequences are systemic. This technology will democratize high-fidelity visual content creation and manipulation, drastically lowering the barrier to generating hyper-realistic misinformation and synthetic media for influence operations.
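To make the cited maze-solving numbers concrete, the sketch below shows one hypothetical way such a benchmark could be scored: a model's generated frames are assumed to have already been parsed into a sequence of grid moves, and a checker verifies whether the path legally reaches the goal. This is an illustrative harness under stated assumptions, not the paper's actual evaluation code; the function names and move encoding are invented for this example.

```python
# Hypothetical scoring harness for zero-shot maze solving on an N x N grid.
# Assumption: a model's output video has been parsed into moves "U"/"D"/"L"/"R".
from typing import List, Set, Tuple

def path_solves_maze(walls: Set[Tuple[int, int]],
                     start: Tuple[int, int],
                     goal: Tuple[int, int],
                     moves: List[str],
                     size: int = 5) -> bool:
    """Return True if the move sequence stays on the grid, avoids walls,
    and ends exactly on the goal cell."""
    deltas = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
    r, c = start
    for m in moves:
        dr, dc = deltas[m]
        r, c = r + dr, c + dc
        if not (0 <= r < size and 0 <= c < size) or (r, c) in walls:
            return False  # stepped off the grid or into a wall
    return (r, c) == goal

def pass_rate(results: List[bool]) -> float:
    """Fraction of mazes solved across a benchmark set (e.g., 0.14 vs 0.78
    would reproduce the Veo 2 vs Veo 3 comparison cited above)."""
    return sum(results) / len(results)
```

A per-generation pass rate computed this way is what makes the Veo 2 vs Veo 3 comparison directly measurable; the hard part in practice is the frame-to-moves parsing step, which this sketch deliberately assumes away.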
In security, it enables automated vulnerability discovery in physical and digital systems through visual pattern recognition (e.g., spotting weaknesses in infrastructure from imagery). For autonomy, it provides a path toward robust world models for robots and vehicles that understand physical dynamics and can plan manipulations zero-shot. The scope includes everything from media integrity to physical safety.

**Recommended Actions:**
1. **Red Team Visual Models:** Immediately task threat intelligence units with stress-testing these models for misuse potential, specifically for generating tactical imagery, planning simple physical actions, and bypassing visual CAPTCHAs or authentication systems.
2. **Develop Detection Capabilities:** Invest in and deploy provenance and detection systems for AI-generated video and imagery at scale; treat all unverified visual media as potentially synthetic.
3. **Pressure-Test Physical Systems:** Audit critical physical infrastructure and security systems for vulnerabilities that could be identified and exploited by a visual reasoning model analyzing publicly available imagery or video feeds.
4. **Establish Governance Protocols:** Develop and advocate for internal and industry-wide standards on the development, testing, and deployment of generalist vision models, focusing on catastrophic misuse risks.

**Confidence Matrix:**
* **Trajectory toward generalist vision models:** HIGH confidence. Based on direct evidence from the research paper and the proven scaling laws from NLP [1, pp. 1, 9].
* **Timeline (12-24 months for robust capabilities):** MEDIUM confidence. Extrapolation from two model generations is preliminary, but the scaling curve appears steep.
* **Misuse potential in misinformation:** HIGH confidence. The demonstrated editing and generation capabilities are directly applicable [1, pp. 4, 6].
* **Impact on autonomous systems:** MEDIUM confidence.
The paper shows early physical reasoning and planning; real-world reliability remains to be proven [1, pp. 4, 7-8].

**Citations:**
1. Video models are zero-shot learners and reasoners. https://arxiv.org/pdf/2509.20328
2. Emergent Spatial Reasoning in AI Models from Video Training Data. https://x.com/emollick/status/1974096724445503827
Published October 5, 2025