Google DeepMind has introduced Agentic Vision, a core new capability of Gemini 3 Flash that changes how the model interprets visual content. By combining visual reasoning with code execution, Agentic Vision lets the model conduct detailed image analysis and ground its findings in verified visual evidence rather than one-shot static predictions.
From static vision to active investigation
Traditional frontier AI models typically process images in a single pass. If an important detail, such as a tiny serial number or a distant street sign, is too small to resolve in that single pass, the model has to guess.
Agentic Vision changes this approach entirely. Gemini 3 Flash treats vision as an active investigation rather than a single glimpse: the model plans its approach, writes and executes image-processing code, re-evaluates the outputs, and only then reaches a conclusion. This shift allows Gemini to verify details visually and reason with greater precision.
According to Google DeepMind, enabling code execution with Gemini 3 Flash delivers a consistent 5–10% improvement in quality across most vision benchmarks.
How Agentic Vision works
At the core of Agentic Vision is a structured Think–Act–Observe loop:
- Think: The model analyzes the user’s query alongside the initial image and formulates a multi-step plan.
- Act: Gemini generates and executes Python code to manipulate or analyze the image—such as cropping, rotating, annotating, counting objects, or running calculations.
- Observe: The transformed image is added back into the model’s context, allowing it to inspect the updated visual data before responding.
This loop enables Gemini 3 Flash to ground its reasoning directly in pixel-level evidence.
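To make the loop concrete, here is a minimal client-side sketch of the Think–Act–Observe pattern. It is an illustrative approximation, not DeepMind's implementation: the `think`, `act`, and `observe` helpers are hypothetical stand-ins for the planning, sandboxed code execution, and context re-insertion that Gemini 3 Flash performs internally.

```python
from PIL import Image

def think(query: str, image: Image.Image) -> dict:
    """Hypothetical planner: decide which region needs closer inspection.
    Here the plan is hard-coded to crop the top-left quadrant."""
    return {"op": "crop", "box": (0, 0, image.width // 2, image.height // 2)}

def act(plan: dict, image: Image.Image) -> Image.Image:
    """Execute the plan as image-processing code (the 'Act' step)."""
    if plan["op"] == "crop":
        return image.crop(plan["box"])
    return image

def observe(context: list, result: Image.Image) -> list:
    """Feed the transformed image back into the working context."""
    context.append(result)
    return context

# One pass of the loop over a local file; a real agent iterates until done.
context = []
source = Image.open("floor_plan.png")  # illustrative input image
plan = think("Is the roof edge detail compliant?", source)
zoomed = act(plan, source)
context = observe(context, zoomed)
print(f"Context now holds {len(context)} derived image(s) at {zoomed.size}")
```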
Real-world use cases already emerging
Developers are already integrating Agentic Vision through the Gemini API and Google AI Studio, unlocking a wide range of applications. At the API level, the capability is switched on by enabling the code-execution tool, as sketched below.
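A minimal sketch using the google-genai Python SDK; the model identifier and image file are placeholders, so verify names against the current Gemini API documentation:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

with open("receipt.jpg", "rb") as f:  # illustrative input image
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # placeholder ID; use the name from the docs
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Total up the line items in this receipt and verify the sum.",
    ],
    config=types.GenerateContentConfig(
        # Enabling the code-execution tool lets the model write and run
        # Python against the image during its Think-Act-Observe loop.
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)
print(response.text)
```

With the tool enabled, the response can interleave the model's generated Python, its execution results, and the final answer.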
1. Zooming and fine-grained inspection
Gemini 3 Flash can automatically zoom in on small details when needed.
PlanCheckSolver.com, an AI-driven building plan validation platform, reported a 5% accuracy improvement by enabling code execution. The model iteratively cropped and analyzed high-resolution sections—such as roof edges and structural details—then reinserted those images into its context to verify compliance with complex building codes.
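The code the model writes in its sandbox for this kind of crop-and-reinspect step is straightforward. A plausible sketch using Pillow (the file name and crop coordinates are illustrative):

```python
from PIL import Image

# Load the high-resolution plan and zoom into a region of interest.
plan = Image.open("site_plan.png")

# Illustrative coordinates for a roof-edge detail: (left, upper, right, lower).
roof_detail = plan.crop((1800, 240, 2600, 900))

# Upscale the crop so fine line work is legible on re-inspection.
roof_detail = roof_detail.resize(
    (roof_detail.width * 2, roof_detail.height * 2),
    resample=Image.Resampling.LANCZOS,
)
roof_detail.save("roof_detail_zoom.png")  # re-enters the model's context
```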
2. Image annotation for precise reasoning
Agentic Vision allows Gemini to draw directly on images. For example, when asked to count fingers on a hand, the model uses Python to place bounding boxes and numeric labels over each detected finger. This visual “scratchpad” minimizes counting errors and ensures results are grounded in exact visual understanding.
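A sketch of what that scratchpad step can look like with Pillow's ImageDraw; the box coordinates here are illustrative, whereas in practice the model derives them from its own detections:

```python
from PIL import Image, ImageDraw

image = Image.open("hand.jpg")  # illustrative input image
draw = ImageDraw.Draw(image)

# Illustrative finger detections as (left, upper, right, lower) boxes.
fingers = [
    (40, 30, 90, 210),
    (100, 20, 150, 200),
    (160, 15, 210, 195),
    (220, 25, 270, 205),
    (280, 60, 340, 230),
]

# Draw a numbered box over each detection so the count is visually grounded.
for i, box in enumerate(fingers, start=1):
    draw.rectangle(box, outline="red", width=3)
    draw.text((box[0], box[1] - 14), str(i), fill="red")

image.save("hand_annotated.png")
print(f"Counted {len(fingers)} fingers")
```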
3. Visual math and data visualization
High-density tables and multi-step visual arithmetic are common failure points for standard language models. Gemini 3 Flash avoids hallucinations by offloading calculations to a deterministic Python environment. In demonstrations, the model extracts raw data from images, normalizes values, and generates professional-grade Matplotlib charts—replacing guesswork with verifiable computation.
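A minimal sketch of that offloading pattern: once the model has read the raw numbers out of an image, the arithmetic and the chart are produced by deterministic code rather than by the language model itself. The figures below are illustrative, not from any demo.

```python
import matplotlib.pyplot as plt

# Values the model extracted from a table image (illustrative numbers).
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [4.2, 5.1, 4.8, 6.3]  # in millions

# Deterministic computation replaces in-context arithmetic.
total = sum(revenue)
share = [v / total * 100 for v in revenue]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(quarters, revenue)
for x, (v, s) in enumerate(zip(revenue, share)):
    ax.annotate(f"{v:.1f}M ({s:.0f}%)", (x, v), ha="center", va="bottom")
ax.set_ylabel("Revenue (millions)")
ax.set_title("Quarterly revenue extracted from table image")
fig.savefig("revenue_chart.png", dpi=150)
```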
What’s coming next
Google DeepMind says Agentic Vision is only the beginning. Future updates are expected to include:
- More implicit code-driven behaviors: Capabilities like image rotation and visual math currently require explicit prompts, but the goal is to make these actions automatic.
- Additional tools: Planned integrations include web search and reverse image search to further ground visual understanding.
- Broader model support: Agentic Vision is set to expand beyond Gemini 3 Flash to other model sizes.
With Agentic Vision, Gemini 3 Flash marks a significant step toward AI systems that don’t just see images but actively investigate, verify, and reason about them.