UIC and Adobe Introduce VGent Model for Multi-Object Localization, Boosting F1 Score by 20 Points


A research team from the University of Illinois Chicago (UIC) and Adobe has developed VGent, a new modular architecture designed to enhance multi-object localization in visual grounding tasks. The model aims to improve both inference speed and performance, particularly in scenarios involving multiple targets and visual references.
VGent, with fewer than 16 billion parameters, demonstrated an average F1 score improvement of 18.24 points over Qwen3-VL-30B on the Omnimodal Referring Expression Segmentation (ORES) benchmark. The architecture maintains a consistent inference speed regardless of the number of targets, addressing a limitation of existing models.

Addressing Challenges in Visual Grounding
Visual grounding is a key component for fine-grained reasoning in Multimodal Large Language Models (MLLMs) and is essential for human-computer interaction and embodied intelligence. Current solutions typically fall into two categories:
The "native-token approach," used by models such as Qwen2.5-VL and Ferret-v2, generates bounding box coordinates sequentially. This method can be slow, with inference time increasing linearly with the number of targets, and is prone to hallucinations in multi-target environments. Models may prematurely stop or enter infinite generation loops in dense scenes.
The "new-token approach" introduces special tokens like [SEG] to refer to target objects. This requires extensive dataset collection and rebuilding MLLMs from scratch, potentially compromising the general reasoning capabilities acquired during pre-training. It also prevents direct use of advanced, pre-trained open-source MLLMs.

VGent's Modular Architecture
The UIC and Adobe researchers proposed VGent, a modular encoder-decoder architecture that assigns high-level semantic reasoning to the MLLM and low-level pixel prediction to an object detector; the two components are decoupled and connected through the MLLM's hidden states. The researchers posited that semantic reasoning and precise localization are distinct capabilities, and that forcing a single model to handle both leads to performance and efficiency trade-offs.
VGent's design leverages the complementary strengths of MLLMs and detectors: MLLMs excel at multimodal semantic alignment and reasoning, while detectors efficiently provide accurate multi-object detection boxes.

The architecture comprises an encoder and a decoder, supplemented by three modular enhancement mechanisms. The encoder is an MLLM trained with QuadThinker (described below) to improve multi-object reasoning; it is kept frozen and supplies hidden states to the decoder. The decoder, initialized from the MLLM's LLM layers, takes object proposals from the detector as queries, which interact with the encoder's hidden states via cross-attention while self-attention layers let the proposal queries exchange information with one another. The final output is a binary judgment that selects the target proposals, and segmentation masks are obtained by prompting the Segment Anything Model (SAM).
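A minimal PyTorch sketch of how such a proposal-selection decoder could be wired is shown below. The class names, layer count, hidden size, and the linear box-embedding scheme are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the decoder described above (assumed structure, not VGent's code).
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, mllm_hidden):
        # Proposal queries exchange information with each other ...
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        # ... then attend to the frozen MLLM encoder's hidden states.
        q = self.norm2(q + self.cross_attn(q, mllm_hidden, mllm_hidden)[0])
        return self.norm3(q + self.ffn(q))

class ProposalSelector(nn.Module):
    """Scores each detector proposal against the MLLM's hidden states."""
    def __init__(self, dim: int = 1024, num_layers: int = 4):
        super().__init__()
        self.box_embed = nn.Linear(4, dim)                  # (x1, y1, x2, y2) -> query embedding
        self.blocks = nn.ModuleList(DecoderBlock(dim) for _ in range(num_layers))
        self.judge = nn.Linear(dim, 1)                      # binary keep/discard logit per proposal

    def forward(self, proposal_boxes, mllm_hidden):
        q = self.box_embed(proposal_boxes)                  # [B, num_proposals, dim]
        for blk in self.blocks:
            q = blk(q, mllm_hidden)
        return self.judge(q).squeeze(-1)                    # selection logits

boxes = torch.rand(1, 300, 4)        # detector proposals (normalized coordinates)
hidden = torch.randn(1, 2048, 1024)  # hidden states from the frozen MLLM encoder
keep_logits = ProposalSelector()(boxes, hidden)
# Selected boxes would then prompt SAM to produce the final masks.
```

Because every proposal is scored in a single forward pass rather than generated token by token, latency stays flat as the number of targets grows, which is consistent with the constant inference speed reported for VGent.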
Enhancements for Multi-Object Reasoning
To address the performance decline MLLMs show in multi-object scenarios, the researchers introduced QuadThinker, a reinforcement learning paradigm based on Group Relative Policy Optimization (GRPO). It guides the model through a region-to-global, step-by-step reasoning process: first counting the targets in each image quadrant, then summing the counts, and finally predicting the specific coordinates.
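The function below is a hedged sketch of how such a region-to-global rollout could be scored under GRPO; the output format, the weights of the three reward terms, and the IoU threshold are assumptions, not details taken from the paper.

```python
# Hedged sketch of a GRPO-style reward for the quadrant-count-then-localize recipe.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def quadthinker_reward(pred_counts: Dict[str, int], pred_boxes: List[Box],
                       gt_counts: Dict[str, int], gt_boxes: List[Box]) -> float:
    # Step 1 (region level): reward correct per-quadrant target counts.
    hits = sum(pred_counts.get(q, -1) == c for q, c in gt_counts.items())
    r_count = hits / max(len(gt_counts), 1)
    # Step 2 (global level): reward predicting as many boxes as the quadrant sum.
    r_total = float(len(pred_boxes) == sum(gt_counts.values()))
    # Step 3 (localization): reward boxes that overlap some ground-truth box.
    matched = sum(any(iou(p, g) > 0.5 for g in gt_boxes) for p in pred_boxes)
    r_box = min(matched / max(len(gt_boxes), 1), 1.0)
    return 0.25 * r_count + 0.25 * r_total + 0.5 * r_box
```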
VGent also incorporates Mask-aware Labeling to resolve ambiguity between detection and segmentation tasks. Detection typically optimizes "one-to-one" matching, while segmentation aims to recall all foreground pixels. Mask-aware Labeling uses the Intersection-over-Area (IoA) metric for additional label assignment, allowing the model to recall valid proposals that cover only part of a target group.
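The snippet below sketches IoA-based label assignment; it assumes the common definition of IoA as the overlap between a proposal and the target region divided by the proposal's own area, and the 0.5 threshold is purely illustrative.

```python
# Sketch of mask-aware labeling via Intersection-over-Area (IoA). A proposal that
# lies mostly inside a large target group is labeled positive even though it covers
# only part of the group, which plain one-to-one IoU matching would reject.
import numpy as np

def ioa_positive(proposal_box: np.ndarray, target_mask: np.ndarray, thr: float = 0.5) -> bool:
    """proposal_box: [x1, y1, x2, y2] in pixels; target_mask: HxW binary mask."""
    x1, y1, x2, y2 = proposal_box.astype(int)
    box_area = max(x2 - x1, 0) * max(y2 - y1, 0)
    if box_area == 0:
        return False
    covered = target_mask[y1:y2, x1:x2].sum()   # target pixels inside the proposal
    return covered / box_area >= thr            # positive if the proposal is mostly target

mask = np.zeros((480, 640), dtype=np.uint8)
mask[100:300, 100:500] = 1                                  # one large target group
print(ioa_positive(np.array([120, 120, 260, 280]), mask))   # True: partial but valid proposal
```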
To improve candidate box selection accuracy, VGent includes a Global Target Recognition module. This module aggregates proposals from multiple detectors and introduces learnable queries trained to predict the total number of targets and positive sample proposals. These queries interact with proposal queries through self-attention, propagating global statistical information to each candidate box for more precise selection.
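One plausible shape for such a module is sketched below: a few learnable global queries join the proposal queries in self-attention and feed a count-prediction head. The number of global queries, the hidden size, and the classification-style count head are assumptions.

```python
# Sketch of a Global Target Recognition module (assumed design, not VGent's code).
import torch
import torch.nn as nn

class GlobalTargetRecognition(nn.Module):
    def __init__(self, dim: int = 1024, num_global: int = 4, max_targets: int = 100):
        super().__init__()
        self.global_queries = nn.Parameter(torch.randn(num_global, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.count_head = nn.Linear(dim, max_targets + 1)    # total target count as a class

    def forward(self, proposal_queries: torch.Tensor):
        B = proposal_queries.size(0)
        g = self.global_queries.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([g, proposal_queries], dim=1)          # global tokens + proposal tokens
        x = x + self.attn(x, x, x)[0]                        # joint self-attention
        g_out, p_out = x[:, :g.size(1)], x[:, g.size(1):]
        count_logits = self.count_head(g_out.mean(dim=1))    # global estimate of target count
        return count_logits, p_out                           # proposals enriched with global stats

counts, enriched = GlobalTargetRecognition()(torch.randn(2, 300, 1024))
```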

Experimental Performance
Evaluations on the ORES benchmark showed VGent achieving new state-of-the-art results: an F1 score 20.58 points above the previous best method, RAS13B, along with improved gIoU and cIoU scores. VGent also maintained a constant, fast inference speed, avoiding the linear latency growth seen in auto-regressive models.
On traditional single-target benchmarks (RefCOCO, RefCOCO+, RefCOCOg), VGent achieved an average accuracy of 90.1%, outperforming larger models such as InternVL3.5-20B and InternVL3.5-38B, and improving on its Qwen2.5-VL-7B backbone by an average of 3.5%.
Visualizations demonstrated VGent's robustness in complex scenarios, accurately localizing multiple similar objects and successfully inferring targets from visual references while excluding distractors.