Jina-VLM: A 2.4B Parameter Multilingual Vision Model for Laptops

Victor Zhang

Jina AI has released Jina-VLM, a 2.4 billion-parameter Vision-Language Model (VLM) designed to operate on consumer-grade hardware, including laptops. The model achieves state-of-the-art benchmarks in Multilingual Visual Question Answering (Multilingual VQA) tasks within its scale.

The developers addressed challenges typically associated with small-parameter VLMs, such as performance degradation in text tasks when visual capabilities are enhanced, and the high inference costs of processing high-resolution images. Jina-VLM integrates an Attention-pooling mechanism, connecting the SigLIP2 visual encoder with a Qwen3 language base. This architecture supports 29 languages and handles question answering, understanding, and extraction tasks on natural and document images at various resolutions.

Core Performance and Multilingual Capabilities

Jina-VLM's design emphasizes architectural optimization to overcome parameter limitations. The model reportedly outperforms similarly sized VLMs, such as Qwen2-VL and InternVL series, across several benchmarks.

| Model | Parameters | VQA Avg. Score | MMMB (Multilingual) | MMB | DocVQA | OCRBench |
|----------------|------------|----------------|---------------------|------|--------|----------|
| jina-vlm | 2.4B | 72.3 | 78.8 | 74.3 | 90.6 | 778 |
| Qwen2-VL-2B | 2.1B | 66.4 | 71.3 | 69.4 | 89.2 | 809 |
| Qwen3-VL-2B | 2.8B | 71.6 | 75.0 | 72.3 | 92.3 | 858 |
| InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 87.4 | 835 |
| InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 88.5 | 836 |

The model achieved a score of 78.8 in multilingual understanding tests across six languages: Arabic, Chinese, English, Portuguese, Russian, and Turkish. In visual question answering (VQA) tasks, Jina-VLM demonstrated robust performance in areas such as charts (ChartQA), documents (DocVQA), scene text (TextVQA), and scientific diagrams (CharXiv).

A key aspect of Jina-VLM's design is its ability to retain language performance while enhancing visual capabilities. The model reportedly maintains the performance of the Qwen3 base in pure text tasks, including MMLU (knowledge) and GSM-8K (mathematics).

Addressing Computational Challenges

Developing a 2-billion-parameter VLM presents a core engineering challenge: processing high-resolution images leads to a "token explosion." Vision Transformer (ViT) pipelines typically split large images into multiple tiles, each of which produces a large number of visual tokens. For instance, a high-resolution image divided into 378x378-pixel tiles can generate nearly 9,500 visual tokens, and the self-attention cost in a Transformer grows quadratically with that token count.

Jina-VLM addresses this by combining dynamic tiling at the input stage with intelligent compression within the model. This approach aims to preserve high-resolution visual information while reducing the number of visual tokens presented to the language model by approximately four times.

The model's architecture involves a data flow from the SigLIP2 visual encoder through a VL-Connector to the Qwen3 language base.

Dynamic Overlapping Tiling and Attention-Pooling

To accommodate SigLIP2's fixed input size and preserve detail, Jina-VLM employs a Dynamic Overlapping Tiling strategy. This involves generating a global thumbnail for overall layout comprehension and then using a sliding window to divide the original image into 378x378 tiles, with a 112-pixel overlap between tiles. This overlap is intended to prevent feature discontinuity for text or objects spanning multiple tiles.
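
The tiling parameters above pin down most of the procedure, so the sketch below shows one way it could be implemented in Python; the edge handling, rounding, and thumbnail step are assumptions rather than Jina's published code.

```python
# Minimal sketch of dynamic overlapping tiling with the parameters described
# above: 378x378 tiles, 112 px overlap, plus a global thumbnail. Edge
# handling and rounding are assumptions, not Jina's exact implementation.
from PIL import Image

TILE = 378
OVERLAP = 112
STRIDE = TILE - OVERLAP  # 266 px between tile origins


def tile_origins(length: int) -> list[int]:
    """Left/top coordinates of 378 px windows covering one image dimension."""
    if length <= TILE:
        return [0]
    origins = list(range(0, length - TILE, STRIDE))
    origins.append(length - TILE)  # final window flush with the image edge
    return origins


def dynamic_overlapping_tiling(img: Image.Image) -> list[Image.Image]:
    """Global thumbnail first, then overlapping 378x378 tiles."""
    tiles = [img.resize((TILE, TILE))]  # thumbnail for overall layout
    for top in tile_origins(img.height):
        for left in tile_origins(img.width):
            tiles.append(img.crop((left, top, left + TILE, top + TILE)))
    return tiles
```

Under these assumptions, a 1000x800-pixel image yields a 4x3 grid of twelve overlapping tiles plus the thumbnail, the twelve-tile case used as an example later in this article.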

The core innovation, according to the developers, is the 2x2 Attention-pooling mechanism within the VL-Connector, which achieves a 4x lossless compression. This mechanism extracts and concatenates two intermediate features from the visual encoder: Layer 18 for fine-grained details and Layer 24 for abstract semantic information. This combination aims to provide a high-information-density input.

The attention-pooling mechanism then aggregates features by dividing the feature map into non-overlapping 2x2 patch neighborhoods. It takes the mean of each neighborhood's features as a query vector and uses an attention mechanism to assign weights, prioritizing informative content over background noise. The aggregated features are then projected into Qwen3's vector space via an MLP layer with a SwiGLU activation function. This process compresses the output of a single tile from 729 tokens to 182 tokens while preserving the grid topology needed for spatial perception.
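
A minimal PyTorch sketch of this pooling-and-projection step is given below. The tensor layout, the learned query/key/value projections, and the treatment of grid edges are assumptions made for illustration; this is not Jina's released connector code.

```python
# Sketch of 2x2 attention pooling followed by a SwiGLU projection into the
# LLM embedding space. Dimensions, projections, and edge handling are
# illustrative assumptions, not Jina's released VL-Connector.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPool2x2(nn.Module):
    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        # vis_dim would be twice the encoder width if layer-18 and layer-24
        # features are concatenated, as described above.
        self.q_proj = nn.Linear(vis_dim, vis_dim)
        self.k_proj = nn.Linear(vis_dim, vis_dim)
        self.v_proj = nn.Linear(vis_dim, vis_dim)
        # SwiGLU MLP projecting pooled features into the LLM token space
        self.gate = nn.Linear(vis_dim, 4 * vis_dim)
        self.up = nn.Linear(vis_dim, 4 * vis_dim)
        self.down = nn.Linear(4 * vis_dim, llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, D) grid of patch features for one tile
        B, H, W, D = x.shape
        x = F.pad(x, (0, 0, 0, W % 2, 0, H % 2))  # make the grid even-sized
        H2, W2 = x.shape[1] // 2, x.shape[2] // 2
        # Group into non-overlapping 2x2 neighborhoods: (B, H2*W2, 4, D)
        x = (x.view(B, H2, 2, W2, 2, D)
              .permute(0, 1, 3, 2, 4, 5)
              .reshape(B, H2 * W2, 4, D))
        q = self.q_proj(x.mean(dim=2, keepdim=True))  # neighborhood mean as query
        k, v = self.k_proj(x), self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(-1, -2) / D ** 0.5, dim=-1)
        pooled = (attn @ v).squeeze(2)                # (B, H2*W2, D): 4x fewer tokens
        # SwiGLU projection into the language model's embedding space
        return self.down(F.silu(self.gate(pooled)) * self.up(pooled))
```

For a 27x27 SigLIP2 patch grid (729 tokens per tile), this kind of pooling cuts the token count by roughly 4x; the simple padding used here would give 196 output tokens, so the reported figure of 182 implies a slightly different handling of the odd-sized grid.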

For a 12-tile input, the developers report corresponding reductions in visual tokens, LLM prefill computation, and KV-cache memory usage.
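
As a back-of-the-envelope illustration using only the figures quoted above (729 tokens per tile before pooling, 182 after, twelve tiles plus one thumbnail), the scale of the effect can be estimated as follows; the developers' exact measurements may differ.

```python
# Rough estimate of the LLM-side effect of 4x token pooling for a 12-tile
# image plus one global thumbnail, based only on the per-tile token counts
# quoted above. Actual figures reported by Jina AI may differ.
TILES = 12 + 1            # 12 detail tiles + 1 global thumbnail
RAW_PER_TILE = 729        # SigLIP2 patch tokens per 378x378 tile
POOLED_PER_TILE = 182     # tokens after 2x2 attention pooling

raw = TILES * RAW_PER_TILE          # 9477 visual tokens without pooling
pooled = TILES * POOLED_PER_TILE    # 2366 visual tokens with pooling
ratio = raw / pooled                # ~4x

print(f"visual tokens: {raw} -> {pooled} ({ratio:.1f}x fewer)")
# KV-cache memory grows linearly with the number of visual tokens, while the
# self-attention part of prefill grows roughly quadratically with it.
print(f"KV-cache: ~{ratio:.1f}x smaller; prefill self-attention: ~{ratio ** 2:.0f}x less work")
```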

The developers note that the visual encoder's computation is relatively fixed, and the optimization primarily impacts the language model's inference stage, which is a significant cost in VLM deployment.

Training Strategy and Catastrophic Forgetting

Jina-VLM's training process aims to mitigate "catastrophic forgetting," a phenomenon where models lose original text reasoning capabilities when adapting to new modalities. The model employs a two-stage training strategy combined with continuous incorporation of text-only data. The training scale covers 29 languages, approximately 5 million images, and 12 billion text tokens. The dataset includes roughly half English data, along with various other languages such as Chinese, Arabic, German, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, Turkish, Vietnamese, Thai, Indonesian, Hindi, and Bengali.

In the first "Alignment" stage, the goal is to map SigLIP2's visual features into Qwen3's text vector space using high-quality captioning datasets. During this stage, 15% of the training mix is pure text data, injected to keep the model from drifting away from its original language behavior. Different learning rates are applied: a high rate for the connector for rapid convergence, and a low rate for the language base to maintain stability.
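
The differentiated learning rates map naturally onto optimizer parameter groups, as in the sketch below; the specific rate values are placeholders rather than the hyperparameters used in training.

```python
# Stage-1 optimizer with per-module learning rates: a high rate for the new
# VL-Connector, a low rate for the Qwen3 language base. The rate values and
# weight decay are illustrative placeholders, not published hyperparameters.
import torch

def build_stage1_optimizer(connector, language_model):
    return torch.optim.AdamW(
        [
            {"params": connector.parameters(), "lr": 1e-4},       # fast convergence for the connector
            {"params": language_model.parameters(), "lr": 1e-6},  # gentle updates to preserve text ability
        ],
        weight_decay=0.01,
    )

# Example with stand-in modules:
opt = build_stage1_optimizer(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
```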

The second stage, "Instruction Tuning," focuses on improving the model's instruction following for tasks like VQA, OCR, and mathematical reasoning. To address gradient instability from heterogeneous training data, Jina-VLM uses a progressive "divide and conquer" strategy. This involves initial training with single-source data for task-specific characteristics, followed by mixed training with multi-source data. Pure text instruction data is continuously incorporated to preserve text reasoning and knowledge retrieval capabilities.
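
One way to realize this progressive mixing is to schedule sampling weights over data sources while always reserving a share for text-only instruction data, as sketched below; the source names, phase lengths, and weights are invented for illustration and are not Jina's training recipe.

```python
# Illustrative sampling schedule for stage 2: single-source batches first,
# then a multi-source mixture, with text-only instruction data interleaved
# throughout. All names, step counts, and weights are invented.
import random


def sample_source(step: int, single_source_steps: int = 10_000) -> str:
    if step < single_source_steps:
        # Phase 1: one visual source at a time, so the model adapts to each
        # task's characteristics in isolation.
        phase = step // (single_source_steps // 3)
        source = ["vqa", "ocr", "math"][min(phase, 2)]
        return "text_only" if random.random() < 0.15 else source
    # Phase 2: mixed multi-source training with a persistent text-only share.
    return random.choices(["vqa", "ocr", "math", "text_only"],
                          weights=[0.35, 0.25, 0.25, 0.15])[0]
```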

This training approach reportedly allows Jina-VLM to achieve strong performance on multilingual visual benchmarks while retaining the performance of the Qwen3-1.7B base on pure text benchmarks like MMLU and GSM-8K.

Availability and Future Outlook

Jina-VLM is available via an API service compatible with the OpenAI interface specification, accessible at https://api-beta-vlm.jina.ai. It supports streaming output and Base64 image input. Command-line tools and Hugging Face Transformers integration are also provided for local deployment and testing.
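
Because the endpoint follows the OpenAI interface convention, a request can be sketched with the standard OpenAI Python client as below; the URL path, model identifier, and message format are assumptions to be checked against the official documentation.

```python
# Hypothetical streaming request with a Base64-encoded image to the
# OpenAI-compatible endpoint mentioned above. The /v1 path, the model name
# "jina-vlm", and the data-URI image format are assumptions, not confirmed
# details from Jina AI's documentation.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://api-beta-vlm.jina.ai/v1",  # assumed path prefix
    api_key="YOUR_JINA_API_KEY",
)

with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

stream = client.chat.completions.create(
    model="jina-vlm",  # assumed model identifier
    stream=True,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the total amount on this invoice?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```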

The developers acknowledge trade-offs in the architectural design. The tiling mechanism incurs a linear increase in computational cost for ultra-high-resolution scenarios and can fragment objects, potentially affecting object counting and spatial relationships. While a global thumbnail offers a fallback, native resolution solutions might be preferable for tasks requiring both detail and global consistency. The model's performance on multi-image inference is noted as relatively weaker due to scarce labeled data. Additionally, a preference for short-chain reasoning during training to achieve faster VQA response times may limit its performance in long-chain multi-step reasoning.

Future work will focus on exploring more efficient resolution processing mechanisms and evaluating the transferability of the multilingual training approach to larger-scale models.