# meta-llama/Llama-3.2-90B-Vision-Instruct

## Model Information
meta-llama/Llama-3.2-90B-Vision-Instruct is a multimodal instruction-tuned model from Meta's Llama 3.2 series. It extends the language capabilities of the Llama 3.2 family with visual reasoning through an integrated image encoder, and is designed for tasks such as visual question answering, chart and document understanding, image captioning, and grounded dialogue.
- Model Developer: Meta
- Model Release Date: September 2024
- Supported Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported for text-only prompting; for image+text applications, only English is supported.
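For context, a minimal inference sketch using Hugging Face transformers (>= 4.45), which provides MllamaForConditionalGeneration and AutoProcessor for the Llama 3.2 Vision models; the image URL and prompt below are illustrative placeholders:

```python
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

# bfloat16 + device_map="auto" shards the 90B weights across available GPUs.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://example.com/chart.png"  # placeholder; any RGB image works
image = Image.open(requests.get(url, stream=True).raw)

# The chat template inserts the image marker where the image content entry
# appears, so the image and the text prompt stay aligned.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarize the key trend in this chart."},
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```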
## Model Architecture
Llama-3.2-90B-Vision-Instruct is a 90B-parameter decoder-only transformer with multimodal capabilities.
Key components include:
- Vision-Language Fusion: Attaches a pretrained image encoder to the Llama text backbone through a vision adapter, a series of cross-attention layers that feed image representations into the language model (see the sketch after this list)
- Token Context Length: Supports up to 128K tokens
- Image Input Format: Images are converted to patch embeddings by the vision encoder; the text stream attends to these features through the cross-attention layers
- Training:
    - Pretrained on paired image-text datasets (e.g., OCR, charts, natural images)
    - Instruction-tuned for grounded multimodal reasoning
    - Aligned using preference data for helpfulness and safety in vision-language tasks
- Multimodal Capabilities:
    - Document understanding
    - Image captioning and visual question answering (VQA)
    - Visual instruction following
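As referenced above, a conceptual sketch of the gated cross-attention fusion pattern. This is not Meta's implementation; the module name, dimensions, and gating scheme are illustrative assumptions, shown only to make the adapter idea concrete:

```python
import torch
import torch.nn as nn

class VisionCrossAttentionBlock(nn.Module):
    """Illustrative fusion block: text hidden states attend to projected
    image-encoder features inside a decoder layer (dimensions are made up)."""

    def __init__(self, d_model: int = 4096, n_heads: int = 32, d_vision: int = 1280):
        super().__init__()
        self.vision_proj = nn.Linear(d_vision, d_model)   # map image features to model width
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))          # gated residual: starts as a no-op

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, d_model); image_feats: (batch, n_patches, d_vision)
        kv = self.vision_proj(image_feats)
        attended, _ = self.cross_attn(self.norm(text_hidden), kv, kv)
        return text_hidden + torch.tanh(self.gate) * attended

# Shape check with dummy tensors:
x = torch.randn(1, 16, 4096)      # text hidden states
img = torch.randn(1, 1601, 1280)  # hypothetical ViT patch features for one image
out = VisionCrossAttentionBlock()(x, img)  # same shape as x: (1, 16, 4096)
```

The zero-initialized tanh gate means the adapter initially leaves the text pathway untouched, a common trick for adding a new modality to a pretrained language model without disturbing it.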
## Benchmark Scores
| Category | Benchmark | Shots | Metric | Llama 3.2 90B Vision-Instruct |
|---|---|---|---|---|
| General | MMLU (CoT) | 0 | Acc. (avg) | 87.1 |
| General | MMLU Pro (CoT) | 5 | Acc. (avg) | 59.4 |
| Steerability | IFEval | – | – | 92.6 |
| Reasoning | GPQA Diamond (CoT) | 0 | Accuracy | 46.8 |
| Code | HumanEval | 0 | Pass@1 | 84.3 |
| Code | MBPP EvalPlus (base) | 0 | Pass@1 | 85.0 |
| Math | MATH (CoT) | 0 | Sympy score | 59.8 |
| Tool Use | BFCL v2 | 0 | AST macro avg. | 80.1 |
| Multilingual | MGSM | 0 | Exact match (EM) | 77.2 |
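Pass@1 in the code rows is conventionally computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), pass@k = 1 − C(n−c, k)/C(n, k) for n generated samples of which c pass. A minimal sketch, with made-up sample counts:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n = samples generated, c = samples that passed, k = budget."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=20, c=17, k=1))  # 0.85 (for k=1 this reduces to c / n)
```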
These results position Llama 3.2 90B Vision-Instruct among the strongest open-weight multimodal models available as of late 2024, combining strong text performance with grounded visual reasoning.