
meta-llama/Llama-3.2-90B-Vision-Instruct

Model Information

meta-llama/Llama-3.2-90B-Vision-Instruct is a multimodal instruction-tuned model from Meta's Llama 3.2 series. It extends the language capabilities of the Llama 3.2 family with visual reasoning through integrated image understanding. The model is designed for tasks such as visual question answering, chart and document understanding, image captioning, and grounded image-text dialogue.

  • Model Developer: Meta
  • Model Release Date: September 25, 2024
  • Supported Languages: English (primary, and the only officially supported language for image+text inputs), with text-only support for German, French, Italian, Portuguese, Hindi, Spanish, and Thai, among other languages for multilingual prompting.
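
A minimal inference sketch using the Hugging Face transformers library (the Mllama model classes ship with transformers v4.45 and later; the image URL below is a placeholder, and loading the 90B weights in bfloat16 with device_map="auto" assumes hardware with enough GPU memory):

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

# Loading the 90B checkpoint in bfloat16 requires several high-memory GPUs.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder URL; substitute any chart, document, or natural image.
url = "https://example.com/sample_chart.png"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarize the key trend shown in this chart."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))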

Model Architecture

Llama-3.2-90B-Vision-Instruct is a 90B-parameter decoder-only transformer with multimodal capabilities.

Key components include:

  • Vision-Language Fusion: A pretrained image encoder feeds visual representations into the Llama 3.2 transformer backbone through an adapter of cross-attention layers
  • Context Length: Supports up to 128K text tokens
  • Image Input Format: Images are converted by the image encoder into visual features that the text stream attends to; an image placeholder token marks each image's position in the prompt
  • Training:
    • Pretrained on paired image-text datasets (e.g., OCR, charts, natural images)
    • Instruction-tuned for grounded multimodal reasoning
    • Aligned using preference data for helpfulness and safety in vision-language tasks
  • Multimodal Capabilities:
    • Document understanding
    • Image captioning and VQA (Visual Question Answering)
    • Visual instruction following
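
To make the fusion mechanism above concrete, here is a schematic PyTorch sketch of a cross-attention block in which text hidden states attend over image-encoder features. It is an illustrative toy with invented names and dimensions, not Meta's actual implementation:

import torch
import torch.nn as nn


class VisionTextCrossAttentionBlock(nn.Module):
    """Toy cross-attention fusion block: text hidden states attend over
    image features, and the result is added back through a residual
    connection. Dimensions are illustrative, not those of the real model."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, d_model) from the language decoder
        # image_feats: (batch, num_image_patches, d_model) from the vision encoder
        attended, _ = self.cross_attn(
            query=self.norm(text_hidden), key=image_feats, value=image_feats
        )
        return text_hidden + attended  # residual fusion into the text stream


block = VisionTextCrossAttentionBlock()
text_hidden = torch.randn(1, 16, 64)   # 16 text positions
image_feats = torch.randn(1, 100, 64)  # 100 image patch features
print(block(text_hidden, image_feats).shape)  # torch.Size([1, 16, 64])

A stack of such blocks interleaved with the decoder's ordinary self-attention layers would let the text stream consult image features at multiple depths.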

Benchmark Scores

Category       Benchmark              # Shots   Metric              Llama 3.2 90B Vision-Instruct
General        MMLU (CoT)             0         Acc. (avg)          87.1
General        MMLU Pro (CoT)         5         Acc. (avg)          59.4
Steerability   IFEval                 –         –                   92.6
Reasoning      GPQA Diamond (CoT)     0         Accuracy            46.8
Code           HumanEval              0         Pass@1              84.3
Code           MBPP EvalPlus (base)   0         Pass@1              85.0
Math           MATH (CoT)             0         Sympy Score         59.8
Tool Use       BFCL v2                0         AST Macro Avg.      80.1
Multilingual   MGSM                   0         EM (exact match)    77.2

These results position Llama 3.2 90B Vision-Instruct among the strongest open-access multimodal models available as of late 2024, combining strong text performance with grounded visual reasoning.

