syscv-community/sam-hq-vit-base

Model Information

syscv-community/sam-hq-vit-base is a high-quality, efficient image segmentation model that builds upon the original Segment Anything Model (SAM). It delivers enhanced mask accuracy with minimal increase in computational demands, making it especially effective for scenarios requiring detailed segmentation, even when provided with vague or minimal prompts.

  • Model Developer: SysCV Community
  • Model Release Date: June 2023 (SAM-HQ)
  • Supported Task: Image Segmentation via point prompts (see the usage sketch below)
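
The snippet below is a minimal sketch of point-prompt inference, assuming the Hugging Face transformers integration of SAM-HQ (SamHQModel and SamHQProcessor); the image URL and the point coordinates are illustrative placeholders, not part of this model card.

```python
import requests
import torch
from PIL import Image
from transformers import SamHQModel, SamHQProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("syscv-community/sam-hq-vit-base").to(device)
processor = SamHQProcessor.from_pretrained("syscv-community/sam-hq-vit-base")

# Illustrative input image and a single point prompt (x, y in pixel coordinates)
img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_points = [[[450, 600]]]  # one point on the target object

inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Rescale the predicted masks back to the original image resolution
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
print(masks[0].shape, outputs.iou_scores)
```

The post-processing step matters here: the decoder predicts masks at the model's internal resolution, and post_process_masks resizes them to the source image before use.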

Model Architecture

syscv-community/sam-hq-vit-base enhances the original SAM framework by modifying its decoder to include a High-Quality (HQ) output token. This addition allows the model to produce more detailed masks directly during inference, especially around object edges and fine structures. It maintains the same ViT-B (Vision Transformer - Base) backbone used in SAM, preserving the strengths of the original architecture.

While SAM relies on lower-resolution masks followed by upscaling, HQ-SAM generates high-resolution outputs natively, eliminating the need for additional refinement steps. These architectural improvements are achieved with minimal increase in computational cost, ensuring the model remains fast and responsive in real-time use cases.
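
To make the HQ-token idea concrete, here is an illustrative PyTorch sketch, not the reference implementation: a single learnable output token is appended to SAM's output tokens before the mask decoder, and its post-decoder embedding is matched per-pixel against a high-resolution feature map fused from early ViT features and decoder features. All module names, dimensions, and fusion details here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class HQTokenHead(nn.Module):
    """Illustrative sketch of the HQ output-token mechanism; module and
    dimension choices are assumptions, not the reference implementation."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Learnable HQ token, appended to SAM's existing output tokens
        self.hq_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Maps the HQ token's decoder embedding into mask-feature space
        self.token_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim // 8)
        )
        # Fuses fine-detail early ViT features with semantic decoder features
        self.fuse = nn.Conv2d(2 * (dim // 8), dim // 8, kernel_size=1)

    def append_hq_token(self, output_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate the HQ token before the tokens enter SAM's mask decoder
        b = output_tokens.shape[0]
        return torch.cat([output_tokens, self.hq_token.expand(b, -1, -1)], dim=1)

    def forward(self, decoder_tokens, early_feats, decoder_feats):
        # decoder_tokens: (B, N, dim) tokens after the mask decoder; the HQ
        # token's embedding is assumed to sit at the last index.
        # early_feats / decoder_feats: (B, dim//8, H, W) upscaled feature maps.
        hq_embed = self.token_mlp(decoder_tokens[:, -1])               # (B, dim//8)
        fused = self.fuse(torch.cat([early_feats, decoder_feats], 1))  # (B, dim//8, H, W)
        # Per-pixel dot product between token embedding and fused features
        return torch.einsum("bc,bchw->bhw", hq_embed, fused)           # mask logits
```

Because the mask is read out as a dot product against a natively high-resolution fused feature map, fine boundary detail survives without a separate upscaling or refinement pass.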

Key Architecture Details

  • Model Type: Image Segmentation Model (Modified Transformer-based architecture)
  • Parameters: 362.1M
    • ~358M from the frozen ViT-B image encoder (inherited from SAM)
    • ~4.1M trainable parameters in the HQ mask decoder
  • Base Architecture: Vision Transformer (ViT-B) for image encoding
  • Enhancements: Integration of a High-Quality (HQ) output token into the mask decoder for improved mask fidelity.
  • Input:
    • RGB Image
    • Prompt (supported in AI Refinery: points)
  • Output: High-quality segmentation masks
  • Training:
    • Inherits SAM’s pretraining on the SA-1B dataset (1B masks)
    • Fine-tuned on HQSeg-44K, a composite dataset of 44K fine-grained mask annotations, to improve edge detail and structure accuracy (see the sketch after this list)
  • Capabilities:
    • Generates highly accurate segmentation masks from various prompts.
    • Handles ambiguous prompts with improved precision.
    • Optimized for a balance between speed and quality.
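
As a rough illustration of the frozen-encoder/trainable-decoder split described above, the sketch below freezes the image encoder and counts trainable parameters. It assumes the transformers SAM-style attribute name vision_encoder; actual attribute names may differ by version.

```python
from transformers import SamHQModel

model = SamHQModel.from_pretrained("syscv-community/sam-hq-vit-base")

# Freeze the ViT-B image encoder, mirroring the recipe above in which only
# the lightweight HQ components of the mask decoder are trained.
# `vision_encoder` follows transformers' SAM naming and is an assumption here.
for param in model.vision_encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M total")
```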

Benchmark Scores

SAM-HQ (ViT-Base) demonstrates a measurable improvement in mask quality over the original SAM (ViT-Base) across various segmentation benchmarks, achieving higher precision with minimal computational overhead.

Category       Benchmark Dataset   Metric                   SAM-HQ (ViT-Base)
Mask Quality   COCO                Average Precision (AP)   ~46.7
Mask Quality   COCO                Boundary AP              31.3
