intfloat/multilingual-e5-large

Model Information

intfloat/multilingual-e5-large is a multilingual text embedding model designed for tasks such as semantic search, information retrieval, and text similarity. Built upon the XLM-RoBERTa architecture, it has been continually trained on a mixture of multilingual datasets, enabling it to support a wide range of languages. The model produces 1024-dimensional embeddings and is optimized for high performance across various benchmarks.

  • Model Developer: Intfloat
  • Model Release Date: Mid-2023
  • Supported Languages: The model supports the 100 languages inherited from XLM-RoBERTa. Performance varies by language: high-resource languages such as English typically perform best, while low-resource languages may see degraded embedding quality.
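
To make the expected input format concrete, here is a minimal usage sketch assuming the sentence-transformers package is installed (the model can also be used through transformers directly; see the pooling sketch further below). E5 models expect every input text to carry a "query: " or "passage: " prefix.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# E5 models expect a "query: " or "passage: " prefix on every input text.
queries = ["query: how much protein should a female eat"]
passages = [
    "passage: As a general guideline, the CDC's average requirement of protein "
    "for women ages 19 to 70 is 46 grams per day.",
    "passage: The Eiffel Tower is located in Paris, France.",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# With L2-normalized embeddings, the dot product equals cosine similarity.
scores = q_emb @ p_emb.T
print(scores)  # the protein passage should score higher than the unrelated one
```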

Model Architecture

  • Base Model: XLM-RoBERTa-large
  • Number of Layers: 24
  • Embedding Size: 1024
  • Training Objective: Contrastive learning on multilingual datasets to produce high-quality text embeddings.
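
The 1024-dimensional embeddings can also be produced from the raw transformers checkpoint; the sketch below uses average pooling over the last hidden state followed by L2 normalization, the pooling scheme documented for the E5 family. It assumes torch and transformers are installed.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Average pooling over token embeddings, ignoring padding positions.
def average_pool(last_hidden_state, attention_mask):
    masked = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")

texts = [
    "query: what is the capital of France",
    "passage: Paris is the capital of France.",
]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)  # each vector has 1024 dimensions
print(embeddings.shape)                           # torch.Size([2, 1024])
```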

Benchmark Scores

Mr. TyDi Benchmark (Mean Reciprocal Rank @10)

Model                    Avg MRR@10   ar     bn     en     fi     id     ja     ko     ru     sw     te     th
BM25                     33.3         36.7   41.3   15.1   28.8   38.2   21.7   28.1   32.9   39.6   42.4   41.7
mDPR                     16.7         26.0   25.8   16.2   11.3   14.6   18.1   21.9   18.5    7.3   10.6   13.5
BM25 + mDPR              41.7         49.1   53.5   28.4   36.5   45.5   35.5   36.2   42.7   40.5   42.0   49.2
multilingual-e5-small    64.4         71.5   66.3   54.5   57.7   63.2   55.4   54.3   60.8   65.4   89.1   70.1
multilingual-e5-base     65.9         72.3   65.0   58.5   60.8   64.9   56.6   55.8   62.7   69.0   86.6   72.7
multilingual-e5-large    70.5         77.5   73.2   60.8   66.8   68.5   62.5   61.6   65.8   72.7   90.2   76.2

Note: Scores are based on the Mr. TyDi benchmark, which evaluates multilingual information retrieval performance.
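
For context, MRR@10 takes, for each query, the reciprocal rank of the first relevant document within the top 10 retrieved results (0 if none appears) and averages this over all queries; the table reports the value multiplied by 100. A minimal sketch with hypothetical ranked lists:

```python
def mrr_at_10(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank @10: for each query, take 1/rank of the first
    relevant document among the top 10 results (0 if none), then average."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:10], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Hypothetical example: first query hits at rank 2, second at rank 1.
print(mrr_at_10([["d3", "d7"], ["d1", "d9"]], [{"d7"}, {"d1"}]))  # (0.5 + 1.0) / 2 = 0.75
```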

