# intfloat/multilingual-e5-large
## Model Information
`intfloat/multilingual-e5-large` is a multilingual text embedding model designed for tasks such as semantic search, information retrieval, and text similarity. Built on the XLM-RoBERTa architecture, it was continually trained on a mixture of multilingual datasets, enabling it to support a wide range of languages. The model produces 1024-dimensional embeddings and performs strongly on multilingual retrieval benchmarks such as Mr. TyDi (see below).
- Model Developer: intfloat (Microsoft)
- Model Release Date: Mid-2023
- Supported Languages: The model supports the 100 languages inherited from XLM-RoBERTa. Embedding quality varies by language, however, and tends to degrade for low-resource languages. Inputs must also carry the E5 task prefixes, as shown in the usage sketch below.
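
E5-family models expect a task prefix on every input text: `query: ` for search queries (and for symmetric tasks such as similarity) and `passage: ` for documents. Below is a minimal usage sketch with the sentence-transformers library; the query/passage texts are purely illustrative.

```python
# Minimal usage sketch with sentence-transformers (pip install sentence-transformers).
# E5 models expect a "query: " or "passage: " prefix on every input text.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Illustrative query/passage pair; prefixes follow the E5 convention.
queries = ["query: how much protein should a female eat"]
passages = [
    "passage: As a general guideline, the CDC's average protein requirement "
    "for women ages 19 to 70 is 46 grams per day.",
]

# Normalized embeddings let cosine similarity reduce to a dot product.
query_emb = model.encode(queries, normalize_embeddings=True)
passage_emb = model.encode(passages, normalize_embeddings=True)

scores = query_emb @ passage_emb.T  # shape: (num_queries, num_passages)
print(scores)
```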
## Model Architecture
- Base Model: XLM-RoBERTa-large
- Number of Layers: 24
- Embedding Size: 1024
- Training Objective: Contrastive learning on multilingual text pairs to produce high-quality text embeddings; the pooling step that turns encoder states into a single vector is shown in the sketch below.
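
To make the architecture concrete, here is a sketch of how a 1024-dimensional sentence embedding is obtained from the raw encoder with the transformers library: E5 average-pools the final hidden states over non-padding tokens, then L2-normalizes the result. This mirrors the usage pattern documented on the model card; the input text is illustrative.

```python
# Sketch of embedding extraction: average-pool the final hidden states
# over non-padding tokens, then L2-normalize.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")

def average_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then divide by the count of real tokens.
    masked = last_hidden.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return masked.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)

batch = tokenizer(
    ["query: ¿cuánta proteína debe comer una mujer?"],  # multilingual input
    max_length=512, padding=True, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    out = model(**batch)

emb = F.normalize(average_pool(out.last_hidden_state, batch["attention_mask"]), p=2, dim=1)
print(emb.shape)  # torch.Size([1, 1024]); matches the embedding size above
```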
## Benchmark Scores
### Mr. TyDi Benchmark (Mean Reciprocal Rank @10)
| Model | Avg MRR@10 | ar | bn | en | fi | id | ja | ko | ru | sw | te | th |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BM25 | 33.3 | 36.7 | 41.3 | 15.1 | 28.8 | 38.2 | 21.7 | 28.1 | 32.9 | 39.6 | 42.4 | 41.7 |
| mDPR | 16.7 | 26.0 | 25.8 | 16.2 | 11.3 | 14.6 | 18.1 | 21.9 | 18.5 | 7.3 | 10.6 | 13.5 |
| BM25 + mDPR | 41.7 | 49.1 | 53.5 | 28.4 | 36.5 | 45.5 | 35.5 | 36.2 | 42.7 | 40.5 | 42.0 | 49.2 |
| multilingual-e5-small | 64.4 | 71.5 | 66.3 | 54.5 | 57.7 | 63.2 | 55.4 | 54.3 | 60.8 | 65.4 | 89.1 | 70.1 |
| multilingual-e5-base | 65.9 | 72.3 | 65.0 | 58.5 | 60.8 | 64.9 | 56.6 | 55.8 | 62.7 | 69.0 | 86.6 | 72.7 |
| multilingual-e5-large | 70.5 | 77.5 | 73.2 | 60.8 | 66.8 | 68.5 | 62.5 | 61.6 | 65.8 | 72.7 | 90.2 | 76.2 |
Note: Scores are from the Mr. TyDi benchmark, which evaluates multilingual information retrieval performance; higher MRR@10 is better.
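
For reference, MRR@10 averages, over all queries, the reciprocal rank of the first relevant document among the top 10 results; a query with no relevant hit in the top 10 contributes 0. A small self-contained sketch of the metric, using hypothetical document IDs:

```python
# MRR@10 sketch: mean over queries of 1/rank of the first relevant document
# within the top 10, or 0 if no relevant document appears there.
def mrr_at_10(ranked_ids_per_query, relevant_ids_per_query):
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked[:10], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)

# Hypothetical example: first query hits at rank 2, second at rank 1.
print(mrr_at_10([["d3", "d1"], ["d7"]], [{"d1"}, {"d7"}]))  # (0.5 + 1.0) / 2 = 0.75
```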