
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Authors

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

Abstract

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 sizes ranging from 80M to 14B parameters. Trained on a newly curated corpus of 60 million publicly available, high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By combining a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, we obtain models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
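The abstract names matryoshka learning as one of the levers for efficiency: embeddings are trained so that truncated prefixes of the vector remain useful, letting smaller dimensions serve resource-constrained settings. The sketch below is a minimal illustration of that idea using an in-batch contrastive (InfoNCE) objective averaged over nested prefix dimensions; the specific dimensions, temperature, and loss choice are assumptions for illustration, not the training recipe reported in the paper.

# Minimal sketch of matryoshka-style contrastive training for text embeddings.
# Illustration only: dims, temperature, and the InfoNCE objective are assumptions,
# not the authors' exact recipe.
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, d: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # In-batch contrastive loss: the i-th document is the positive for the i-th query.
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)

def matryoshka_loss(q_emb: torch.Tensor, d_emb: torch.Tensor,
                    dims=(64, 128, 256, 512, 1024)) -> torch.Tensor:
    # Average the contrastive loss over nested prefixes of the embedding,
    # so truncated vectors remain usable at inference time.
    losses = [info_nce(q_emb[:, :k], d_emb[:, :k]) for k in dims if k <= q_emb.size(-1)]
    return torch.stack(losses).mean()

# Usage with random stand-ins for encoder outputs (batch of 32, full dim 1024):
q_emb = torch.randn(32, 1024, requires_grad=True)
d_emb = torch.randn(32, 1024, requires_grad=True)
loss = matryoshka_loss(q_emb, d_emb)
loss.backward()

At inference, a downstream application would simply slice the stored embeddings to the desired dimension (e.g. the first 128 components) rather than re-encoding, which is the practical payoff of training with nested prefixes.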

Metadata

arXiv ID: 2603.19223
Provider: ARXIV
Primary Category: cs.CL
Published: 2026-03-19
Fetched: 2026-03-20 06:02
