Paper
Fanar 2.0: Arabic Generative AI Stack
Authors
FANAR TEAM, Ummar Abbas, Mohammad Shahmeer Ahmad, Minhaj Ahmad, Abdulaziz Al-Homaid, Anas Al-Nuaimi, Enes Altinisik, Ehsaneddin Asgari, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky, Ahmed Elmagarmid, Mohamed Eltabakh, Asim Ersoy, Masoomali Fatehkia, Mohammed Qusay Hashim, Majd Hawasly, Mohamed Hefeeda, Mus'ab Husaini, Keivin Isufaj, Soon-Gyo Jung, Houssam Lachemat, Ji Kim Lucas, Abubakr Mohamed, Tasnim Mohiuddin, Basel Mousi, Hamdy Mubarak, Ahmad Musleh, Mourad Ouzzani, Amin Sadeghi, Husrev Taha Sencar, Mohammed Shinoy, Omar Sinan, Yifan Zhang
Abstract
We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, and Arabic accounts for only ~0.5% of web data despite having 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The Aura speech family gains a long-form ASR model for hours-long audio. The Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq uses a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.
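The abstract credits part of the gains to model merging alongside continual pre-training. The paper's exact merging recipe is not specified here, so the following is only a minimal sketch of one common approach (weighted parameter averaging of same-architecture checkpoints, in the style of "model soups"); the checkpoint paths and weights are illustrative assumptions.

```python
import torch


def merge_state_dicts(state_dicts, weights=None):
    """Weighted average of parameters from same-architecture checkpoints.

    A generic "model soup"-style merge, shown for illustration only;
    it is not necessarily the merging method used for Fanar-27B.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"

    merged = {}
    for name in state_dicts[0]:
        # Average each tensor across checkpoints with the given weights.
        merged[name] = sum(w * sd[name].float()
                           for w, sd in zip(weights, state_dicts))
    return merged


# Hypothetical usage: blend a continually pre-trained checkpoint with the
# original backbone to retain base capabilities (paths are illustrative).
# base = torch.load("gemma-3-27b.pt")
# adapted = torch.load("fanar-27b-cpt.pt")
# merged = merge_state_dicts([base, adapted], weights=[0.3, 0.7])
```

A merge like this is one way to trade off new Arabic-centric capability against regressions on the backbone's original strengths without further training.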
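The orchestrator is described as coordinating all components through intent-aware routing with defense-in-depth safety validation. As a hedged sketch of how such a pipeline could be wired, assuming a keyword stand-in for the intent classifier and placeholder handlers (the component names echo the stack above, but the classifier, handlers, and control flow are hypothetical, not the paper's architecture):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    intent: str
    handler: Callable[[str], str]


def classify_intent(query: str) -> str:
    """Stand-in classifier; a real system would use a trained intent model."""
    lowered = query.lower()
    if "translate" in lowered:
        return "translation"
    if "poem" in lowered or "poetry" in lowered:
        return "poetry"
    return "general"


def moderate(text: str) -> bool:
    """Placeholder for a FanarGuard-style check on both input and output."""
    return "unsafe" not in text.lower()


ROUTES = {
    "translation": Route("translation", lambda q: f"[Shaheen] {q}"),
    "poetry": Route("poetry", lambda q: f"[Diwan] {q}"),
    "general": Route("general", lambda q: f"[Fanar-27B] {q}"),
}


def orchestrate(query: str) -> str:
    # Defense-in-depth: validate the input, route by intent, then
    # validate the generated output before returning it.
    if not moderate(query):
        return "Request declined by safety filter."
    reply = ROUTES[classify_intent(query)].handler(query)
    return reply if moderate(reply) else "Response withheld by safety filter."


print(orchestrate("Please translate this sentence."))
```

The point of the sketch is the layering: safety checks bracket the routed call rather than living inside any one specialist model, which matches the "defense-in-depth" framing in the abstract.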