VietJobs: A Vietnamese Job Advertisement Dataset

Authors

Hieu Pham Dinh, Hung Nguyen Huy, Mo El-Haj

Abstract

VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and employment conditions, covering 16 occupational domains and multiple employment types (full-time, part-time, and internship). Designed to support research in natural language processing and labour market analytics, VietJobs captures substantial linguistic, regional, and socio-economic diversity. We benchmark several generative large language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction-tuned models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT demonstrate notable gains under few-shot and fine-tuned settings, while highlighting challenges in multilingual and Vietnamese-specific modelling for structured labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable foundation for future research on recruitment language, socio-economic representation, and AI-driven labour market analysis. All code and resources are available at: https://github.com/VinNLP/VietJobs.

Metadata

arXiv ID: 2603.05262
Provider: ARXIV
Primary Category: cs.CL
Published: 2026-03-05
Fetched: 2026-03-06 14:20
Journal Reference: Language Resources and Evaluation Conference (LREC) 2026
Comment: 10 pages