TL;DR: A hands-on LREC tutorial on building fair, end-to-end NLP pipelines for low-resource and multilingual languages — covering data collection to MT and downstream tasks — using real-world, community-informed methods and case studies across 10+ diverse languages.
Abstract
This tutorial is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages—from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variation, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.
Schedule
Part 1. Data Annotation Basics
In this section, we introduce key principles, workflows, and best practices in data annotation for NLP, with a focus on quality, scalability, and ethical considerations.
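As a minimal illustration of one quality check commonly used in annotation workflows (not drawn from the tutorial materials), the sketch below computes inter-annotator agreement with Cohen's kappa via scikit-learn; the two-annotator label sets are hypothetical.

```python
# Illustrative sketch: measuring annotation quality with inter-annotator
# agreement. The labels below are made-up example data.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same 8 items.
annotator_a = ["pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu"]
annotator_b = ["pos", "neg", "pos", "pos", "neu", "pos", "neg", "neg"]

# Cohen's kappa corrects raw agreement for chance; higher values indicate
# more reliable annotation guidelines and better-trained annotators.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```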
Part 2. Case Studies
2.1 Pre-training Data Crawling and Filtering
This case study describes our experience in creating a corpus to evaluate language identification systems on web data.
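For readers who want a concrete starting point, the sketch below shows one common way to filter crawled text by language using the publicly available fastText lid.176 model; the model choice, target language, threshold, and example sentences are illustrative assumptions, not the setup used in this case study.

```python
# Illustrative sketch: language-ID filtering of crawled web text with
# fastText (model available from the fastText language identification page).
import fasttext

model = fasttext.load_model("lid.176.ftz")

def keep_if_language(line: str, target: str = "sw", threshold: float = 0.5) -> bool:
    """Keep a crawled line only if the LID model assigns the target
    language (here Swahili, 'sw') with sufficient confidence."""
    labels, probs = model.predict(line.replace("\n", " "))
    lang = labels[0].replace("__label__", "")
    return lang == target and probs[0] >= threshold

crawled = ["Habari za asubuhi, karibu sana.", "Buy cheap watches now!!!"]
filtered = [s for s in crawled if keep_if_language(s)]
print(filtered)
```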
2.2 Obtaining a Machine Translation System
Parallel Sentence Mining for Low-Resource Languages & Collecting Low-Resource Machine Translation Data.
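As a rough illustration of the mining idea (not the tutorial's actual pipeline), the sketch below scores candidate sentence pairs with multilingual LaBSE embeddings from sentence-transformers and a simple cosine-similarity threshold; production mining typically uses margin-based scoring over large collections, and the sentences and threshold here are invented examples.

```python
# Illustrative sketch: scoring candidate parallel sentence pairs with
# multilingual sentence embeddings (LaBSE) and cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

english = ["Where is the nearest hospital?", "The weather is nice today."]
ukrainian = ["Сьогодні гарна погода.", "Де найближча лікарня?"]

emb_en = model.encode(english, convert_to_tensor=True, normalize_embeddings=True)
emb_uk = model.encode(ukrainian, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between all candidate pairs; keep the best match per
# source sentence above a confidence threshold as a mined pair.
scores = util.cos_sim(emb_en, emb_uk)
for i, sent in enumerate(english):
    j = int(scores[i].argmax())
    if float(scores[i][j]) > 0.7:
        print(f"{sent}  <->  {ukrainian[j]}  ({float(scores[i][j]):.2f})")
```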
2.3 Acquiring Systems for Downstream Tasks
Cross-Lingual Knowledge Transfer for Text Classification in Low-Resource Languages & Data Labeling for Image Captioning and Visual Question Answering in Low-Resource Dialects.
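To make the cross-lingual transfer idea concrete, here is a minimal zero-shot classification sketch built on a publicly available multilingual NLI checkpoint (joeddav/xlm-roberta-large-xnli) via Hugging Face transformers; the model, labels, and input text are illustrative assumptions rather than the systems covered in the case study.

```python
# Illustrative sketch: zero-shot cross-lingual text classification with a
# multilingual NLI model, transferring an English-defined label set to
# text in another language without task-specific training data.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")

# Hypothetical Ukrainian news snippet ("The government announced a new
# support program for farmers.")
text = "Уряд оголосив нову програму підтримки фермерів."
labels = ["politics", "sports", "technology"]

result = classifier(text, candidate_labels=labels)
print(result["labels"][0], round(result["scores"][0], 2))
```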
Part 3. Expert Interviews on Benchmark Creation in Low-Resource Settings
We provide an overview of insights gained from interviews with NLP experts actively involved in the creation of benchmarks for positive social impact applications.
Presenters
Ekaterina Artemova
Toloka AI, German UDS
Laurie Burchell
Senior Research Engineer @ Common Crawl Foundation
Daryna Dementieva
Postdoc @ TUM
Shu Okabe
Postdoc @ TUM
Mariya Shmatova
Toloka AI
Pedro Ortiz Suarez
Principal Research Scientist @ Common Crawl Foundation
Tutorial material
We will open-source the slides from our tutorial soon.
BibTeX
@article{lowresourcetutoriallrec2026,
title={Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies},
author={Artemova, Ekaterina and Burchell, Laurie and Dementieva, Daryna and Okabe, Shu and Shmatova, Mariya and Ortiz Suarez, Pedro},
journal={LREC},
year={2026},
url={https://tum-nlp.github.io/low-resource-tutorial/}
}