Tutorial at the International Conference on Language Technologies for Low-resource Languages (LaTeLL 2026) — a forum dedicated to NLP for the 85% of the world's languages often left behind.
This tutorial is designed for NLP practitioners, researchers, and developers who work in multilingual settings and with low-resource languages and who want to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages, from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning.
The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.
We introduce key principles, workflows, and best practices in data annotation for NLP, with a focus on quality, scalability, and ethical considerations.
Three real-world pipelines showing how to build language technology when the data simply isn't there yet.
Our experience creating a corpus to evaluate language identification systems on web data.
Parallel sentence mining for low-resource languages, plus collecting low-resource MT data.
Cross-lingual knowledge transfer for text classification in low-resource languages, and data labeling for image captioning and visual question answering in low-resource dialects.
An overview of insights gained from interviews with NLP experts actively involved in the creation of benchmarks for positive social impact applications.
Toloka AI, German UDS
Senior Research Engineer
Common Crawl Foundation
Postdoc @ TUM
Postdoc @ TUM
Toloka AI
Principal Research Scientist
Common Crawl Foundation
@inproceedings{lowresourcetutoriallatell2026,
title = {Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies},
author = {Artemova, Ekaterina and Burchell, Laurie and Dementieva, Daryna and Okabe, Shu and Shmatova, Mariya and Ortiz Suarez, Pedro},
booktitle = {Proceedings of LaTeLL 2026 — International Conference on Language Technologies for Low-resource Languages},
year = {2026},
address = {Fes, Morocco},
url = {https://arxiv.org/abs/2512.14576}
}