Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies @ LaTeLL 2026

Overview

Abstract

This tutorial is designed for NLP practitioners, researchers, and developers who work with multilingual data and low-resource languages and who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages: from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning.

The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.

Outline

Schedule

Part 1

Data Annotation Basics

We introduce key principles, workflows, and best practices in data annotation for NLP, with a focus on quality, scalability, and ethical considerations.

Part 2

Case Studies

Three real-world pipelines showing how to build language technology when the data simply isn't there yet.

2.1 — Pre-training Data Crawling and Filtering

Our experience creating a corpus to evaluate language identification systems on web data.

2.2 — Obtaining a Machine Translation System

Parallel sentence mining for low-resource languages, plus collecting low-resource MT data.

2.3 — Downstream Task System Acquisition

Cross-lingual knowledge transfer for text classification in low-resource languages, and data labeling for image captioning and visual question answering in low-resource dialects.

Part 3

Expert Interviews on Benchmark Creation in Low-Resource Settings

An overview of insights gained from interviews with NLP experts actively involved in the creation of benchmarks for positive social impact applications.

The Team

Presenters

Ekaterina Artemova

Toloka AI, German UDS

Laurie Burchell

Senior Research Engineer
Common Crawl Foundation

Daryna Dementieva

Postdoc @ TUM

Shu Okabe

Postdoc @ TUM

Mariya Shmatova

Toloka AI

Pedro Ortiz Suarez

Principal Research Scientist
Common Crawl Foundation

Citation

BibTeX

@inproceedings{lowresourcetutoriallatell2026,
  title     = {Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies},
  author    = {Artemova, Ekaterina and Burchell, Laurie and Dementieva, Daryna and Okabe, Shu and Shmatova, Mariya and Ortiz Suarez, Pedro},
  booktitle = {Proceedings of LaTeLL 2026 — International Conference on Language Technologies for Low-resource Languages},
  year      = {2026},
  address   = {Fes, Morocco},
  url       = {https://arxiv.org/abs/2512.14576}
}
