Low-Resource, High-Impact
LaTeLL 2026 · Tutorial

Low-Resource, High-Impact:
Building Corpora for Inclusive Language Technologies

Tutorial at the International Conference on Language Technologies for Low-resource Languages (LaTeLL 2026) — a forum dedicated to NLP for the 85% of the world's languages often left behind.

Ekaterina (Katya) Artemova¹ · Laurie Burchell² · Daryna Dementieva³ · Shu Okabe³ · Mariya Shmatova¹ · Pedro Ortiz Suarez²
¹Toloka  ·  ²Common Crawl Foundation  ·  ³TUM
30 September – 2 October 2026
Fes, Morocco
Hands-on tutorial · 10+ languages
TL;DR: A hands-on tutorial on building fair, end-to-end NLP pipelines for low-resource languages in multilingual settings, from data collection through machine translation to downstream tasks, using real-world, community-informed methods and case studies across 10+ diverse languages.
◆ ◆ ◆
Overview

Abstract


This tutorial is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages — from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning.

The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.

Outline

Schedule


Part 1

Data Annotation Basics

We introduce key principles, workflows, and best practices in data annotation for NLP, with a focus on quality, scalability, and ethical considerations.

Part 2

Case Studies

Three real-world pipelines showing how to build language technology when the data simply isn't there yet.

2.1 — Pre-training Data Crawling and Filtering

Our experience creating a corpus to evaluate language identification systems on web data.

2.2 — Obtaining a Machine Translation System

Parallel sentence mining for low-resource languages, plus collecting low-resource MT data.

2.3 — Downstream Task System Acquisition

Cross-lingual text classification knowledge transfer for low-resource languages, and data labeling for image captioning and visual question answering in low-resource dialects.

Part 3

Expert Interviews on Benchmark Creation in Low-Resource Settings

An overview of insights gained from interviews with NLP experts actively involved in the creation of benchmarks for positive social impact applications.

Further Reading

Reading List


The Team

Presenters


Ekaterina Artemova

Toloka AI, German UDS

Laurie Burchell

Senior Research Engineer
Common Crawl Foundation

Daryna Dementieva

Postdoc @ TUM

Shu Okabe

Postdoc @ TUM

Mariya Shmatova

Toloka AI

Pedro Ortiz Suarez

Principal Research Scientist
Common Crawl Foundation

Resources

Tutorial Material


Slides and hands-on materials will be released here closer to the conference.

Citation

BibTeX


@inproceedings{lowresourcetutoriallatell2026,
  title     = {Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies},
  author    = {Artemova, Ekaterina and Burchell, Laurie and Dementieva, Daryna and Okabe, Shu and Shmatova, Mariya and Ortiz Suarez, Pedro},
  booktitle = {Proceedings of LaTeLL 2026 — International Conference on Language Technologies for Low-resource Languages},
  year      = {2026},
  address   = {Fes, Morocco},
  url       = {https://arxiv.org/abs/2512.14576}
}