Toloka, Common Crawl Foundation, TUM
16 May 2026 (Sat.) @ LREC2026
Palau de Congressos de Palma, Palma, Mallorca, Spain
Morning session, Room 6

Abstract

This tutorial is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages who seek to create more equitable and socially impactful language technologies. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages—from data collection and web crawling to parallel sentence mining, machine translation, and downstream applications such as text classification and multimodal reasoning. The tutorial presents strategies for tackling the challenges of data scarcity and cultural variance, offering hands-on methods and modeling frameworks. We will focus on fair, reproducible, and community-informed development approaches, grounded in real-world scenarios. We will showcase a diverse set of use cases covering over 10 languages from different language families and geopolitical contexts, including both digitally resource-rich and severely underrepresented languages.


Schedule

Part 1. Data Annotation Basics

In this section, we introduce key principles, workflows, and best practices in data annotation for NLP, with a focus on quality, scalability, and ethical considerations.

Part 2. Case Studies

2.1 Pre-training Data Crawling and Filtering

This case study describes our experience in creating a corpus to evaluate language identification systems on web data.

2.2 Obtaining a Machine Translation System

This case study covers two topics: parallel sentence mining for low-resource languages, and collecting machine translation data for low-resource languages.

2.3 Acquiring Systems for Downstream Tasks

This case study covers two topics: cross-lingual knowledge transfer for text classification in low-resource languages, and data labeling for image captioning and visual question answering in low-resource dialects.

Part 3. Expert Interviews on Benchmark Creation in Low-Resource Settings

We provide an overview of insights gained from interviews with NLP experts actively involved in creating benchmarks for applications with positive social impact.

Presenters

Presenter 1

Ekaterina Artemova

Toloka AI, German UDS

Presenter 2

Laurie Burchell

Senior Research Engineer @ Common Crawl Foundation

Presenter 3

Daryna Dementieva

Postdoc @ TUM

Presenter 4

Shu Okabe

Postdoc @ TUM

Presenter 5

Mariya Shmatova

Toloka AI

Presenter 6

Pedro Ortiz Suarez

Principal Research Scientist @ Common Crawl Foundation

Tutorial material

We will open-source the slides from our tutorial soon.

BibTeX

@article{lowresourcetutoriallrec2026,
  title={Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies},
  author={Ekaterina Artemova and Laurie Burchell and Daryna Dementieva and Shu Okabe and Mariya Shmatova and Pedro Ortiz Suarez},
  journal={LREC},
  year={2026},
  url={https://tum-nlp.github.io/low-resource-tutorial/}
}