Our Datasets and AI Tools
Build better language models with expertly curated African language datasets and enterprise-grade, AI-powered tools for more accurate, inclusive language technology.
Available Datasets
| Dataset Name | Description | Curation Method | Records Curated | Languages | Data Type | Validation | Access | Action | Curation Year |
|---|---|---|---|---|---|---|---|---|---|
| Storytelling | African language storytelling corpus for speech recognition and NLP tasks | Collaborated | 24 | Igbo, Hausa, Yoruba, Dholuo | Speech + Transcript | Human QA | Public | Direct Access | 2026 |
| HealthBench-Africa Extension | Multilingual medical evaluation dataset extending the OpenAI HealthBench benchmark into African languages for AI safety and cross-lingual model evaluation | Adapted | 500 | Igbo, Yoruba, Nigerian Pidgin, Kikuyu | Non-Parallel Text | AI Validated + Human QA | Public | Direct Access | 2026 |
| Swahili Parallel Text Extension | English extension of the Kidaw'ida, Kalenjin and Dholuo parallel corpus via MT and human validation | Adapted | 29,230 | Swahili ↔ English, Kidaw'ida, Kalenjin, Dholuo | Parallel Text - MT | AI Validated + Human QA | Public | Direct Access | 2026 |
| XNLI | Cross-lingual natural language inference for reasoning tasks | Adapted | 31,500 | Igbo, Kinyarwanda, Kikuyu, Luo, Yoruba, Hausa, Nigerian Pidgin | Non-Parallel Text | Human QA | Public | Direct Access | 2025 |
| KKD Parallel Corpora | Kiswahili-African language parallel text for machine translation | Adapted | 29,231 | Kiswahili ↔ English, Kidaw'ida, Kalenjin, Dholuo | Parallel Text - MT | Human QA | Public | Direct Access | 2025 |
| MRL-Benchmark | Commonsense reasoning benchmarking dataset for LLMs | Collaborated | 400 | Nigerian Pidgin, Yoruba | Non-Parallel Text | Human QA | Public | Direct Access | 2025 |
AI-Powered Language Tools
Language Data Translation Validation Tool
Automatically validate translation accuracy and cultural appropriateness at scale.
Have a question or want to get started? Reach out to our team.
For technical inquiries or data access requests, please contact us at amol@tonative.org.