- Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages While natural language processing tools have been developed extensively for some of the world's languages, a significant portion of the world's over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset by first developing applicable topics and employing a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code. 5 authors · May 15, 2023
1 Finding Answers from the Word of God: Domain Adaptation for Neural Networks in Biblical Question Answering Question answering (QA) has significantly benefitted from deep learning techniques in recent years. However, domain-specific QA remains a challenge due to the significant amount of data required to train a neural network. This paper studies the answer sentence selection task in the Bible domain and answer questions by selecting relevant verses from the Bible. For this purpose, we create a new dataset BibleQA based on bible trivia questions and propose three neural network models for our task. We pre-train our models on a large-scale QA dataset, SQuAD, and investigate the effect of transferring weights on model accuracy. Furthermore, we also measure the model accuracies with different answer context lengths and different Bible translations. We affirm that transfer learning has a noticeable improvement in the model accuracy. We achieve relatively good results with shorter context lengths, whereas longer context lengths decreased model accuracy. We also find that using a more modern Bible translation in the dataset has a positive effect on the task. 2 authors · Oct 26, 2018
- The eBible Corpus: Data and Model Benchmarks for Bible Translation for Low-Resource Languages Efficiently and accurately translating a corpus into a low-resource language remains a challenge, regardless of the strategies employed, whether manual, automated, or a combination of the two. Many Christian organizations are dedicated to the task of translating the Holy Bible into languages that lack a modern translation. Bible translation (BT) work is currently underway for over 3000 extremely low resource languages. We introduce the eBible corpus: a dataset containing 1009 translations of portions of the Bible with data in 833 different languages across 75 language families. In addition to a BT benchmarking dataset, we introduce model performance benchmarks built on the No Language Left Behind (NLLB) neural machine translation (NMT) models. Finally, we describe several problems specific to the domain of BT and consider how the established data and model benchmarks might be used for future translation efforts. For a BT task trained with NLLB, Austronesian and Trans-New Guinea language families achieve 35.1 and 31.6 BLEU scores respectively, which spurs future innovations for NMT for low-resource languages in Papua New Guinea. 10 authors · Apr 19, 2023