technology & linguistic justice

Processing languages of the Global South

Linguistic justice and the Linguistic Digital Divide

Years of colonialism and linguistic imperialism have left many languages of the Global South, and by extension the communities that speak them, disadvantaged in comparison to the languages spoken in dominant nations and in nations where colonial languages have replaced the local languages. The struggle for linguistic justice works to create multilingual spaces where all languages are valued equally and speakers of different languages benefit from listening to and sharing with one another.

In the digital era, an important feature of linguistic injustice is the Linguistic Digital Divide (LDD), which separates a small number of digitally privileged languages (and the communities that speak, read, and write them) from other languages (and the communities that speak, read, and write them). The privileged languages include not only some spoken by very many people, such as English, Mandarin Chinese, and Spanish, but also some spoken by communities of fewer than 10 million people, such as Finnish, Danish, and Catalan. The disadvantaged languages include many that are spoken by relatively few people but also those spoken by more than 50 million people, such as Telugu and Burmese.

The LDD is represented mainly by the relative lack of documents and computational resources for the disadvantaged languages. It has several consequences:

it denies the advances of the Digital Revolution to the majority of people
it inhibits the participation of the majority in solving urgent national and international problems
it exaggerates class divisions within linguistic communities (because the well-off are often capable of using one of the privileged languages)
it diminishes the role of many languages, most of them already marginalized through a history of imperialism

In some sense, the LDD is only the latest episode in a long history of technological revolutions that empower limited groups of people while maintaining, and even strengthening, existing divisions based on class, race, and gender. Gasser (2018) is a recent paper (in Spanish) that discusses this history.

The Linguistic Digital Divide is not simply a technological problem, but technology may be able to play a role in supporting these under-resourced languages. This is the motivation for the research within the project Processing Languages of the Global South.

PLoGS: software and research

Processing Languages of the Global South (PLoGS) is a project designed to support languages of the Global South that are on the disadvantaged end of the LDD through the development of computational tools.

All the software we develop is free and available under a GNU General Public License, according to which you can use it for any purpose, change it to suit your needs, and share it with others.

Our work has focused on two types of tools: those for processing the morphology (structure of words) of particular languages and those for assisting in the translation of documents for particular pairs of languages.

Morphological processing (morfo)

For languages whose words have complex structure, morphological processing is an essential component in many applications. "Morphological processing" refers to two distinct processes: analysis, which extracts the root and grammatical properties of a given word, and generation, which performs the reverse process. For example, given the Spanish word cambies, a morphological analyzer would recognize that it's a verb with the infinitive cambiar, in the present subjunctive, and with a second person singular subject. And given the infinitive cambiar and the properties suj=2p and tmp=subj_pres, a morphological generator would produce the word cambies.
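The two directions can be illustrated with a toy sketch in Python. This is not how morfo works internally; it is a minimal table-lookup illustration that covers only a few singular forms of regular Spanish -ar verbs. The feature names suj and tmp and the values 2p and subj_pres follow the example above; every other label, the suffix table, the tiny lexicon, and the function names are assumptions made up for this illustration.

```python
from typing import List, NamedTuple


class Analysis(NamedTuple):
    infinitive: str  # dictionary form of the verb
    suj: str         # subject person, e.g. "2p" (label taken from the text above)
    tmp: str         # tense/mood, e.g. "subj_pres"


# Hypothetical suffix table: regular -ar verbs, singular subjects only.
AR_SUFFIXES = {
    ("subj_pres", "1p"): "e",
    ("subj_pres", "2p"): "es",
    ("subj_pres", "3p"): "e",
    ("ind_pres", "1p"): "o",
    ("ind_pres", "2p"): "as",
    ("ind_pres", "3p"): "a",
}

# Hypothetical lexicon of infinitives the toy system knows about.
KNOWN_INFINITIVES = {"cambiar", "hablar"}


def generate(infinitive: str, suj: str, tmp: str) -> str:
    """Generation: infinitive + grammatical properties -> word form."""
    if infinitive not in KNOWN_INFINITIVES:
        raise ValueError(f"unknown verb: {infinitive}")
    stem = infinitive[:-2]  # strip the "-ar" ending
    return stem + AR_SUFFIXES[(tmp, suj)]


def analyze(word: str) -> List[Analysis]:
    """Analysis: word form -> every (infinitive, properties) reading that fits."""
    readings = []
    for (tmp, suj), suffix in AR_SUFFIXES.items():
        if word.endswith(suffix):
            infinitive = word[: -len(suffix)] + "ar"
            if infinitive in KNOWN_INFINITIVES:
                readings.append(Analysis(infinitive, suj, tmp))
    return readings


if __name__ == "__main__":
    print(analyze("cambies"))
    print(generate("cambiar", suj="2p", tmp="subj_pres"))
```

Note that analysis returns a list: a form such as cambie fits both the first and third person readings in this table, so an analyzer generally has to report every reading that matches.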

We have developed partial or relatively complete morphological processing software for six under-resourced languages: the Ethiopian and Eritrean languages Amharic, Oromo, and Tigrinya, and the indigenous American languages Guarani, Quechua, and Quiché. The code and morphological data for these languages, as well as Spanish, are available as the program morfo, which can be downloaded here: https://github.com/hltdi/morfo/. A version of morfo that includes only the languages Amharic, Oromo, and Tigrinya, called HornMorpho, can be downloaded here: https://github.com/hltdi/HornMorpho/.

Normally morphological processing is a component of a larger language processing system, for example, one that does machine translation. So our software would normally be used by a computer scientist who is developing such a system. In addition, we have developed a web application for the morphological analysis of the seven languages that is usable by anyone. It can be found here: http://plogs.soic.indiana.edu/morfo/.

Computer-assisted translation

One important way to increase the available materials in under-resourced languages is through the translation of materials from other languages. Although machine translation is not yet capable of producing publication-quality materials, computer-assisted systems can speed up the work of human translators. Most computer-assisted translation (CAT) software relies on translation memories, large databases of translation examples. For under-resourced languages, these databases don't exist yet, so the framework we are developing relies on grammatical knowledge to suggest translations for the user and saves the user's translations in a growing translation memory. Eventually the translation memory should improve the performance of the CAT system. Here is a technical description of the theory behind this approach, Minimal Dependency Translation (MDT).
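The workflow described above (suggest from the translation memory when possible, fall back to grammatical knowledge otherwise, and save what the user accepts) can be sketched as follows. This is only an illustration of the control flow, not an implementation of MDT: the word-by-word lexicon lookup is a deliberately crude stand-in for real grammatical knowledge, and all class names, function names, and the two-word Spanish-Guarani lexicon are assumptions introduced for the example.

```python
from typing import Dict, Optional, Tuple


class TranslationMemory:
    """A growing store of source sentences and the translations users accepted."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def lookup(self, source: str) -> Optional[str]:
        return self._entries.get(source)

    def store(self, source: str, target: str) -> None:
        self._entries[source] = target


def grammar_suggest(source: str, lexicon: Dict[str, str]) -> str:
    """Stand-in for grammar-based suggestion: translate word by word.

    Words missing from the lexicon are passed through unchanged.
    """
    return " ".join(lexicon.get(word, word) for word in source.split())


def suggest(
    source: str, memory: TranslationMemory, lexicon: Dict[str, str]
) -> Tuple[str, str]:
    """Return (suggestion, origin), preferring the translation memory."""
    remembered = memory.lookup(source)
    if remembered is not None:
        return remembered, "memory"
    return grammar_suggest(source, lexicon), "grammar"


if __name__ == "__main__":
    memory = TranslationMemory()
    lexicon = {"casa": "óga", "grande": "guasu"}  # illustrative entries only

    suggestion, origin = suggest("casa grande", memory, lexicon)
    print(origin, suggestion)

    # The translator accepts (or edits) the suggestion; the result is saved.
    memory.store("casa grande", suggestion)
    print(suggest("casa grande", memory, lexicon))
```

A practical system would also match similar, not just identical, sentences in the memory; exact lookup is used here only to keep the sketch short.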

We are currently developing two web applications that implement such a system: one for the language pair Spanish-Guarani, called Mainumby ("hummingbird" in Guarani), and one for the language pair English-Amharic, called Mit'mit'a. These systems will be the first practical implementations of MDT. The code and data for these two programs are available here: https://github.com/hltdi/mainumby/, https://github.com/hltdi/mitmita/.