Processing languages of the Global South

What's this project about?

PLoGS is a project designed to support languages of the Global South through the development of computational tools. Most such languages are disadvantaged because of years of colonialism and linguistic imperialism. The so-called Linguistic Digital Divide leaves them further marginalized in comparison to languages like English, Spanish, and Chinese, with few materials accessible on the internet and few resources available for creating new materials. The Linguistic Digital Divide is not simply a technological problem, but technology may be able to play a role in supporting these under-resourced languages.

Software and research

All the software is free and available under a GNU General Public License, according to which you can use it for any purpose, change it to suit your needs, and share it with others.

Our work has focused on two types of tools: those for processing the morphology (structure of words) of particular languages and those for assisting in the translation of documents for particular pairs of languages.

Morphological processing (morfo)

For languages whose words have complex structure, morphological processing is an essential component of many applications. "Morphological processing" refers to two distinct processes: analysis, which extracts the root and grammatical properties of a given word, and generation, which does the reverse. For example, given the Spanish word cambies, a morphological analyzer would recognize that it's a form of the verb with the infinitive cambiar, in the present subjunctive, with a second person singular subject. And given the infinitive cambiar and the properties suj=2p and tmp=subj_pres, a morphological generator would produce the word cambies.
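The two directions can be illustrated with a toy sketch in Python. This is not morfo's code or API: the function names, the tiny ending table, and the root list are all made up for illustration, and it only covers a few forms of regular -ar verbs. The labels suj=2p and tmp=subj_pres are taken from the example above; the other labels (1p, 3p, pres) are assumed by analogy.

```python
# Toy illustration only: NOT morfo's actual API, data, or algorithm.
# Endings for a few singular forms of regular Spanish -ar verbs,
# keyed by (suj, tmp). "2p" and "subj_pres" come from the text;
# "1p", "3p", and "pres" are assumed labels.
ENDINGS = {
    ("1p", "pres"): "o",
    ("2p", "pres"): "as",
    ("3p", "pres"): "a",
    ("1p", "subj_pres"): "e",
    ("2p", "subj_pres"): "es",
    ("3p", "subj_pres"): "e",
}

# A hypothetical mini-lexicon of verb stems, so that words like "tres"
# are not mistaken for verb forms.
ROOTS = {"cambi", "habl"}


def generate(infinitive, suj, tmp):
    """Generation: infinitive + grammatical properties -> word form."""
    if not infinitive.endswith("ar"):
        raise ValueError("this sketch only handles -ar verbs")
    stem = infinitive[:-2]
    return stem + ENDINGS[(suj, tmp)]


def analyze(word):
    """Analysis: word form -> list of possible (infinitive, properties)."""
    analyses = []
    for (suj, tmp), ending in ENDINGS.items():
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if stem in ROOTS:
                analyses.append({"inf": stem + "ar", "suj": suj, "tmp": tmp})
    return analyses


print(generate("cambiar", "2p", "subj_pres"))
print(analyze("cambies"))
print(analyze("cambie"))  # ambiguous: more than one analysis
```

Note that analysis returns a list: a form like cambie is ambiguous between first and third person, so an analyzer has to be able to report several analyses for one word.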

We have developed partial or relatively complete morphological processing software for six under-resourced languages: the Ethiopian and Eritrean languages Amharic, Oromo, and Tigrinya, and the indigenous American languages Guarani, Quechua, and Quiché. The code and morphological data for these languages, as well as for Spanish, are available as the program morfo, which can be downloaded here: A version of morfo that includes only Amharic, Oromo, and Tigrinya, called HornMorpho, can be downloaded here:

Normally morphological processing is a component of a larger language processing system, for example, one that does machine translation. So our software would normally be used by a computer scientist who is developing such a system. However, we have also developed a web application for the morphological analysis of the seven languages that is usable by anyone. It can be found here: .

Computer-assisted translation

One important way to increase the available materials in under-resourced languages is through the translation of materials from other languages. Although machine translation is not yet capable of producing publication-quality materials, computer-assisted systems can speed up the work of human translators. Most computer-assisted translation (CAT) software relies on translation memories, large databases of translation examples. For under-resourced languages, these databases don't exist yet, so the framework we are developing relies on grammatical knowledge to suggest translations for the user and saves the user's translations in a growing translation memory. Eventually the translation memory should improve the performance of the CAT system. Here is a technical description of the theory behind this approach, Minimal Dependency Translation (MDT).
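The overall workflow can be sketched roughly as follows in Python. This is only an illustration of the idea of a translation memory that grows as the user works; it is not MDT and not the code of our systems. The class, the function names, and the stand-in "grammar" function are all hypothetical, and the sketch only matches segments exactly, whereas real translation memories usually also handle near matches.

```python
# Toy illustration only: NOT MDT and not the actual CAT system code.

class TranslationMemory:
    """Stores translations the user has approved, keyed by source segment."""

    def __init__(self):
        self._entries = {}

    def lookup(self, source):
        # Exact match only; returns None when the segment is unknown.
        return self._entries.get(source)

    def save(self, source, target):
        self._entries[source] = target

    def __len__(self):
        return len(self._entries)


def suggest(source, memory, grammar_suggest):
    """Prefer a stored translation; otherwise fall back to grammar rules.

    Returns (suggestion, origin), where origin is "memory" or "grammar".
    """
    stored = memory.lookup(source)
    if stored is not None:
        return stored, "memory"
    return grammar_suggest(source), "grammar"


def toy_grammar(source):
    # Stand-in for the grammar-based component: it just marks the
    # segment as a draft instead of really translating it.
    return "[draft] " + source


memory = TranslationMemory()
segment = "source sentence"

suggestion, origin = suggest(segment, memory, toy_grammar)
print(origin, suggestion)

# The user corrects the draft, and the final version is saved.
memory.save(segment, "user-approved translation")

suggestion, origin = suggest(segment, memory, toy_grammar)
print(origin, suggestion)
```

The first time a segment is seen, the suggestion comes from the grammar component. Once the user's translation has been saved, the same segment is answered from the memory instead.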

We are currently developing two web applications that implement such a system: one for the language pair Spanish-Guarani, called Mainumby ("hummingbird" in Guarani), and one for the language pair English-Amharic, called Mit'mit'a. These systems will be the first practical implementations of MDT. The code and data for these two programs are available here: