The differentiating factor for Catalan: the free and open software community

I have written about the importance of open technologies and communities for the digital survival and sovereignty of Catalan.

The online publication Pensem invited experts to write a series of articles on the topic “Catalan, in the digital world.” I was fortunate enough to be asked as well, and I wrote an article on the importance of free software and the efforts of the open source community, specifically within the field of language technologies. Here is an English translation of it; I apologize in advance for any mistakes.

For many years, various entities have been promoting the inclusion of Catalan in market products. The efforts of the Platform for the Language (Plataforma per la Llengua) are not only invaluable but also give us a record of this struggle. One of its battles is the inclusion of Catalan in digital products. Activity in this direction has accelerated because of the important innovations in language technologies over the last five years. As the use of neural networks spread, the number of machine learning products on the market increased. From machine translation to speech synthesis, these tools make our lives easier, and some are almost essential. But we know that the presence of Catalan is still insufficient. A very important example is that none of the 32 voice assistants on the market speak Catalan, according to the InformeCat 2020.

With this article I want to highlight the remedies that have been taken to address this situation, focusing on the activities of various communities, especially the open and free software community. In some respects, the solutions provided by the open and free software community are ahead of those of private companies and universities in their applicability and accessibility. In my opinion, this is a differentiating factor that makes the Catalan language special; it also opens the path to technological sovereignty, to ensure the continuity of the language’s presence in the digital sphere. Before delving into the concepts, let’s start with the facts and news from the sector.

What are the strengths and weaknesses of Catalan in the digital sector?

The strongest point of Catalan is its potential to mobilize the community. The most obvious example is the work of Softcatalà and its mobilization of the community to localize open and free software. Thanks to these efforts we have essential tools such as Firefox, LibreOffice and Ubuntu in Catalan. In addition, their most impactful project is the impetus they gave to Common Voice, Mozilla’s project to collect open and free speech datasets. Thanks to Softcatalà’s promotion, Catalan is now one of the largest languages in the Common Voice corpus: it is the fourth language on the platform (after English, German and Kinyarwanda), with 755 hours of validated recordings (data from May 2021).

At Col·lectivaT, the non-profit cooperative of which I am a co-founder, we have another strategy: to generate open datasets and tools by taking advantage of existing resources. During our short life we have created two important datasets, the TV3 dataset and ParlamentParla, drawing on the programming of TV3 and the recordings of the Parliament of Catalonia, respectively.

These activities are currently feeding more specific technologies and prototypes, the most prominent example being assistent.cat, the first virtual assistant in Catalan. It is the localized version of Mycroft, the open virtual assistant. For now, Mycroft does not offer Catalan on its own devices (in fact, it does not support any language other than English), and the development of the tool was possible thanks to multiple actors in the community, from members of Softcatalà and Col·lectivaT to Tradumàtica translation students from the UAB and other developers. In addition, this proof of concept was possible thanks to other key open technologies, specifically our Catotron, the only neural-network-based speech synthesis system for Catalan, and the speech recognition system Vosk-Kaldi, whose accuracy is comparable to, if not better than, that of Google’s service. All of these tools are developed by the free software community.

The support of the Department of Culture during the last three years for the development of these tools and especially the expansion of the potential of the community itself should be highlighted. Through their support “to encourage the use of Catalan” they are giving an essential boost to the free software community.

Having said all this, from this point of view the weakest point of Catalan is the development of large-scale commercial products. Although there are examples of products intended for end users, such as the Softcatalà translator, so far there is no example of these newly developed technologies being disseminated and adopted in the commercial market in a specific product. This problem has many facets, and to understand the lack of innovative products in the market we must first consider the logic of capital.

Here we face the issue of companies in the global market and their lack of interest in providing products in Catalan. The main reason for this is that companies see Spain as a single market. That is, for them, supporting the “most common” language is enough to penetrate the peninsular territory. This is the logic of capital and business, but it is not an inescapable reality.

First of all, this logic is based on companies’ perception that there is a lack of demand for products in Catalan. There is sufficient evidence of market interest, although sometimes it is not obvious until a project or event with massive support appears. A recent example is the case of Maori, the indigenous language of New Zealand. After a community project to collect recordings of speakers, large companies began to take a strong interest in the language. When they could not buy the rights to use the dataset, they launched an effort to collect a commercial dataset. The owner of the community dataset currently has funding to develop a mobile app to facilitate language learning, with a version already available.

In the case of Catalan, we know that the same provider that manages that collection of speech data, and which probably works for Google, also carried out another project for Catalan. This shows us that Catalan is not completely forgotten by large multinational companies, so it is a matter of time before the products reach the market. But we need to consider whether this is the desirable way to bring advanced technologies to market for mass consumption.

How does the promotion of Catalan fit in a bilingual context in which Spanish is prioritized?

After talking about large multinational companies (or GAFAM; Google, Amazon, Facebook, Apple, Microsoft) and their possible interest in Catalan, we need to talk about another topic relevant to the situation of Catalan in Spain: technological sovereignty.

When we talk about ‘technological sovereignty’ we are generally referring to control over personal data, control over the processes or algorithms that run behind technological services, and the possibility of repairing and/or modifying devices. GAFAM products violate at least one of these principles: from the impossibility of modifying Apple devices, to the exploitation of personal data by Google, to the Facebook algorithms that categorize users. In addition to these generic problems, we have another problem specific to the situation of Catalan as a minority language: the decision whether or not to offer Catalan in the services of large companies is their prerogative. This dependence on the will of these companies is the problem that Catalan is currently suffering. Also, even if they decide to integrate it, there is no guarantee that these services will be maintained in the future. That is, technological sovereignty not only involves control over the use of technological products, but also ensures the longevity of the technologies developed. The existence of Catalan in Spain will always imply a certain danger for its digital survival.

Within the context of technological products, it is evident that the dominance of Spanish over Catalan is due to the logic of capital and not to an oppressive state or a discriminatory society (even if one feeds the other). But one of the ways to ensure the continued growth of Catalan in the digital field is to adopt a stance of technological sovereignty. It is undeniable that Catalan is a living language, with a considerable presence in the various media, from books to audiovisual production, and therefore has every right to take its place in the digital field.

What actions should be promoted to make Catalan available in all digital services?

From this point of view of technological sovereignty, it is important to support open and free technologies, to drive the creation of open data sets for the use of developers. The most important aspect of executing these actions is community support. The creation of formal and informal networks of developers, the interest of universities in contributing to existing projects and the organization of various activities such as hackathons are some of the most concrete actions to ensure the continuous development of language technologies in Catalan.


In this scenario there is another very important actor: the public administration. Another way to ensure the maintenance of developed technologies and to generate linguistic data for improving technological products is to invest in a digital infrastructure for the public administration. The use of open technologies in public services would ensure the maintenance of these technologies. In the other direction, public services could be important sources of data for the improvement of language technologies, as with ParlamentParla, the speech corpus built from parliamentary recordings, which is already being used by multiple open projects. In fact, there is already a plan proposed by the Generalitat to invest in a language technology infrastructure that also provides for a language data platform.

What actors or resources would need to be activated to make this possible?

Within this proposed value chain, we are talking about some very clear actors, such as community organizations and the public administration. But there remains a very important piece, the commercialization of these open technologies, which involves another type of actor.

Commercialization refers to the development of products, from the scalability of the technology to distribution and the ability to serve customers. These tasks can be carried out only by private entities, and those that will have to take the bet are the SMEs of the territory, that is, the manufacturers that have local roots.

That way, we can develop a network of technological actors that is both sovereign and resilient, in the sense that the maintenance of the technologies will be ensured by the involvement of local private entities.

This strategy is important so as not to depend on large multinational companies. The Platform for the Language and the public administration have been pressuring GAFAM companies to integrate Catalan for years, but the results are still very limited. Appealing to authority has not yet yielded a concrete result in speech products, but in the meantime the community is creating new technologies, prototypes and simple products. It is now necessary to activate commercial initiatives in the territory and make them scalable. This will not only fill a gap in the provision of services (supply), but will also indirectly push large companies, because it will clearly show the preference of Catalan speakers for services in their language, i.e., a specific demand.


In short, in addition to investment, it is necessary to adopt a strategy of technological sovereignty that involves actors ranging from the free software community and private initiatives to the public administration. Open innovation has the potential to drive local initiatives, motivate large companies and also energize the community to adopt and maintain these key technologies. Over the last three years, Catalan has made a fairly significant leap; however, we still have one more step to take to bring the technological experience and new products to the market.



© 2021. All rights reserved.
