Does the internet speak your language? Launching the first-ever State of the Internet’s Languages report

31 March 2022

There are more than 7000 (signed and spoken) languages in the world, yet only a handful of them can be fully experienced online. As a result, the internet we know is not even close to being as multilingual and multimodal as we are in physical, embodied life. Language is also much more than simply a means to communicate: it is how we express what we think, believe, and know. To be multilingual is to better honor and affirm the full richness and textures of our many selves and our different worlds. But what would a truly multilingual and multimodal internet look, feel, and sound like?

Visit the State of the Internet’s Languages Report →

The first-ever State of the Internet’s Languages report, co-organized by Whose Knowledge?, the Oxford Internet Institute, and the Centre for Internet and Society (India), was born out of this question. The project is structured around three axes: mapping out the current status of languages online; identifying challenges and opportunities to make the web more multilingual; and advancing an agenda for action. This initiative has been in the making for over two years, with most of the work of analysis, interviews, and writing happening during the global COVID-19 pandemic. It has taken a world, not just a village, to bring this report to life, with nearly one hundred people involved, from authors to translators and community reviewers.

This digital, community-sourced report brings together contributions in 13 languages, representing 22 language communities in 12 countries from every populated continent of the world. The Numbers section analyzes critical language issues in 39 widely popular digital platforms, apps, and devices that we use in our everyday lives, with deep insights on platforms like Wikipedia and Google Maps. The report also anchors this data analysis in the Stories, which offer us a collage of how different people and communities around the world experience the internet in their own languages.

Our findings demonstrate that the web is nowhere near as multilingual as we imagine or need it to be. Roughly 500 of over 7000 spoken and signed languages are represented online in any form of information or knowledge. Meanwhile, 75% of those who access the internet do so in only ten languages. These languages (such as English, Mandarin Chinese, Spanish, and French) often have a European colonial history or are regionally dominant. Historical and ongoing structures of power and privilege are intrinsic to the way in which languages are accessible (or not) online.

A multilingual launch and celebration

Screenshot of the State of the Internet’s Languages Report launch event.

We launched the State of the Internet’s Languages report on February 23, 2022, in line with the celebrations of UNESCO’s International Decade of Indigenous Languages (2022-2032) and International Mother Language Day.

After over two years of working on this report, we knew that the launch had to honor its multilingual and community-centered processes and principles. We started the conversation by clearly laying out three core beliefs and commitments: love, respect, and solidarity. This set of guiding principles allowed us to be aware of our positionalities and privileges, and be our full, multiple selves during the discussion.

We challenged ourselves to put our multilingual principles into practice – even if that meant pushing any video conferencing software to its limits. We knew that we wanted participants – most of whom did not grow up speaking English first – to speak in their languages of choice wherever possible so that they could be fully themselves as they shared their stories and analysis. Embracing multiple languages allows for the richness and depth of our experiences to be shared more fully and meaningfully.

With that in mind, our moderators — co-directors Adele Godoy Vrana and Anasuya Sengupta, and communications co-lead Claudia Pozo — spoke in Brazilian Portuguese, English, and Spanish. Meanwhile, our panelists expressed themselves in English, Spanish, and Bangla, and we secured simultaneous interpretation in five languages: English, Spanish, Portuguese, Arabic, and Bangla. Over 70 partners, friends, and allies joined the call, while the event was also live-streamed to YouTube.

Telling the Stories

Illustrations for the Stories of the State of the Internet’s Languages Report.

We kicked off the event with a section dedicated to the Stories brought together by the report, moderated by our communications co-lead Claudia Pozo. Our panelists Ana Alonso (Mexico), Paska Darmawan (Indonesia), Claudia Soria (Italy), and Ishan Chakraborty (India), who made their interventions in Spanish, English, and Bangla, explained the obstacles they face when using the internet and shared their hopes for a more multilingual internet. Their interventions stressed the need for a multilingual and multimodal web and its importance for marginalized communities across the globe. Building on her experience with Zapotec, Ana highlighted how the internet falls short in providing space for languages that do not have a standard writing system, which jeopardizes efforts for access, documentation, and preservation.

Speakers questioned what type of content is available in different languages, and what this means in the lived experiences of marginalized communities. As a queer person with a visual disability, Ishan brought into the spotlight the intersections of marginalities in online content – as he also did in his contribution to the report, available in English and Bengali. “It’s my duty to create knowledge in my own language,” he said.

Paska, from Indonesia, shared their experience while searching for LGBTQIA+ content in Bahasa Indonesia. Growing up, they witnessed how numerous texts and pages would present biased, harmful views on LGBTQIA+ people. They would have to turn to English to get answers to their questions in content that embraced their identity. “For those people who aren’t able to speak English, I can only imagine how devastating it must be to try to find an answer to their questions – a content that would affirm their identity – and being unable to.” 

Other stories in our report include the challenges encountered by minority-language speakers in language technology, as well as the political, social, and cultural elements that pose challenges to speakers of Chindali in Northern Malawi. From search engines to social media platforms, contributions showcase how, as researcher Claudia Soria puts it, “technology is never neutral: it is developed by humans and reflects their mindset and culture.”

Some essays provide a glimpse of the existing efforts to reclaim and keep alive indigenous languages across the globe. Projects like Indigemoji draw on a multi-generational effort to create emojis that represent the Arrernte indigenous language, while others document indigenous people’s use of social media platforms, such as Twitter and its hashtags, for the survival and learning of their languages.

Looking at the Numbers

Screenshots of the essays of the Numbers section of the State of the Internet’s Languages Report.

The Numbers and patterns analyzed in the report demonstrate how these inequities of language permeate even the most widely popular digital platforms. In a panel moderated by Anasuya Sengupta, four contributors explored trends and insights obtained from the data: Martin Dittus (Germany/United Kingdom), Puthiya Purayil Sneha (India), Mandana Seyfeddinipur (Iran/Germany), and Hillary Juma (Kenya/United Kingdom).

Martin Dittus walked attendees through the trends identified from a survey of 11 websites, 12 Android apps, and 16 iOS apps. He demonstrated how interface language support across these platforms and apps is limited to particular languages: certain European languages with a colonial history (like English, Spanish and Portuguese), and certain regionally dominant languages (like Arabic and Mandarin Chinese). Some platforms do better than others, with Wikipedia, Google Search, and Facebook supporting more than 100 languages each. Meanwhile, over 90% of African users have to rely on a second language in order to navigate digital platforms. As Martin highlighted, what stood out from the data was the need to “fundamentally reset our expectations”. According to him, “the question isn’t ‘how do we support one or two languages’. The question is, ‘how do we support 7000 languages and make sure the people speaking these 7000 languages are supported in their online experience’”.

These numbers come to life in the platform survey made available in the report, and in the cases analyzed in more detail. Researchers Martin Dittus and Mark Graham explore the language geography of two widely used platforms: Wikipedia and Google Maps. After investigating the inequities within 300 Wikipedia languages, they demonstrated that editions vary greatly in scale, both in terms of number of articles and size of their editor communities. On average, Wikipedia language editions only have a small fraction of the content available in English. However, these discrepancies can’t be fully attributed to the number of speakers in each language. Languages spoken by hundreds of millions of speakers, like Mandarin Chinese, Hindi, Modern Standard Arabic, Bengali, and Indonesian, have much less comprehensive editions on Wikipedia. This inequality is also reflected in Wikipedia’s geographic coverage: many of the world’s places are written about in the English Wikipedia, while the geographic coverage in other language editions is often much more constrained.

Meanwhile, similar patterns appear on Google Maps in the 10 most widely spoken languages in the world, with observations about local content coverage in Kolkata, Dar es Salaam, and Nairobi. Overall, Google Maps remains dominated by English-language content, and coverage in certain major languages is highly constrained to specific geographic regions — like Bengali and Hindi, with maps largely limited to South Asia. The research also indicates that, when search results are not available in a given language, Google seeks to address these gaps by including foreign-language content — which is, most of the time, available in English.

Compared to speakers of major world languages, speakers of less widely spoken ones are not nearly as well supported, and their languages are often entirely unrepresented on Google Maps. By examining the coverage of Zulu and Xhosa in South Africa, and Guaraní in Paraguay, Martin and Mark found that these languages are essentially not represented on Google Maps, despite being spoken by millions. These unequal content distributions partly reflect existing language geographies and population sizes, but also reflect the linguistic properties and social circumstances of languages.

Both stories and numbers pointed to shared issues that weren’t merely linguistic, but political. “This report addresses the scarcity of data on language and makes things visible on the scale that is necessary for political action to take place, showing that the web and [its] tools are dominated by the usual power holders”, said Mandana.

Imagining together

Participants closed the discussion with a reflection on what they would like to see in a multilingual internet, moderated by Adele Godoy Vrana. Interventions included wishes for multilinguality and multimodality online, as well as for the prioritization of the needs of minority language communities. When asked to share the emotions with which they would leave the session, attendees took to the chat and turned their microphones on to speak in different languages: gratitude, inspiration, hope, joy.

Gratitude

Our communities, including our funders, are a fundamental part of this work. They not only shared with us their knowledges, experiences, and imaginations, but also contributed substantially to researching, creating, and reviewing the report’s content in 13 languages. We want to express our endless gratitude to all of them for their trust and allyship. To get a better sense of the wonderful people involved in this process, read our gratitude section or check out our team page.

More to come!

This report is a work in progress and an ongoing research initiative: its launch party was only the first step. The State of the Internet’s Languages website (currently available in English, Bahasa Indonesia, Bangla, Portuguese, Swahili, Spanish, Zapotec, French and Arabic, with Taiwanese Mandarin and International Sign coming soon) is constantly being updated, and we’re open and grateful to translators who’d like to bring it to other languages and to partners who are willing to join and support us.

We’re keeping the conversation going on our social media channels (Twitter, Instagram, and Facebook) under the hashtags #InternetsLanguages, #IdiomasdaInternet, and #LenguasenInternet. Read more about how to engage with us. Revisit the slides of the launch event, watch the recordings of the launch party below in the different languages in which it was available… and don’t forget to share the report widely!















