Wikidata: why we contribute to the robot epistemology

18 July 2023

Text reads: "Whose views, values & histories get codified into structured data?"

Since 2003, the Wikimedia Foundation has helmed a series of projects, of which Wikipedia is one. Another project, launched in 2012, is Wikidata — an open, “structured” linked database that aims to numerically translate the knowledge accumulated in places like Wikipedia. An easy way to understand the workings of structured databases is to consider a statement shortened from a Wikipedia page:

“Amos Tutuola¹ was a Nigerian writer, author of The Palm-Wine Drinkard.”

The information in this sentence can be structured by assigning alphanumeric codes to each item or property:

“Amos” – is item Q18216060 under property “given name” P735
“Writer” – is item Q36180 under property “occupation” P106
“Nigeria” – is item Q1033 under property “country of citizenship” P27
“The Palm-Wine Drinkard” – is item Q774053 under property “notable work” P800

Orange bookcover of the Palm-Wine Drinkard with smaller replicated pattern of a beaded palmwine vessel on the right. Structured databases work on alphanumeric codes, assigned to each item in a sentence. Image by Abebooks, via Wikimedia Commons and Nick Ash, Berlin – BrückeMuseumBerlin via Wikimedia Commons.

This kind of structured data is also linkable to other data. For example, the novel “The Palm-Wine Drinkard” is the item Q774053, but this isn’t only an alphanumeric code. It also links to the novel’s page, where the data description of the work is available, and back to the page of its author, Amos Tutuola (Q361562). Wikidata also connects to many external databases, such as the Library of Congress and Open Library.
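To make the idea concrete, here is a minimal sketch of how the statements above could be held in code as subject, property, value triples. It is only an illustration: the names `Statement`, `LABELS`, `STATEMENTS`, `describe`, `linked_from` and `entity_url` are hypothetical, the labels are plain-English stand-ins, and nothing here queries Wikidata itself. The only outside assumption is the address pattern Wikidata uses for its item and property pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One structured claim: subject item, property, value item."""
    subject: str  # item ID, e.g. "Q361562"
    prop: str     # property ID, e.g. "P800"
    value: str    # item ID, e.g. "Q774053"


# Human-readable labels for the codes used in this article (illustrative only).
LABELS = {
    "Q361562": "Amos Tutuola",
    "Q18216060": "Amos",
    "Q36180": "writer",
    "Q1033": "Nigeria",
    "Q774053": "The Palm-Wine Drinkard",
    "P735": "given name",
    "P106": "occupation",
    "P27": "country of citizenship",
    "P800": "notable work",
}

# The example sentence, rewritten as four machine-readable statements.
STATEMENTS = [
    Statement("Q361562", "P735", "Q18216060"),
    Statement("Q361562", "P106", "Q36180"),
    Statement("Q361562", "P27", "Q1033"),
    Statement("Q361562", "P800", "Q774053"),
]


def describe(statement: Statement) -> str:
    """Turn a statement's codes back into readable labels."""
    return " -> ".join(
        LABELS[code]
        for code in (statement.subject, statement.prop, statement.value)
    )


def linked_from(item_id: str) -> list[Statement]:
    """Find statements whose value is the given item (the 'link back')."""
    return [s for s in STATEMENTS if s.value == item_id]


def entity_url(entity_id: str) -> str:
    """Build the page address for an item (Q...) or a property (P...)."""
    prefix = "Property:" if entity_id.startswith("P") else ""
    return f"https://www.wikidata.org/wiki/{prefix}{entity_id}"


if __name__ == "__main__":
    for s in STATEMENTS:
        print(describe(s))
    print(entity_url("Q774053"))
```

Calling `linked_from("Q774053")` returns the one statement that points at the novel, which is how a machine can walk from the book back to its author without reading any prose.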

This kind of reconstruction of text and its interlinking to other data allows information to become machine-readable. In fact, this is the crucial difference between Wikipedia and Wikidata: one is meant for human learning² and the other largely for machine learning.

Wikidata can have incredible positive value. Consider that the article on Amos Tutuola already exists on Wikipedia in Yoruba, English, and Spanish, among others, but not in Hindi. Bots reading Wikidata’s numerical semantics could potentially reconstruct an article on Amos Tutuola in Hindi quickly. This holds huge potential for linguistic expansion on Wikipedia and is being rapidly explored³. Wikidata has also found uses among libraries, journalists, wiki researchers, and many others, allowing them to widen the landscapes of possibilities in their fields.

However, we must be cautious of overoptimism in this work. In translating knowledge into structured data in this manner, we predictably carry over all the systemic bias issues of the original databases, for example from Wikipedia to Wikidata. This can create a feedback loop of biases from Wikidata to other Wiki projects, and then well beyond the Wiki ecosystem. The issues of vandalism, restrictive examinations of “notability” or “reliability of sources”, representation gaps among editors and administrators, inaccuracies, and biases are not resolved by moving to Wikidata. They are just translated into new semantics.

Wikipedia’s overrepresentation of northern-centered, Anglophone, and Eurocentric editorship and content is well known. Even as the Wikimedia Foundation continues to extend support for representation to people from marginalized identities, a 2021 study finds that even the gender gap hasn’t been bridged. On Wikidata, too, these representation gaps remain a worry. Only 22% of the content in Wikidata is recorded to be about women.

People from Global Majority countries, too, are battling several barriers to participation in these commons-based peer production spaces. If they do get a seat at the table, many communities find that they have to force-fit their knowledges into uncompromising Western epistemic models like Wikipedia. How do people “reliably source” their knowledge when the top 10 publishing conglomerates of the global north account for more than half of all the revenue generated by publishing worldwide⁴? Reflecting such biases, only 0.3% of sources used on Wikipedia are actually African in origin⁵.

To its credit, the Wikimedia Foundation has been putting considerable and admirable effort into diversifying. Among these efforts is the launch of a global Knowledge Equity Fund⁶ ⁷, which was founded after the killing of George Floyd in 2020 to commit to processes of racial equity and help offset inequity in knowledge production. And such efforts are paying off. The diversity of content is increasing slowly year by year. But these efforts are still a slow work in progress and seem to require continuous pushing by Wikimedians and knowledge justice advocates to keep them going.

Source: Wikipedia Diversity Observatory, 2020

Even as gap-closing efforts take place, there is fear that structured data projects like Wikidata are codifying the views, values, and histories of hegemonic communities⁸ into artificial intelligence, establishing them as truths in the robot epistemology, and likely trapping people of the Global Majority in laborious knowledge battles that will extend into the eras to come.

At this juncture, it’s crucial for us all to ask what or whom Wikidata is ultimately benefiting, and how. Many of us are Wikimedians and Wikidata contributors (including this author). We are often doing this work because we believe in the broad idea of open and collective knowledge-making. So it’s absolutely crucial to ask what exactly we are contributing our labor and time to. One study⁹ records that, unlike Wikipedia contributors, over 70% of the Wikidata contributors in the study did not have an understanding of what their contributions were being used for. In the words of one of the contributors, they felt they were “pumping the information into an empty void”. Others report similar experiences. Mariana Fossatti, a long-term feminist Wikimedian and coordinator of the Decolonizing Wikimedia campaign, says she does ask herself, “Why am I doing this? Even when it’s fascinating to do it. I don’t want to play with gadgets, I would like to see Wikidata-based tools with purpose.”

In our Wikidata effort, it seems that we are distanced from the outcomes of our labor. We already know that data hosted on Wikidata has been crucial in training machine learning chatbots like ChatGPT, as well as language-based assistants like Amazon’s Alexa and Google Assistant¹⁰. Wikimedia knowledge is produced under open access, but when the unpaid labor of thousands of Wikimedians then goes to big tech companies to make a questionable profit, we must ask ourselves: did we sign up for this? Never mind the recklessness with which these corporations are imposing systems of “AI” and automation onto vulnerable populations. Why should we contribute free labor to produce open work, only to have corporations subsume the work and close the knowledge gate behind them using their own restrictive copyrights?

In the study cited above¹¹, some contributors did not care how their contributions were being used. But many contributors from marginalized communities and from the Global Majority do care, and care profoundly. If you think about it, Global Majority communities are much more resource-limited (often as an outcome of several historical injustices like colonization, racialization, caste, economic or gender oppressions). Broadly speaking, people of the Global Majority have less time to contribute towards online unpaid labor. This has even been witnessed in Wikidata initiatives like Reimagining Wikidata from the Margins¹². When we do have the time to spare, I believe we aren’t necessarily thinking about how it will assist the profit agendas of multinational companies. Most of us do this because we want it to reflect the lives, histories, and cultures of our people and communities and fortify our realities on a platform that aims to collectively archive the knowledges of the world (but often overlooks/erases ours).

Instead, we are again left with a dilemma. If we don’t engage with these peer-production knowledge platforms, our histories, ideas, and knowings get left behind and the erasure continues. When we do, it profits faraway multinationals with clandestine workings, bottomless pockets, and unchecked power over peoples’ lives and livelihoods.

At this moment, then, it is absolutely crucial to sit with these uncomfortable truths and problematize our own involvement in the commodification of our knowledges. Why do we contribute to this robot epistemology?

In order to begin some conversations among Wikimedians of the Global Majority, I propose some questions:

Is free and open knowledge really free, and if it is – who is it free for? Who is it open to?
Is an absolute open knowledge culture really the best approach when contextualized to a world full of historical and contemporary inequities?
Do we want to be data inputters or knowledge producers?
As inputters/knowledge-makers, is our labor unvalued/undervalued?
Can we collectively own and have control of the outcomes of our labor, of our knowledges?
If we can’t un-design the flaws in these models not-of-our making, should we design our own?
How, and with which resources, do we build the data projects that really make sense for us?

Whose Knowledge? has already begun some conversations on these fronts and in the coming months, we hope to be able to further explore and report on more of these questions collectively.

Notes

¹ The last name Tutuola has not yet been added to Wikidata as an item, so as of writing this article, this sentence remains incompletely structured.

² “In terms of human learning, Wikidata offers possibilities too, like the possibility to work on queries and visualizations that provide insights about a wide variety of topics for being explored, including about open knowledge itself, its biases and gaps.” (Mariana Fossatti, personal communication, June 14, 2023).

³ Kaffee, L.-A., et al. (2018). Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders. In: The Semantic Web. ESWC 2018. Lecture Notes in Computer Science, vol 10843. Springer, Cham. https://doi.org/10.1007/978-3-319-93417-4_21

⁴ https://www.publishersweekly.com/binary-data/Global502019.pdf

⁵ Graham, Mark, et al. “Uneven geographies of user-generated information: Patterns of increasing informational poverty.” Annals of the Association of American Geographers 104.4 (2014): 746-764.

⁶ https://wikimediafoundation.org/news/2021/09/08/wikimedia-foundation-announces-first-grant-recipients-of-new-4-5-million-equity-fund/

⁷ The author is a committee member at this fund

⁸ https://www.theatlantic.com/technology/archive/2012/04/the-problem-with-wikidata/255564/

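⁹ Zhang, Charles Chuankai, et al. “Working for the Invisible Machines or Pumping Information into an Empty Void? An Exploration of Wikidata Contributors’ Motivations.” Proceedings of the ACM on Human-Computer Interaction 6.CSCW1 (2022): 1-21.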

¹⁰ Google in fact is one of the big funders of the Wikidata project after merging their database Freebase with Wikidata in 2014.

¹¹ Zhang, Charles Chuankai, et al. “Working for the Invisible Machines or Pumping Information into an Empty Void? An Exploration of Wikidata Contributors’ Motivations.” Proceedings of the ACM on Human-Computer Interaction 6.CSCW1 (2022): 1-21.

¹² At WikidataCon 2021, Érica Azzellini, a member of Reimagining Wikidata from the Margins, noted: “(On a side note), lack of time was a major constraint for participation at any level, including the participation in the Reimagining Wikidata from the Margins conversations”. https://www.youtube.com/live/wn2BrQomvFU?feature=share&t=8139

Author Profile

Maari Maitreyi

Maari Maitreyi (she/they) is the Knowledge Justice Researcher on the Whose Knowledge? team. She is a feminist, artist and scholar interested in digital knowledge-making cultures.