Filling the World’s Wikipedia Gaps

by Mark Hay

October 27, 2015
Image by Johann Dréo via Flickr

Last month, a South African Redditor going by the handle lovethebacon took to the site’s r/southafrica forum to share a weird experience he had while surfing Wikipedia recently. He noticed that the Zulu-language page for Nkandla, a town of 3,557 people in KwaZulu-Natal Province, South Africa’s second-largest (and fairly well-developed) province, ended with the following phrase: “Nazo isintandane ziningi lengculazi. Iyidolobha impofu.” Roughly translated, lovethebacon explained, this means: “Orphans [here] have HIV. [This is the] capital of the poor.”

That’s a broad statement, both uncouth and untrue, so it’s understandable that the Wikipedia entry would raise a hackle or two. But given the size of this crowdsourced, philosophically anarchic digital encyclopedia, we in the West are accustomed to the notion that we’ll come across a stinker or two while browsing around. The site itself even acknowledges this, cautioning that there are only so many airtight, authoritative articles in its database. Many of us believe that once we point out offensive blips and glitches, dutiful editors will come along and fix them.

Yet in the case of the Zulu Wikipedia and many others, that belief may be unfounded. Not only are non-English Wikipedias smaller on average, but they also tend to have fewer editors, meaning they run a greater risk of perpetuating questionable information within a society, a situation that doesn’t seem likely to change anytime soon. Fortunately, a few new auto-translation apps are coming to the fore to tackle this entrenched transcultural Wiki disparity.

Before detailing these auto-translation initiatives, let’s establish just how needed they are by outlining the contours of Wikipedia’s intellectual disparity: We’ve long known that Wikipedia’s coverage, even in English, is uneven. As of 2013, the site’s editors were 90 percent male, which translated into a landslide of in-depth articles about female porn stars and a comparative dearth of information about sub-Saharan African cultures and cities—like Nkandla. 

As of August 2015, there were 291 Wikipedias, 280 of which were active, together comprising almost 36 million articles. The English Wikipedia accounts for 13.8 percent of that content, while seven European languages and two spoken in the Philippines (Swedish, German, Dutch, French, Waray-Waray, Russian, Cebuano, Italian, and Spanish, in that order) account for 37.5 percent of the remaining articles. When researchers at the University of Oxford visualized the content created in various languages in 2011 and 2012, they found that smaller Wikipedias tended to have much more selective information, often just creating stub articles using simple templates and well-established information from other languages, meaning the smaller your Wikipedia is, the narrower the information it presents is likely to be.

Fortunately, while some major languages have very small sites, they have fairly high editing ratios, meaning that what information they do present is usually better vetted, more in-depth, and better considered than something like the article that lovethebacon stumbled across. (Native speakers of Arabic, Bengali, Hindi-Urdu, Mandarin Chinese, Portuguese, and Punjabi account for about a third of the global population, but the corresponding Wikipedias make up only about 6.5 percent of all the encyclopedia’s entries.) But even well-edited articles can present drastically different levels of information from one language to another. (Spanish speakers, for example, have inexplicably more information on cats available to them than English speakers.) And the sites for many languages are both tiny and poorly edited: 55 languages, including Zulu, have no administrators, and over a dozen more have an administrator but no active content creators. For people in those regions who don’t know English, Wikipedia won’t do much good. And even if they do know English, coverage biases will still limit the usefulness of more developed sites.

It’s tempting to say that these people can just get their information elsewhere. But that’s just not true these days. Wikipedia is the seventh-most-used website in the world. Its success killed off Microsoft’s Encarta, which drew from professionally vetted encyclopedias, in 2009, and has driven the comparatively tiny Encyclopedia Britannica’s 120,000 articles behind a paywall to survive. In the quest for free information, search tools like Google and Siri automatically source their information around the world from Wikipedia. And organizations providing free internet in the developing world, like Facebook, often limit unfettered access to a few sites dubbed essential to or extremely useful for modern life, like Wikipedia. It’s an unavoidable resource, and if your version of it sucks, you can see a cascade effect in the making.

To some, this disparity might not seem worrying; it might seem like you can discount it as the side effect of a head start for older sites. Once a Wikipedia takes off, it seriously takes off: The English-language Wikipedia, founded in 2001, had just 750,000 entries in 2005, but a decade later has grown to just under 5 million. Meanwhile, in 2005 just over 100 of the almost 300 extant Wikipedias had been established—and many are growing in less connected regions of the world, amongst languages with fewer native and second-language speakers than English. As time goes on, one would hope that proportional advances will be seen in other languages—especially as internet connectivity and access to information resources continue to skyrocket.

Unfortunately, the hope that the disparity in wiki-offerings will right itself is misplaced. Thanks to a cumbersome self-created bureaucracy and a culture of inter-editor sniping, in which established voices make it hell for newbies who make minor errors or can’t keep up with the culture and lingo of Wikipedia (among other problems), fewer and fewer people worldwide are becoming contributors, editors, or administrators. Between 2007 and 2013, the global volunteer base actually shrank by a third.

Given the loose structure of the utopian Wikipedia, the tiny Wikimedia Foundation behind the sites has only so many tools to address this editorial decline. And in the past, existing editors have been somewhat hostile to major overhauls that might welcome more diverse and robust volunteer pools. To date, they’ve focused on addressing knowledge disparity and contributor decline by making minor tweaks in the editing software, making it simpler to ease into the process slowly, and adding encouraging thank-yous for the efforts. But that’s hardly a guaranteed way to plug the massive knowledge gaps developing across cultures in a timely or even reliable manner.

The whole situation can feel a little futile—a depressing reaffirmation of entrenched inequalities born out of what was supposed to be an accessible, egalitarian, and idealistic site. But fortunately, a few researchers have started to devise solutions to overcome these knowledge disparities. The answer, they say, is to mine complementary data from across all the world’s Wikipedias and then to translate that information back to your native language site, thus attaining the online encyclopedia’s egalitarian ideal.

This seems like it would be a hard project to take on. But we’re already well on our way to making such data mining and translation possible. In 2012, researchers at Northwestern University developed a system known as Omnipedia (possibly named after a device in the Stargate or Star Trek universe), capable of culling, comparing, and automatically translating data from 25 different Wikipedia language editions simultaneously, presenting them in simplified form. Omnipedia is still under development, and looks a little tricky to use. But around the same time, Italian researchers developed Manypedia, a much simpler and already publicly accessible tool that can automatically translate two Wikipedia articles side by side and point out incongruous information between them—or just translate an existing article into a different language.

The ability to cull data from one Wikipedia and transfer it to another won’t solve all of the sites’ knowledge disparity issues. Between regional biases and the anemia of some Wikipedias, there are certain subjects that just don’t have enough data from which to cull (for example, there are more entries in the world on Antarctica than on any nation in Africa or South America). But these initiatives still go a long way toward plugging the most glaring gaps. Plus, they can benefit those of us with access to the robust English-language Wikipedia, pointing out differences in coverage between our culture and others, especially on subjects we find controversial and other societies do not. That’s a kind of depth and awareness that most of us never manage to get even in our information-dense environment. At the very least, if these initiatives gain a bit of traction, they can start a serious conversation about continued shortcomings and differences between Wikipedias, driving us toward more systematic changes and tactics that can fill the world’s glaring content gaps once and for all.
