While doing preliminary work for a project that uses Wikipedia as a dataset to classify languages, I noticed some oddities in the size of certain language editions of Wikipedia. I collect some of my observations here.
The Wikimedia Foundation maintains a list of all Wikipedia languages with some important stats. There are 296 editions of Wikipedia, of which 285 have more than 100 articles (the remaining 11 are almost entirely empty). Since I wanted to estimate the amount of training data that would be available for each language, I started by looking at the number of articles for each project. As expected, the English Wikipedia is the largest, followed by the Swedish, Cebuano, German, Dutch and French editions.
The reason why Cebuano (a language from the Philippines that is spoken by ~20 million people) and Swedish (which has even fewer speakers, but is from a much richer area) rank so high in this list is Lsjbot, a bot initially developed for the Swedish Wikipedia that automatically creates pages based on databases or other language Wikipedias. More than half of the Swedish edition, and the vast majority of pages in the Cebuano edition, were generated by this bot (try it yourself: go to a random page on the Swedish or Cebuano Wikipedia and look for the Lsjbot header).
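Many people have noticed this before, and alternative methods of ranking Wikipedia projects have been developed. One officially supported metric is the Wikipedia article depth, defined as

$$\text{Depth} = \frac{\text{Edits}}{\text{Articles}} \times \frac{\text{Total} - \text{Articles}}{\text{Articles}} \times \left(1 - \frac{\text{Articles}}{\text{Total}}\right),$$

where Total is the number of pages of all kinds (articles and non-article pages).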
The depth tries to take into account the number of edits (which will be low for bot-generated content, since most pages are created and never touched again, and high if the articles are often revised), as well as the number of non-article pages. It seems to work well on the bot-generated Wikipedia projects: Lsjbot was launched in 2012 on the Swedish Wikipedia, whose depth has dropped from 50 on January 1st, 2012 to 5 today. The Cebuano wiki has a depth of 1, since it only has about 3 edits per article. The English Wikipedia has the very high depth of 947, with other Wikipedias having normal depths in the 100-200 zone.
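To make the behaviour of the depth concrete, here is a minimal sketch that computes it from three counts. It assumes the Meta-Wiki definition (edits per article, times non-article pages per article, times one minus the share of articles among all pages). The function name and the counts below are made up for illustration and are not real statistics of any edition.

```python
def article_depth(edits: int, articles: int, total_pages: int) -> float:
    """Depth = (edits/articles) * (non_articles/articles) * (1 - articles/total)."""
    if articles <= 0 or total_pages <= 0:
        # An empty wiki has no meaningful depth; return 0 rather than divide by zero.
        return 0.0
    non_articles = total_pages - articles
    article_share = articles / total_pages
    return (edits / articles) * (non_articles / articles) * (1 - article_share)


# Hypothetical counts: a wiki whose pages are created once and rarely touched,
# and a wiki of the same size with many edits and many non-article pages.
bot_heavy = article_depth(edits=3_000_000, articles=1_000_000, total_pages=1_500_000)
well_edited = article_depth(edits=50_000_000, articles=1_000_000, total_pages=5_000_000)
print(round(bot_heavy, 2), round(well_edited, 2))
```

With the same number of articles, the second wiki ends up several hundred times deeper, which is the effect the metric is designed to capture.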
I call depth-weighted articles the number of articles multiplied by the depth. I have also looked at the size of the pages archive compressed with bzip2 (the one you can get at https://dumps.wikimedia.org/<lang>wiki/latest/<lang>wiki-latest-pages-articles-multistream.xml.bz2). It contains both articles and non-article pages, but gives a good idea of the actual information content of the wiki, since very similar or template-based pages compress very well. I am also including the number of active users, which Wikimedia defines as registered users who have performed an action in the last 30 days.
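The archive size can be read without downloading the file. The sketch below builds the dump URL from the pattern above and asks the server for the size with an HTTP HEAD request. The helper names are hypothetical, and two things are assumptions on my part: that the server reports a `Content-Length` header for these files, and that language codes containing hyphens use underscores in the dump directory name.

```python
import urllib.request
from typing import Optional

DUMP_URL = (
    "https://dumps.wikimedia.org/{db}wiki/latest/"
    "{db}wiki-latest-pages-articles-multistream.xml.bz2"
)


def dump_url(lang: str) -> str:
    """Return the URL of the latest compressed pages archive for a language code."""
    # Assumption: hyphenated codes are stored with underscores on the dump server.
    return DUMP_URL.format(db=lang.replace("-", "_"))


def dump_size_bytes(lang: str, timeout: float = 30.0) -> Optional[int]:
    """Return the archive size in bytes, or None if the server does not report it."""
    request = urllib.request.Request(dump_url(lang), method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None


print(dump_url("ceb"))
```

Calling `dump_size_bytes("ceb")` performs a network request, so it is left out of the example run.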
Here are the $r^2$ coefficients between a number of metrics:
| Articles | Total | Edits | Admins | Users | Active Users | Weighted articles | Size |
I have not shown raw depth in this table, but it correlates very weakly with all of the other values. Of note are the facts that the number of depth-weighted articles correlates highly with the number of users, and that the number of users correlates highly with the number of edits. This suggests that depth can be estimated well enough from the numbers of users and articles in a Wikipedia, and that it gives little additional information.
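For reference, pairwise $r^2$ values of this kind can be computed as the square of the Pearson correlation between two columns of per-edition statistics. The sketch below is one way to do it with the standard library only. The metric values are made up for illustration, not the real figures, and whether the raw values or their logarithms are the better input is left open.

```python
from itertools import combinations


def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return cov * cov / (var_x * var_y)


# Hypothetical per-edition metrics (one entry per Wikipedia edition).
metrics = {
    "articles": [5_000_000, 3_000_000, 2_000_000, 400_000, 50_000],
    "edits": [800_000_000, 40_000_000, 150_000_000, 20_000_000, 1_000_000],
    "users": [30_000_000, 500_000, 2_500_000, 300_000, 20_000],
}

for a, b in combinations(metrics, 2):
    print(f"{a} vs {b}: r^2 = {r_squared(metrics[a], metrics[b]):.3f}")
```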
Another possibly relevant metric is the size and distribution of the vocabulary of the Wikipedias, since I expect bot-generated articles to be highly redundant.
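One simple way to look at vocabulary redundancy is to count tokens and distinct word types and take their ratio. The sketch below is only an illustration of that idea: the function name is hypothetical, the sample texts are invented, and splitting on `\w+` is a crude tokenizer that will not work for languages written without spaces. The ratio also depends on text length, so comparisons should use samples of similar size.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+")


def vocabulary_stats(text: str) -> dict:
    """Count tokens and distinct lowercase word types, and return their ratio."""
    tokens = [token.lower() for token in TOKEN.findall(text)]
    counts = Counter(tokens)
    n_tokens = len(tokens)
    return {
        "tokens": n_tokens,
        "types": len(counts),
        "type_token_ratio": len(counts) / n_tokens if n_tokens else 0.0,
    }


# Invented samples: a templated stub repeated many times, and ordinary prose.
templated = "Foo is a village in the Bar region. " * 50
prose = (
    "The river rises in the northern hills, crosses a wide plain, "
    "and reaches the sea near an old fishing port."
)
print(vocabulary_stats(templated))
print(vocabulary_stats(prose))
```

The templated sample has hundreds of tokens but only a handful of distinct types, which is the kind of signature I would expect from bot-generated pages.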
It is tempting to believe that two major factors that determine the scope of a Wikipedia project are the number of speakers of its language and their wealth (since poorer individuals will probably have less of the time, academic knowledge, and Internet access required to make significant contributions to Wikipedia).
The Unicode Consortium has rather precise information on language GDP. This may not be the right metric: a few million rich speakers possibly have more time, knowledge, and desire to contribute to their Wikipedia than a hundred million poor speakers. Additionally, since many countries are multilingual, it’s very difficult to account for the differences in GDP or development level between language communities.
Another paper, by M. Rask in 2007, studied two main indicators, the Human Development Index and the level of Internet penetration, on a sample of 11 Wikipedia editions.