User:Facing the Sky/Wikipedia is not done

From Wikipedia, the free encyclopedia

There's an idea in some parts of the English Wikipedia community that Wikipedia is getting near to 'done' and that there aren't many articles left to start. This view sees Wikipedia as being just short of the sum of all human knowledge, missing only a few bits and bobs here and there and whichever celebrity is the flavour of next week. Soon, it predicts, article creation rates will drop, and this will not be due to endogenous community factors, but instead it will be a natural progression of the encyclopaedic process. 'Peak article' is going to herald a shift from creating new articles to improving existing ones.

This view is wrong. Wikipedia is not done. In fact, the work needed to create a truly universal encyclopaedia is only just beginning. Wikipedia contains more articles, more topics covered and more words written than the nearest points of comparison—the Encyclopaedia Britannica, the Great Soviet Encyclopaedia, the Enciclopedia Espasa, the Enciclopedia Italiana—but those works together have only covered a tiny fraction of what could be written, a tiny fraction of what readers will want to know about, and a tiny fraction of the sum of all human knowledge. Wikipedia is doing better than any other effort to create a universal encyclopaedia, but there's still a huge amount to do.

The Wikipedia globe

Article creation rate, with a Gompertz function extrapolation. Note that the model heads to zero but the actual data has diverged from the prediction.

As I'm writing this in April 2014, global population is c. 7,000,000,000 and the American population is c. 300,000,000. We know that American topics are over-represented in the English Wikipedia; the question is, how many articles could we have if we had coverage of the rest of the world that was as good as our American coverage? I'm going to look at BLPs to quantify that.

The English Wikipedia is closing in on 4,500,000 mainspace articles. ~800-1000 articles are getting created each day, which also happens to be the average rate since 2001. Category:Living people has c. 650,000 articles in it. There are c. 230,000 living American people with articles. There are c. 80,000 articles written about people who died in the 21st century. Category:Possibly living people and Category:Missing people (incl. subcategories) together are c. 4000 articles. Some elementary calculations:

  • For each biography of a person who may have been alive when the article was written (c. 734,000 in all, adding up the categories above), there are ~6 articles total.
  • 0.08% of living Americans have articles. If this number was reflected worldwide, there would be c. 5,600,000 BLPs.
  • Hence, if the whole world had the same coverage as America has, there could be c. 33,600,000 articles total.
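The arithmetic behind these three bullets can be sketched in a few lines of Python. This is only an illustration using the figures quoted above; the function and variable names are made up for the sketch, and the `round_like_text` switch is an assumption about how the rounding was done (ratio rounded to 6, coverage rounded to 0.08%).

```python
# Figures quoted in the text (April 2014). All names are illustrative.
TOTAL_ARTICLES = 4_500_000
LIVING_PEOPLE = 650_000               # Category:Living people
DIED_21ST_CENTURY = 80_000
POSSIBLY_LIVING_OR_MISSING = 4_000
LIVING_AMERICANS_WITH_ARTICLES = 230_000
US_POPULATION = 300_000_000
WORLD_POPULATION = 7_000_000_000


def blp_extrapolation(round_like_text=True):
    """Return (articles per recent biography, worldwide BLPs, total articles)."""
    recent_bios = LIVING_PEOPLE + DIED_21ST_CENTURY + POSSIBLY_LIVING_OR_MISSING
    articles_per_bio = TOTAL_ARTICLES / recent_bios
    us_coverage = LIVING_AMERICANS_WITH_ARTICLES / US_POPULATION
    if round_like_text:
        # Assumption: the text rounds to "~6" and "0.08%" before multiplying.
        articles_per_bio = round(articles_per_bio)
        us_coverage = round(us_coverage, 4)
    world_blps = us_coverage * WORLD_POPULATION
    return articles_per_bio, world_blps, world_blps * articles_per_bio


for rounded in (True, False):
    ratio, blps, total = blp_extrapolation(rounded)
    print(f"rounded={rounded}: ratio={ratio:.2f}, BLPs={blps:,.0f}, total={total:,.0f}")
```

With the rounding used in the text this gives 5,600,000 BLPs and 33,600,000 articles. Without rounding, the total comes out nearer 32.9 million, so the estimate is not very sensitive to it.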

This is a big number. It's probably something of an over-estimate:

  • The proportion of topics that were done and dusted before the founding of Wikipedia ('historical') or have very little time-sensitivity (e.g., physical geography, biological taxa; 'non-temporal') will fall as the 'backlog' of history and non-temporal topics is cleared, so we might expect the BLP ratio of new articles to rise. (It would be nice to get some idea of how many historical articles Wikipedia has for each period, but getting that data looks Hard. Perhaps a first guess could count up people who lived in each era by going through Category:Births by year.)
  • Modern America, and the West generally, are very well documented compared to the rest of the world, creating a 'reliable sources gap': a non-Western topic may not have good enough sources to meet WP:N or WP:V when a comparable Western topic wouldn't have problems.
  • Many articles about important topics from under-represented areas have already been written, even if the ancillary coverage remains poor. In under-represented areas, the fraction of BLPs yet to be written out of all articles yet to be written may be much higher than the overall fraction of BLPs out of articles.

At the same time, it's not necessarily an unreachable upper bound:

  • Many notable people do notable things, some people who aren't notable do notable things, and then, as Harold Macmillan definitely didn't say, there are "notable events, dear boy, notable events". We should expect the fraction of new articles that are BLPs to stay well short of 100%. I would be shocked if it even gets above 50%.
  • There are present-day topics which only come to be notable in hindsight (e.g., an article about the Willendorf Venus might well have been deleted as 'NN sculpture, never exhibited, no significant mention in prestigious cave paintings' in the Cro-Magnon Wikipedia—it only gained notability in the thousands of years since). The future Wikipedia will include many current things that aren't notable now, but it will also include everything that's notable today.
  • The global population is skewed young. Many young people who will become notable haven't become notable yet. This number is potentially large compared to the total number of humans to have lived. (Exponential growth is fun!)
  • I don't think there's saturation in historical topics. My current hobby is writing about 19th/early 20th-century Austrian art, particularly early members of the Vienna Secession and people around Jugendstil in general, and let me tell you that there are an awful lot of gaps for a recent, well-documented, European subject.
  • I'm not even sure there's saturation in current Western topics. Women? Ethnic minorities? Non-Anglophone Europe? Just generally obscurer subjects?
  • Better documentation and more potential verifiability will follow economic growth, and current trends are for enormous medium-term growth across the developing world. There's no reason to suppose that a majority of the world won't have the same density of reliable sources in 2064 as the West has in 2014. The 'reliable sources gap' will become less of a problem. That's not to say it will close: the West might get even more documented in the meantime, but if that happens the global notability rate will go up anyway.

The English Wikipedia might not even be half-way to the goal. 4,500,000 articles might not have been sufficient to encompass human knowledge in 1911 when the world population was c. 1,750,000,000, ~25% of today's.

'Peak article' and the sum of all human knowledge

The extended growth model from 2009 predicts article creation rates going to zero, but did not factor in the vast size of world population, nor that new to-be-notable people are being born.

I think it's pretty reasonable to assume that total human knowledge is proportional to the total number of humans. This argument could be phrased in terms of man-hours: some proportion of human effort is spent on creating knowledge, which we might expect to be constant through history. Between the year 1000 and April 2014, about 2,400,000,000,000 human-years (2.4 Th·y) were lived, give or take a few thousand billion.[1] On a day in April 2014, about 20 Mh·y will be lived. Hence, if knowledge is proportional to human effort, we would expect a 0.001% increase in the number of notable topics each day, or about 45 articles a day from a starting point of 4,500,000 articles. However, this assumes that Wikipedia has already encompassed everything that was worth knowing up to April 2014, which is very unlikely.
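As a rough check on these numbers, here is a small Python sketch of the man-hours argument. It assumes the figures given in the text (7 billion people, 2.4 Th·y lived since the year 1000, 4,500,000 articles); the names and the 365.25-day year are choices made for the sketch.

```python
WORLD_POPULATION = 7_000_000_000
HUMAN_YEARS_SINCE_1000 = 2.4e12   # 2.4 Th·y, as quoted in the text
CURRENT_ARTICLES = 4_500_000
DAYS_PER_YEAR = 365.25            # assumption: average calendar year


def daily_growth_fraction():
    """Fraction by which cumulative human-years grow in one day."""
    human_years_per_day = WORLD_POPULATION / DAYS_PER_YEAR
    return human_years_per_day / HUMAN_YEARS_SINCE_1000


def new_topics_per_day(growth_fraction):
    """New notable topics per day if knowledge tracks human-years lived."""
    return growth_fraction * CURRENT_ARTICLES


exact = daily_growth_fraction()
print(f"human-years per day: {WORLD_POPULATION / DAYS_PER_YEAR:,.0f}")
print(f"daily growth: {exact:.5%}")
print(f"topics/day (unrounded growth): {new_topics_per_day(exact):.0f}")
print(f"topics/day (0.001% as in the text): {new_topics_per_day(0.00001):.0f}")
```

The unrounded growth rate is closer to 0.0008% a day, or about 36 articles. The figure of 45 comes from rounding that rate up to 0.001%.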

Between 2001 and 2014, around 90 Gh·y were lived (roughly 13 years at 6 to 7 billion people, consistent with the 20 Mh·y a day above). That is, since Wikipedia's founding we expect total human knowledge to have increased by about 4%, so we would expect about 96% of topics on Wikipedia to concern matters from before 2001 if Wikipedia accurately reflected total human knowledge, but this is far from the case. It's also clear that Wikipedia is currently growing much faster than we would expect if it was just tracking increases in knowledge: 800 new articles a day implies a total size of 80,000,000 articles (which is, again, unrealistic because it assumes none of the new articles are covering historical knowledge).

In 2012, there were about 134,000,000 births. Let's assume that world population reaches equilibrium at that level of births (if the average lifespan was 80 years, that implies a global population of a little over 10 billion). That's ~367,000 births per day. Through their lifetimes, 0.08% of those will become notable and have an article written about them: 293 new BLPs a day. Assuming that 17% (1/6) of new articles are BLPs, that's ~1760 articles a day. If we assume that 50% (1/2) of new articles are BLPs, that's ~585 articles a day. If we combine this estimation with the above calculation that human knowledge is currently growing at about 0.001% per day, we can estimate the current potential size of the sum of human knowledge as being between 29,000,000 articles (if all new articles are BLPs, i.e. 293 articles a day) and 176,000,000 articles (if 1/6 of new articles are BLPs).
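The chain from births to article counts can be laid out as a short Python sketch. It uses only the figures in the text (134,000,000 births a year, 0.08% notability, 0.001% daily knowledge growth); the helper names are hypothetical and exist only for this illustration.

```python
BIRTHS_PER_YEAR = 134_000_000      # 2012 figure from the text
LIFESPAN_YEARS = 80
NOTABILITY_RATE = 0.0008           # 0.08%, the American BLP coverage
DAILY_KNOWLEDGE_GROWTH = 0.00001   # 0.001% per day, as rounded in the text


def blps_per_day():
    """New eventually-notable people born per day at equilibrium."""
    return BIRTHS_PER_YEAR / 365 * NOTABILITY_RATE


def articles_per_day(blp_share):
    """Total new articles per day if `blp_share` of new articles are BLPs."""
    return blps_per_day() / blp_share


def implied_total(rate_per_day):
    """Corpus size for which `rate_per_day` amounts to 0.001% daily growth."""
    return rate_per_day / DAILY_KNOWLEDGE_GROWTH


print(f"equilibrium population: {BIRTHS_PER_YEAR * LIFESPAN_YEARS:,}")
for share in (1 / 6, 1 / 2, 1):
    rate = articles_per_day(share)
    print(f"BLP share {share:.0%}: {rate:,.0f} articles/day, "
          f"implied total {implied_total(rate):,.0f}")
```

The results land within rounding of the figures in the text. Applying the same `implied_total` helper to an observed 800 articles a day gives 80,000,000.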

All of these putative totals are over-estimates for another reason: they're ignoring lost knowledge. I'm not even going to try and guess at what fraction of knowledge has been lost, but estimates of the total number of Wikipedia articles in this section should be scaled down by whatever your personal guess is.

Whatever way you look at it, the English Wikipedia has a lot of growing to do if it's going to reach the goal of being the sum of all human knowledge. Wikipedia hasn't necessarily hit 'peak article'—human knowledge is going to be refilling the 'well' of topics to be written, and the current rate of 'drilling', or perhaps slightly less, could be sustainable in the long term. 'Peak history' might have come and gone already, but 'peak current events' may still be coming, depending on global demographic trends—and it will be 'long plateau current events', not 'peak current events'.

A reference work

The Allgemeines Künstlerlexikon (AKL) is a German-language encyclopaedia of visual artists, taken broadly. It is the reference work in this field. The publisher's information page claims that it includes 500,000 articles on artists, and adds 3,500 each year. Wikipedia has at least 144,000 biographies of visual artists. (Catscan seems to choke on Category:Artists so I picked Category:Artists by nationality to try and still get a full count.) Let's double that to c. 288,000, under the assumption that Wikipedia's idea of what a 'visual artist' is is narrower than the AKL's and that there are errors in Wikipedia categorisation. More elementary calculations:

  • ~6% of articles on Wikipedia are biographies of visual artists.
  • So, if Wikipedia had the same coverage across the board as the AKL has of visual artists, Wikipedia would have c. 8,300,000 articles.
  • Further, Wikipedia would be adding ~160 articles overall each day if it was growing at the same rate as the AKL.
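These three figures follow from a simple scaling, sketched below in Python. The inputs are the numbers quoted above (500,000 AKL articles, 3,500 added a year, 144,000 Wikipedia artist biographies doubled, 4,500,000 articles overall); `akl_scaling` is a hypothetical helper, and the comparison between the rounded 6% and the unrounded share is added here only to show how much the rounding matters.

```python
AKL_ARTICLES = 500_000
AKL_NEW_PER_YEAR = 3_500
WIKI_ARTIST_BIOS = 2 * 144_000     # the doubled count from the text
WIKI_TOTAL = 4_500_000


def akl_scaling(artist_share):
    """Scale the AKL's size and growth up to a whole encyclopaedia."""
    implied_total = AKL_ARTICLES / artist_share
    implied_per_day = AKL_NEW_PER_YEAR / artist_share / 365
    return implied_total, implied_per_day


exact_share = WIKI_ARTIST_BIOS / WIKI_TOTAL
for label, share in (("rounded 6%", 0.06), ("unrounded", exact_share)):
    total, per_day = akl_scaling(share)
    print(f"{label}: share={share:.1%}, total={total:,.0f}, per day={per_day:.0f}")
```

Using the rounded 6% reproduces the c. 8,300,000 articles and ~160 articles a day given in the bullets. The unrounded share of 6.4% gives about 7,800,000 articles and about 150 a day, which does not change the comparison.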

This is rather lower than the above estimate of 33 million, despite also being rather larger than the current article count. Perhaps visual artists are very well represented in Wikipedia compared to other subject areas, despite missing many articles present in the AKL. Additionally, the AKL is going to be subject to many of the same biases as Wikipedia as a Western effort—perhaps it's more that Western visual artists are already well represented on Wikipedia, and a comparison falls victim to non-Western under-representation in the AKL too. Perhaps I'm not optimistic enough about the quality of Wikipedia's categorisation.

(If anyone knows how quickly Britannica Online is growing, I'd love to hear it and to make some extrapolations. The website is quite tight-lipped about the total amount of content—understandable, since they're competing with Wikipedia on quality and not quantity.)

Notes

  1. ^ Data from world population estimates. An exponential model was fitted to the data, 1000–present. A rough guess at the number of human-years lived before 1000 AD is on the order of 1 Th·y, but a lower total produces more conservative estimates of the potential number of articles.
  • Throughout, please read 'notable knowledge' for 'knowledge', 'notable topics' for 'topics', etc, as necessary.

See also

  • User:Emijrp/All human knowledge (2014) – estimates the size of "all human knowledge" to be at least 96,000,000 articles. Takes an inside, top-to-bottom view, estimating the total article count in a topic area by estimating the count in each subtopic.
  • User:Piotrus/Wikipedia interwiki and specialized knowledge test (2006, revised 2013) – estimates total potential size to be 50,000,000 articles. Takes an outside view, randomly sampling articles and counting which are present and missing in other reference works, including other language editions of Wikipedia.