Thursday, April 03, 2008

Wikinalysis

You all know how I feel about Wikipedia.

Regardless of my suspicion of it as a reliable source of information about anything other than Star Wars minutiae, I have to admit it is a pretty interesting phenomenon. Specifically, I've long been curious about the length of the articles - is there any correlation between word count and anything relevant? Does the "importance" of a topic dictate the number of words on Wikipedia? I think it's a decent metric - writing Wikipedia articles takes time (if not effort or knowledge). Someone really has to care about Wil Wheaton to spend an afternoon composing an online biography of him.

In another sense, it's a tough question to ask. For instance, if asked about the relative importance of, say, evolution vs. American Idol, how is one to judge that? I'd have to lean towards evolution (what with it being more important and all), but you'd be hard pressed to find 150 million Americans that don't believe in the existence of American Idol. The length of the two articles differs only by a scant 957 words. So what's more important?

A further complication is the existence of "subarticles" in Wikipedia. For instance, right at the beginning of the evolution article, there are links to a general article on the topic, and a definition of evolution as theory and fact. The remainder of the article is peppered with such deeper links.

Traditionally, questions such as these have been answered through the Googlefight, a high-tech semantic multiplexing analysis which compares the number of Google search results between the two terms.

Evolution is the winner, by almost 200 million results. Sorry, Reuben.

It's possible to do the same comparison on Wikipedia's internal search engine, which gives the same result. So, we have a nice set of data for comparison. Article length versus search engine results.

My collaborator, Tim "Needles" Morgan, wrote me a nice Ruby script that would take an input phrase and output the following:

The word count of the Wikipedia article
The number of Google search hits
The number of Wikipedia search hits
And, as a bonus, an "aggregate" score that includes the word counts of all the articles returned by the Wikipedia search.

So, if a search for an article returns five hits with ten words apiece, the "aggregate score" would be 50. I picked 30 terms I thought would produce interesting results and ran them through the Ruby script. I also searched for the terms on SciFinder Scholar, a scientific journal and patent indexing service.
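To make the aggregate score concrete, here is a minimal Python sketch. It is not Tim's Ruby script, which isn't reproduced here. The names `word_count` and `aggregate_score` are made up for illustration, and the sketch assumes the text of each search hit has already been fetched as a plain string and that "words" simply means whitespace-separated chunks.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words in an article's text (an assumed definition)."""
    return len(text.split())


def aggregate_score(hit_texts: list[str]) -> int:
    """Add up the word counts of every article returned by a search."""
    return sum(word_count(text) for text in hit_texts)


# Five hits with ten words apiece, matching the example in the text.
hits = ["one two three four five six seven eight nine ten"] * 5
print(aggregate_score(hits))  # 50
```

A real run would also have to decide whether footnotes, tables and markup count as words, which this sketch ignores.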

The raw data is available on request, if you're interested.

The graph of the data was much more interesting than I expected. All the data has been plotted to a common scale.


All three of the search engines seem to be much more "selective," with an extremely sharp drop in results after the top one or two terms. The word count scores, however, tell a very different story. The number of words counted in articles varies much less, indicating that any old useless piece of information on Wikipedia deserves the full attention of the pedants.

Perhaps unsurprisingly, Wikipedia and Google didn't entirely agree on which terms were more important. Here's the "top five" for each search result.

Word Count         | WikiSearch    | Wiki Aggregate    | Google        | SciFinder
Intelligent Design | Evolution     | Leonardo DaVinci  | Evolution     | Esters
Evolution          | Final Fantasy | Oprah Winfrey     | Mozart        | Evolution
American Idol      | Pornography   | Evolution         | Taxonomy      | Quantum Mechanics
Lightsaber Combat  | Taxonomy      | Quantum Mechanics | Pornography   | Organometallics
Oprah Winfrey      | Mozart        | Mozart            | American Idol | Protecting Groups

Things that pop out at me when I look at this:

According to word count, lightsaber combat is only slightly less worthy of dedication than evolution. Fortunately, evolution brings up more pages on the internal search engine, and therefore has a decent aggregate score. I'd really like to know why Oprah Winfrey has more words dedicated to her on Wikipedia than evolution.

Wait, no, I really don't want to know that.

I'm a little surprised at pornography's position on Google, but I imagine that most websites dedicated to that would come up for... er... "different" search terms.

Being a good scientist, I couldn't just stop there. I had selected the search terms myself, based on my ideas of what might be considered important and unimportant. It would be terrible if such a nice graph was just generated because I didn't properly randomize my searching method.

So, Wikipedia did the randomizing for me.

My collaborator whipped up another script, one which would follow Wikipedia's "random article" link, and use the title of that article as the search term for the rest of the data. This allowed me to collect much more data in a shorter amount of time. 2080 different terms, to be precise. This time, visualizing the data produced much less of a trend.
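For the curious, the sampling loop might look something like the Python sketch below. This is a guess at the general shape, not the actual script. `collect_random_terms` and its parameters are hypothetical names, and the function that really follows the "random article" link is left as a pluggable argument so the example runs without touching the network.

```python
from itertools import cycle
from typing import Callable


def collect_random_terms(
    fetch_random_title: Callable[[], str],
    how_many: int,
    max_tries: int = 10_000,
) -> list[str]:
    """Keep asking for random titles until `how_many` distinct ones are collected."""
    seen: set[str] = set()
    terms: list[str] = []
    for _ in range(max_tries):
        if len(terms) >= how_many:
            break
        title = fetch_random_title()
        if title not in seen:  # the random link can repeat itself, so skip duplicates
            seen.add(title)
            terms.append(title)
    return terms


# Stand-in for the real fetch, which would follow Wikipedia's "random article"
# link and return the page title. These titles are placeholders.
fake_titles = cycle(["Page A", "Page A", "Page B", "Page C"])
terms = collect_random_terms(lambda: next(fake_titles), how_many=3)
print(terms)  # ['Page A', 'Page B', 'Page C']
```

Each collected title would then be fed to the same word count and search hit lookups as the hand-picked terms.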

It's clear that Google still displays the highest selectivity. Also worthy of note is the correlation between search results and aggregate score, which is not surprising since more results equals more articles equals more words.
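I only eyeballed that correlation on the graph, but it could be put to a number. Here is a small Python sketch using the Pearson correlation coefficient, which is my choice of measure for illustration and not something computed in the original analysis. The sample numbers are invented and are not from the data set.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / (spread_x * spread_y)


# Invented numbers: Wikipedia search hits and aggregate scores for five terms.
search_hits = [1, 2, 3, 4, 5]
aggregate = [10, 20, 30, 40, 50]
print(pearson(search_hits, aggregate))  # very close to 1.0: perfectly correlated
```

A value near 1 means the two scores rise together, a value near 0 means they have little to do with each other.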

There seem to be two distinct regions on the graph - a region where there's an exponential drop in the term score, and one in which the score remains relatively stable. If we take the high scoring region and plot it alone, the original trend re-emerges!

Ignore "words per article", it isn't supposed to be there.

So that was where the original trend came from - the terms I had selected were all high-scorers, and so only generated this section of the graph.

I know you're dying to know what some of these terms are. Here's a bar chart of the top five scoring terms from each:
This chart represents the normalized scores of each term from each method added together. The taller the bar, the more "important" the term overall.
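Here is roughly what "normalized and added together" could mean, as a Python sketch. I'm assuming each method's scores are divided by that method's largest score, so every method tops out at 1. The post doesn't spell out the exact normalization, so treat that as an assumption. The function names, terms and numbers below are all made up.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale one method's scores so its highest-scoring term gets 1.0 (assumed scheme)."""
    if not scores:
        return {}
    top = max(scores.values())
    if top == 0:
        return {term: 0.0 for term in scores}
    return {term: value / top for term, value in scores.items()}


def combined_scores(methods: dict[str, dict[str, float]]) -> dict[str, float]:
    """Add each term's normalized score across all methods."""
    totals: dict[str, float] = {}
    for scores in methods.values():
        for term, value in normalize(scores).items():
            totals[term] = totals.get(term, 0.0) + value
    return totals


# Invented numbers for two placeholder terms and two methods.
methods = {
    "word_count": {"term A": 100, "term B": 50},
    "google_hits": {"term A": 1000, "term B": 4000},
}
totals = combined_scores(methods)
print(totals)  # {'term A': 1.25, 'term B': 1.5}
```

Dividing by the maximum keeps any one method, such as Google with its huge hit counts, from drowning out the others in the sum.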

The Google scores seem a bit fishy to me. Why would Robert "Tex" William Richards Jr. return more hits on Google than The Destroyer? According to a quick Googling, it doesn't. Three of the five top Google results don't match the collected data, which is puzzling. A randomly selected batch of 20 results from the rest of the Google data comes back good, so why are those ones so far off? Get back to me on that, Tim.

A closer examination of some of the top word count terms reveals some surprises. It's an interesting glance into the sort of thing that people like to write and argue about.

Fujiwara no Teika (10,208 words), the top scorer, is half footnotes. Who writes an article that's half footnotes? What Wikipedian has resisted applying the cleanup tag to this bottom-heavy mess?

Critique of Pure Reason (7,410 words) is about the book by Kant. Apparently, internet people really like to talk about Kant. I met this guy in college who constantly talked about Wittgenstein. If the people who assembled this Kantian monstrosity are anything like him, I may never be able to take Wikipedia seriously.

List of school districts in Illinois (6,826 words). Lists did pretty darn well in terms of length, probably because they're easy. Also high on the list of lists: List of Naruto episodes (Seasons 1-2), List of Serbs, and List of number-one singles in Australia during the 1990s. Love shack, Baby!

Please remember that this is a random cross-section of Wikipedia. Unless my collaborator has a way to scan the entire database for me, I may never know what the longest article is. My guess? List of Wikipedians by number of edits. The only thing more powerful than laziness is ego.

Another category we can examine is what I'd like to call "worthlessness." The Internet contains a lot of garbage, even though it takes time, skill, and a small amount of money to create a web page. Yet it would take anyone virtually no time or skill at all to create a Wikipedia article about their cat. Wikipedians would call that article non-notable; I would call it worthless.

Well, unless it was about my cat, who is awesome.



Since I put more stock in Google than Wikipedia, let's say that a worthless article is one that has a lot of words, but few Google hits. So, what's worthless? Here are some examples that I find funny.

Jim Harris (politician)
The Moment of Truth (Milli Vanilli album)
List of knapsack problems
Linear immunoglobulin A dermatosis
Armenian notables deported from the Ottoman capital in 1915
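The "lots of words, few Google hits" idea can be turned into a ranking. The Python sketch below uses words per Google hit as the score, which is just one plausible formula I'm assuming for illustration (the post only gives the verbal definition). The `+ 1` is there to avoid dividing by zero, and the rows are invented placeholders, not entries from the data.

```python
def worthlessness(words: int, google_hits: int) -> float:
    """Words per Google hit (assumed formula); higher means more 'worthless'."""
    return words / (google_hits + 1)


def most_worthless(
    rows: list[tuple[str, int, int]], top_n: int = 5
) -> list[tuple[str, int, int]]:
    """Sort (title, words, google_hits) rows from most to least worthless."""
    ranked = sorted(rows, key=lambda row: worthlessness(row[1], row[2]), reverse=True)
    return ranked[:top_n]


# Invented rows: (title, word count, Google hits).
rows = [
    ("short and popular", 200, 999_999),
    ("long and obscure", 5000, 99),
    ("long and popular", 5000, 999_999),
]
for title, words, hits in most_worthless(rows):
    print(title, worthlessness(words, hits))
```

With these numbers the long, obscure article comes out on top and the short, popular one lands at the bottom.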

So, what sort of conclusions can be drawn from this?

I came away with three main impressions:

1. The length of a Wikipedia article has nothing to do with anything.
2. Regardless of its use as a primary source, it's full of juicy, fascinating data.
3. Rutherford was right - if your experiment requires statistics, then you should have designed a better experiment.

I realize this study was sort of casual, and not exactly rigorous. So if you'd like to have a look at our data, I'd be happy to send it along. Also, any suggestions about other data to investigate would be welcomed!
