Limits of Wikipedia


Dear reader,

I want to write a new article for the Trollheaven blog, but I'm unsure which topic is the right one. Can you help?

If the result of the poll is unclear, because two favorites have the same number of votes or because no one has pressed the button at all, a peaceful random generator will decide.

Update 2018-10-16
The poll is over. “Limits of Wikipedia” has won. Here is the article.

Wikipedia’s next door

Sometimes Wikipedia is described as the standard on the internet and the best encyclopedia in the world. This description is right if Wikipedia is compared with other mainstream encyclopedias like Brockhaus or the Encyclopedia Britannica. But if we focus on scientific knowledge, Wikipedia is not very advanced. To describe the problem in detail, I've found a simple but effective way to demonstrate the limits of today's Wikipedia.

The search engine Google provides a tab called Shopping for searching commercially available products. If we enter the keyword “Encyclopedia” into the box and sort the list with the most expensive products first, we find a huge list of commercially available encyclopedias which are not Wikipedia. All of them are created by academic publishing companies like Elsevier, Springer and Wiley. They are not general-purpose encyclopedias but specialized in a scientific topic. Here are some examples:

– Encyclopedia of Language and Linguistics, Elsevier, 9000 pages, 10000 US$
– Encyclopedia of Complexity and Systems Science, 10000 pages, 9000 US$
– Encyclopedia of Evolutionary Psychological Science, 7000 pages, 5000 US$
– Encyclopedia of Nanotechnology, 2900 pages, 1500 US$

The list is much larger than only these 4 books; I would guess that at least 200 different encyclopedias are available. Each of them is large and expensive. The reason why they are sold to libraries is that they are better than Wikipedia: they contain more keywords, and the descriptions are more accurate. It is simply not true that Wikipedia is the best encyclopedia in the world. It is only the cheapest one, and the quality is not very high.

Explaining the difference between Wikipedia and the Elsevier/Springer encyclopedias is easy: the traffic of the keywords is different. Some of these keywords are also part of Wikipedia, but they have very low usage statistics. That means they are not mainstream topics which generate 1000 hits per day; it is possible that such keywords get only 2 visits per day. Wikipedia has only a few of these low-traffic keywords available. If somebody needs such specialized information, he has to buy a Springer encyclopedia.
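
How uneven this traffic is can be checked directly. Below is a minimal sketch, assuming Python with the requests library and the public Wikimedia Pageviews REST API; the two article titles and the date range are only illustrative examples, not a real study.

    import requests

    # URL template of the Wikimedia Pageviews REST API (daily counts per article).
    API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
           "per-article/en.wikipedia/all-access/user/{title}/daily/{start}/{end}")

    def average_daily_views(title, start="20181001", end="20181031"):
        """Mean daily views of one English Wikipedia article in the given range."""
        url = API.format(title=title.replace(" ", "_"), start=start, end=end)
        items = requests.get(url, headers={"User-Agent": "traffic-sketch"}).json()["items"]
        return sum(item["views"] for item in items) / len(items)

    # Compare a mainstream lemma with a more specialized one.
    for lemma in ["Artificial neural network", "Tolerance relation"]:
        print(lemma, "->", round(average_daily_views(lemma)), "views/day")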

Let us estimate how expensive this would be: 200 encyclopedias at 8000 US$ each is 1.6 million US$. A huge price, but it makes sense to invest this amount. Most institute libraries in the world have done so, because they need the knowledge. They cannot switch to Wikipedia, because Wikipedia doesn't provide this specialized knowledge.

Bringing Wikipedia to the researchers

The Wikipedia encyclopedia is recognized as a mainstream encyclopedia which provides knowledge about the latest Star Wars movie, the Harry Potter books and reggae musicians. It is accepted by a non-scientific audience as a reference for getting information quickly without consulting websites that contain a lot of advertising. The internal quality control of Wikipedia works great and prevents spam and biased information from being injected into the encyclopedia.

A researcher in a biology lab has two options: he can ask Wikipedia for help, or he can search in a Springer encyclopedia. The Springer version provides much higher quality. I stress this fact because, right now, Wikipedia has only replaced general-purpose encyclopedias like the Encyclopedia Britannica, but not the specialized versions written by scientists. If we compare, on an objective basis, the quality of Wikipedia with a Springer encyclopedia on a certain topic, we notice that Wikipedia is weaker: in most cases the lemma has no entry, and if it is available in Wikipedia, the article is too short. That is why Springer can sell its encyclopedias for thousands of US$: Wikipedia is not able to provide the needed information.

I do not know how to solve this issue, but I can give a measurement of whether it is solved or not: if Wikipedia has better content than a Springer encyclopedia, the issue is solved. To determine the progress, it is necessary to compare both sources: on the left we open the article in Wikipedia, on the right we open the article in a Springer handbook. The difference is that a specialized version explains every detail of a subject. The audience is not the whole world, but a researcher who is interested in a concrete subject and has a lot of background knowledge. This kind of audience is not happy with today's Wikipedia. The problem with Wikipedia is that it only provides general knowledge and has many missing topics in scientific subdisciplines.

To overcome the problem, it is necessary to create Wikipedia articles with a low number of visitors. These are specialized entries which are relevant for no more than 100 people worldwide and will generate only 1-2 visits per day. Such subjects are not very attractive for Wikipedia authors, because an article that is not read by the public looks useless.

The good news is that the overall structure of Wikipedia doesn't have to change. Specialized articles can be handled like any other article; the workflow of creating and evaluating the content stays the same. The only new thing is that this kind of article will generate an ultra-low amount of traffic, which makes it seem too specialized for a general-purpose encyclopedia. But in the end it will help to increase the acceptance of Wikipedia in the research community.

Let us examine some examples from the Springer “Encyclopedia of Algorithms”. None of the following lemmas is available in Wikipedia:

– Analyzing cache misses
– Approximate Dictionaries
– Approximate Regular Expression Matching
– Approximation schemas for bin packing

The reason is that these entries are very specialized: apart from computer scientists, nobody will use these terms. But all of them are available in the Springer encyclopedia, and this is why the Springer version is found in an institute library while Wikipedia is not.
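
Whether a given lemma exists in Wikipedia can even be tested automatically. Here is a minimal sketch, assuming Python with the requests library and the standard MediaWiki query API; the lemma list is simply copied from above.

    import requests

    LEMMAS = [
        "Analyzing cache misses",
        "Approximate Dictionaries",
        "Approximate Regular Expression Matching",
        "Approximation schemas for bin packing",
    ]

    # One query can check several titles at once; pages that do not
    # exist carry a "missing" flag in the reply.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "titles": "|".join(LEMMAS), "format": "json"},
    ).json()

    for page in resp["query"]["pages"].values():
        status = "missing" if "missing" in page else "exists"
        print(f"{page['title']}: {status}")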

What do these lemmas have in common? They are three-word lemmas. That means the question is not what “approximation” means (this is explained in Wikipedia very well); the question is what a certain short phrase means. Wikipedia has only a handful of two-word and three-word lemmas in its database. For example, “Approximation error”, “Newton’s method” and “Tolerance relation” are all explained very well in Wikipedia. But there are many more lemmas which are more specialized and don’t have an article right now.

What Wikipedia can learn from Springer

Springer has a unique standing among researchers. The company is perceived as close to their problems; a Springer book fulfills the needs of a researcher. What is the secret? The secret behind every Springer book is that it focuses on one detailed problem. A handbook about nanotechnology covers only this topic, but describes it in depth. And the Springer encyclopedias are domain-specific encyclopedias too. They are not written for a broad audience but for experts in the field.

Is it possible to transfer this concept into the Wikipedia ecosystem? Yes, it is possible, but it is hard. The main problem is that today’s Wikipedia authors are not experts in their field but have general knowledge. They have much in common with the generalist librarians of a public library, who know a bit about every subject but nothing in detail. In contrast, the Springer encyclopedias are written by experts who bring in strong background knowledge. This makes the content so relevant for the readers.

Wikipedia has tried in the past to become more important to researchers, but failed. It was not possible to motivate working researchers to contribute content. Instead, Wikipedia’s strength lies in topics of general interest, for example movies, sports and politics. Nearly every aspect of everyday life is covered in Wikipedia, but that is not enough for a scientific encyclopedia. The future vision is to enrich Wikipedia with more specialized information which goes deep into a subject.

I think Wikipedia can’t learn anything from classical encyclopedias like the Encyclopedia Britannica or Brockhaus. Both are dead today. But Wikipedia can learn a lot from Springer. The people there know more about creating an encyclopedia than the authors and admins at Wikipedia, and they are experts in the specialized knowledge that is taught at universities.

On the other hand, Springer can learn something from Wikipedia, and this is how to reach a huge audience. Wikipedia holds the #1 rank in Google and is read by millions of people. Springer doesn’t have this kind of traffic. A normal Wikipedia article has around 100 visits a day, which adds up to roughly 36,500 visits in one year. Wikipedia is a mass medium, while Springer is a specialized medium. If Springer wants to sell more books, it needs Wikipedia; and if Wikipedia wants to get high-quality content, it will need Springer.

Springer Link

Let us take a look at what the commercial publisher Springer has to offer. In the section “reference works”, encyclopedias and handbooks are listed. An encyclopedia is, similar to Wikipedia, an alphabetically ordered list of articles, while a handbook contains overview articles which are much longer. Each subject, like mathematics, engineering and physics, has a huge number of Springer reference works. It is possible to view sample chapters, but full-text access is restricted to paying users. This principle is well known under the term paywall.

What is unique about the Springer encyclopedias? They usually contain very complicated and specialized subjects, for example these:

– Adaptive Control for Linear Time-Invariant Systems
– Boundary Control of 1-D Hyperbolic Systems
– Dynamic Noncooperative Games
– Information-Based Multi-Agent Systems

None of these keywords is available in Wikipedia. If a researcher needs them, he has to buy the Springer book. What they have in common is that they sound complicated and consist of more than a single word: they are lemma titles of three or even four words. That means each is a specialized entry for a specialized audience.

And this is the main difference between a mainstream encyclopedia like Wikipedia, which is read by the general public, and an academic encyclopedia from Springer Link, which is read by researchers.

What the researchers have done over the last 10 years is build their own version of Wikipedia, protected behind a paywall. That means researchers within universities read and contribute to the Springer encyclopedias, but not to Wikipedia. Wikipedia, in contrast, is written by journalists, bloggers and amateurs, while the Springer encyclopedias are written by real researchers with deep knowledge of their fields.

Springer Link statistics

The Springer Link website consists of 24 categories like Biomedicine, Chemistry and Computer Science. Each category has around 50 different encyclopedias on offer, listed in the reference-works section. The total number of scientific encyclopedias from Springer Link is therefore 24×50=1200. Each encyclopedia costs around 4000 US$ and provides around 4000 pages, so the total number of printed pages is 1200×4000=4.8 million. Elsevier, a Springer competitor, also has many encyclopedias on offer. They are listed on the ScienceDirect website, and the price tag is similar: a book with 2000 pages costs around 2000 US$.

A size comparison with Wikipedia is possible. The printed Wikipedia has 7473 volumes with 700 pages each (https://en.wikipedia.org/wiki/Print_Wikipedia), which amounts to 5.2 million pages, while the Springer encyclopedias together contain the above-mentioned 4.8 million pages.
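
For transparency, here is the back-of-the-envelope arithmetic from the two paragraphs above as a short Python sketch; all inputs are the rough estimates from the text, not exact figures.

    # Springer Link estimates (from the text above)
    categories = 24               # subject categories on Springer Link
    works_per_category = 50       # reference works per category (rough guess)
    pages_per_work = 4000         # pages per encyclopedia (rough guess)

    springer_works = categories * works_per_category        # 1200
    springer_pages = springer_works * pages_per_work        # 4,800,000

    # "Print Wikipedia" art project
    wikipedia_volumes = 7473
    pages_per_volume = 700
    wikipedia_pages = wikipedia_volumes * pages_per_volume  # 5,231,100

    print(f"Springer reference works: {springer_works}")
    print(f"Springer pages:  {springer_pages:,}")
    print(f"Wikipedia pages: {wikipedia_pages:,}")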

Wikipedia vs. academic encyclopedias

Wikipedia’s strength is that the encyclopedia is cheap and covers mainstream topics. Its weakness is that specialized lemmas from scientific fields are missing. The commercial encyclopedias from Elsevier and Springer have the opposite profile: they are expensive, but they provide specialized academic topics, and the content is created by experts.

Having fun with Wikipedia

In the early days of the famous encyclopedia, it was easy to vandalize the project. Vandalizing means destroying something, ranting against the admins and making clear who the boss is. The best-practice method was to search for a high-traffic lemma, for example “Artificial neural network”, delete all the content and press the save button. Now Wikipedia is shut down, and the world sees nothing when it needs information about the topic.

After 30 minutes or so, some admin is alarmed because we have deleted his work, and he is completely irritated. The admin doesn’t know what has happened to his encyclopedia and must first consult the manual to roll the information back to a previous state. During this time, Wikipedia is offline and we have won.

Unfortunately, times have changed. Modern admins are prepared for this kind of vandalism. They are better informed about how to use the MediaWiki system, and in the worst case they will block the attacker completely, which is a bad situation if we want to vandalize Wikipedia a bit more. What can we do if the aim is to have a bit of fun with the admins?

What a good vandal does is upgrade his tools. Instead of simply clearing an article, the better idea is to produce a nonsense article. A nonsense article has the advantage that automatic spam protection is not able to recognize it, and sometimes it takes weeks until an admin detects the problem manually. The best way to create a nonsense article for Wikipedia is the SCIgen generator, https://en.wikipedia.org/wiki/SCIgen It was invented with the aim of fooling an academic journal, but it also works for Wikipedia.

The first step is to visit the SCIgen website and press “generate new paper”. Then the document has to be converted into wiki syntax. If everything looks fine, it can be uploaded to Wikipedia. The advantage over normal vandalism is that at first glance the article looks like a real one. Wikipedia’s automatic incoming filter, which checks all new content, will not be alarmed: it is normal text, contains no plagiarism and provides references to other academic papers. To recognize the problem, somebody must read it in detail, but this is never done. Most admins are in a hurry, because each day around 700 articles are created from scratch. So our nonsense article can stay in the encyclopedia, and we had a lot of fun during the break.

The confusing unification of 16 bit architectures and Internet QoS is an extensive riddle. In fact, few cryptographers would disagree with the analysis of thin clients. We disprove that the little-known authenticated algorithm for the evaluation of thin clients by J. Smith is maximally efficient.<ref name="arun2003" />

==Introduction==
In recent years, much research has been devoted to the structured unification of I/O automata and Smalltalk; nevertheless, few have constructed the deployment of consistent hashing. The notion that scholars agree with journaling file systems is generally adamantly opposed.<ref name="arun2003" /> The notion that analysts interfere with interposable symmetries is always bad. This is essential to the success of our work. Therefore, Moore's Law and reinforcement learning agree in order to realize the development of IPv7.<ref name="brown2004" />

We next turn to all four experiments, shown in Figure 4. Of course, all sensitive data was anonymized during our bioware emulation. Note how emulating Markov models rather than emulating them in hardware produce more jagged, more reproducible results. Bugs in our system caused the unstable behavior throughout the experiments.<ref name="white2000" />

==References==
<references>
<ref name="white2000">
{{cite journal
| title = ArgivePlexus: Multimodal, introspective communication
| author = N. White and J. Hennessy
| journal = Journal of Flexible, Stable Random Methodologies
| volume = 9
| pages = 1--11
| year = 2000
}}
</ref>

<ref name="brown2004">
{{cite journal
| title = A case for Voice-over-IP
| author = K. Brown, C. Miller, S. Cook, and R. Stearns
| journal = Journal of Semantic, Authenticated, Modular Configurations
| volume = 4
| pages = 20--24
| year = 2004
}}
</ref>

<ref name="arun2003">
{{cite journal
| title = A case for context-free grammar
| author = M. Arun and C. Maruyama
| journal = Journal of Classical Algorithms
| volume = 737
| pages = 70--90
| year = 2003
}}
</ref>

</references>

Sometimes a Wikipedia article with a high amount of traffic is blocked by default. But that is no problem, because many others can be edited freely. Here is the list of most-visited lemmas: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Computer_science/Popular_pages For example, the topic “Support vector machine” has over 2000 views per day, but it is not protected, so it is the ideal starting point to drop some nonsense. If the aim is for the SCIgen content to stay longer in Wikipedia, it is a good idea to search for a low-traffic lemma. That one is not observed carefully, and we can make edits without being interrupted by the admins.

3 thoughts on “Limits of Wikipedia”

  1. I’ll use a different nick for a change, so it’s not always the same person commenting. :-)

    It should also be mentioned that the German Wikipedia is a propaganda machine when it comes to political topics, and that inconvenient things are gladly left out or actively deleted, so that they can only be found in the English-language version; and that in the German-language Wikipedia quite a few political articles contain opinions (of course exclusively mainstream ones), which have no place in an encyclopedia anyway, i.e. an opinion of the author(s) or the reviewers, whether mainstream or not. This goes as far as the defamation of unpopular persons. As an aid for forming one’s own political opinion, the German Wikipedia is completely useless, unless you want to make a career in like-minded circles and avoid standing out with an “unsuitable opinion” – for that, the German-language Wikipedia could certainly still provide support.

    (there is also a part 2 of this)

    etc.
    OK, and to look up a politician’s star sign and the like, the German Wikipedia can USUALLY still be used (though I have already seen entries in which even the birth date of a prominent person was missing). :-)

    On the completely overpriced encyclopedias: 10000 $ for 9000 pages… that borders on usury. (I was once interested in a data CD that was supposed to cost 30000 euros – which would have blown my budget a “little” :-( so I dropped the idea of buying it.)
    It looks as if the useful, current and accurate information is supposed to stay within certain circles (corporations etc.).
    I would still pay a thousand for a really good 9000-page work, and if it is really, really super-duper good, maybe 2k – that should be expensive enough. 1 $/euro per page or more seems to me wildly excessive, even perverse. Sure, a fat corporation or some club/institute/whatever crowd sponsored by who-knows-whom can afford such a thing – and that is probably the crux of the matter.
    This somehow reminds me of how the big software companies once wanted to put a stop to open source because they saw their revenues shrinking. Meanwhile open source is accepted, but when it comes to open data, very many still put up fierce resistance…

  2. By the way, it can happen – I have noticed this before – that links/videos in a comment are sometimes not displayed.
    So if it looks as if something is missing, reload the page.

    • The comments can be written in markdown syntax. Links to external websites work too. Example:

      Hello world

      The brown fox jumps over the lazy dog.

      Sourcecode

      #include <iostream>
      #include <random>

      int main() {
          // Print one random dice roll to show that code renders in comments.
          std::mt19937 gen{std::random_device{}()};
          std::uniform_int_distribution<int> dist(1, 6);
          std::cout << "dice: " << dist(gen) << '\n';
          return 0;
      }
      
