A large-scale study of edit patterns in Wikipedia and its applications to vandalism detection
Abstract
In recent years, Web 2.0 applications such as Wikipedia have transformed the landscape of the World Wide Web by elevating end users from passive consumers of information to active participants in content creation, organization, and propagation. Wikipedia is a free online encyclopedia that any user can edit with minimal restrictions. Recent studies indicate that a large fraction of Internet users rely on Wikipedia for their information needs, so it is important to ensure the quality and accuracy of the information shared there. However, the same open-edit model that enables this participation also makes Wikipedia susceptible to various kinds of vandalism.
In this thesis, we perform a large-scale study of the edit patterns of Wikipedia articles. The goal of this study is to identify metadata characteristics that can help distinguish high-quality edits from potential vandalism. Our study is distinctive in several respects. First, we trace the edit history of Wikipedia articles and study the stability of articles, their growth over time, and the nature of the users who perform the edits. Second, we study the spatial distribution of the origins of the edits. Third, we study the commonality of content and the commonality of users among Wikipedia articles. Through this study, we show that several contextual attributes of edits, such as co-occurrence probabilities of words, the registration status of edit contributors, and the geographical region from which an edit originates, are strong indicators for distinguishing vandalism from legitimate edits.
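To make the notion of word co-occurrence probabilities concrete, the following is a minimal sketch of one way such a feature could be computed. It is an illustration under stated assumptions, not the procedure used in the thesis: the abstract does not say how the probabilities are estimated, so the document-level estimate of P(b | a), the whitespace tokenization, and the names `cooccurrence_counts`, `cooccurrence_probability`, and `edit_score` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count the documents containing each word and each unordered word pair."""
    word_counts, pair_counts = Counter(), Counter()
    for doc in documents:
        # Assumption: lowercase whitespace tokenization; each word counts once per document.
        words = sorted(set(doc.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    return word_counts, pair_counts


def cooccurrence_probability(a, b, word_counts, pair_counts):
    """Estimate P(b | a): the fraction of documents containing a that also contain b."""
    if word_counts[a] == 0:
        return 0.0
    pair = tuple(sorted((a, b)))  # pairs are stored in sorted order
    return pair_counts[pair] / word_counts[a]


def edit_score(inserted, context, word_counts, pair_counts):
    """Average P(inserted word | context word) over all distinct word pairs.

    A low score suggests the inserted words rarely appear alongside the
    article's existing vocabulary in the reference corpus.
    """
    scores = [
        cooccurrence_probability(c, w, word_counts, pair_counts)
        for w in inserted
        for c in context
        if w != c
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, an edit whose inserted words seldom co-occur with the words already in the article receives a lower score than an edit that uses related vocabulary. In a real detector, such a score would be only one signal, combined with the other attributes named above, such as contributor registration status and region of origin.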