Discovering Wikipedia edits made by institutions, companies and government agencies
--
TLDR: I built a tool to search Wikipedia edits made by organizations, companies, and government agencies. Click here if you want to see it live, or keep reading for a brief explanation of how it works and how I implemented it.
A couple of months ago, I had the idea of analyzing Wikipedia edits to discover which public institutions, companies, or government agencies were contributing to Wikipedia, and what they were editing.
After a quick Google search I realized that it had been done before, but the service, called WikiScanner, had been discontinued in 2007. The idea surfaced again several years later: in 2014 the @congressedits Twitter account was created, which automatically tweeted any Wikipedia edit made from IP addresses belonging to the U.S. Congress. The account was eventually suspended by Twitter (read why here). The code for this bot was released under a CC0 license on GitHub, and several other bots were created from it, each watching for edits from a different organization.
At the time of writing, some of these bots are still active on Twitter (e.g., @parliamentedits), but I was interested in building something that lets users search and navigate edits efficiently (rather than just following a stream of tweets), and that is not limited to monitoring a single organization. I decided to dedicate some of my free time to this side project and build Wikiwho.
Methodology
I decided to use the same approach that WikiScanner used, which is to identify organizations based on their IP addresses. This is possible because, when somebody edits Wikipedia without being logged in, their IP address is recorded in place of a username. It follows that this system cannot identify edits made by logged-in users; those edits are simply discarded during a pre-processing step.
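To make the idea concrete, here is a minimal sketch of that matching step in Python. It is not the actual Wikiwho code: the organization names, the `ORG_RANGES` table, and the `organization_for` helper are all made up for illustration, and the address ranges are reserved documentation ranges rather than real data. It assumes each log entry exposes an "editor" value that is either an IP address (anonymous edit) or a username (logged-in edit).

```python
import ipaddress

# Hypothetical mapping from organization to the IP ranges it owns.
# These are reserved documentation ranges, not real organizations.
ORG_RANGES = {
    "Example Agency": ["192.0.2.0/24"],
    "Example Corp": ["198.51.100.0/25", "2001:db8::/32"],
}

# Parse every range once so lookups do not re-parse the CIDR strings.
NETWORKS = [
    (ipaddress.ip_network(cidr), org)
    for org, cidrs in ORG_RANGES.items()
    for cidr in cidrs
]


def parse_ip(editor):
    """Return an IP address object, or None if the editor is a username."""
    try:
        return ipaddress.ip_address(editor)
    except ValueError:
        return None


def organization_for(editor):
    """Return the organization owning the editor's IP, or None.

    Logged-in edits (usernames) and IPs outside every known range
    both yield None, so callers can simply discard them.
    """
    ip = parse_ip(editor)
    if ip is None:
        return None
    for network, org in NETWORKS:
        # Only compare addresses and networks of the same IP version.
        if ip.version == network.version and ip in network:
            return org
    return None
```

With this table, `organization_for("192.0.2.17")` returns "Example Agency", while a username such as "Alice" returns None and would be dropped in pre-processing. A linear scan is fine for a handful of ranges; a larger table would probably call for a sorted or tree-based lookup.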
Each log entry contains metadata about the edit (e.g., time, IP address, edited page), but does not provide any information on what was actually edited. This metadata alone already allows some analysis: which organizations are the most active, which pages have been edited the most and by whom, which periods of time saw the most activity, and so on.
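As a rough illustration of this kind of metadata-only analysis, the sketch below counts edits per organization, per page and organization, and per month. The record layout (the "timestamp", "org", and "page" keys), the sample entries, and the function names are assumptions made for the example, not the real log format.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, already pre-processed log entries (invented sample data).
EDITS = [
    {"timestamp": "2014-07-08T14:03:00", "org": "Example Agency", "page": "Page A"},
    {"timestamp": "2014-07-21T09:30:00", "org": "Example Agency", "page": "Page A"},
    {"timestamp": "2014-08-02T17:45:00", "org": "Example Agency", "page": "Page B"},
    {"timestamp": "2014-08-15T11:10:00", "org": "Example Corp", "page": "Page A"},
]


def edits_per_org(edits):
    """Count how many edits each organization made."""
    return Counter(edit["org"] for edit in edits)


def edits_per_page_and_org(edits):
    """Count edits for each (page, organization) pair."""
    return Counter((edit["page"], edit["org"]) for edit in edits)


def edits_per_month(edits):
    """Count edits per calendar month, keyed as 'YYYY-MM'."""
    return Counter(
        datetime.fromisoformat(edit["timestamp"]).strftime("%Y-%m")
        for edit in edits
    )
```

Calling `edits_per_org(EDITS).most_common(1)` on the sample data gives the most active organization, and the other two counters answer the "which pages, by whom" and "which periods" questions in the same way.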