Wikipedia Data Stream

July 2, 2020

Streaming data is an important part of modern data processing. If you are just starting out, and perhaps don’t yet work somewhere with access to big data streaming infrastructure, it can be hard to know where to begin. This post walks you through a simple Wikipedia data stream example from the Wikimedia documentation.

Wikipedia Data Stream

A little-known feature of Wikipedia is that it publishes a stream of recent changes. The Wikipedia data stream is actually just one part of a larger stream of real-time edits to sites in the Wikimedia group.

We can use this stream as a simple, freely available data stream to practise on, but it has also been used for very cool projects like Listen To Wikipedia, which lets you listen to edits as they happen!

Streaming Data Basics

Working with a live data stream quickly brings home challenges that static data just doesn’t have: the data keeps arriving, and you have to handle each message as it comes. The Wikipedia data stream is a nice example stream for playing around with some of the basics of working with stream data.

For this example to work you will need to install the sseclient library (for example, with pip install sseclient) for handling server-sent events. It lets you iterate over messages sent by the server.

from sseclient import SSEClient as EventSource

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
# EventSource yields one event object per message sent by the server
for event in EventSource(url):
    print("Do something with the event")

Capture a stream

Using the Python example on the Wikimedia page, it is quite easy to start capturing the stream.

At the time of writing (check the documentation in case it has changed), the returned message format is this:

event: message
id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"timestamp":1532031066001},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}]
data: {"event": "data", "is": "here"}
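
To make the three fields concrete, here is a minimal sketch that splits a raw message like the one above into its parts by hand. The parse_sse_message helper is my own illustration, not part of sseclient, and it only covers the simple field layout shown here; in practice sseclient does this work for you.

```python
import json

def parse_sse_message(raw):
    """Split one raw server-sent event into a dict of its fields."""
    fields = {}
    for line in raw.splitlines():
        if not line or line.startswith(':'):
            continue  # blank lines and comment lines carry no field
        name, _, value = line.partition(':')
        # the format allows one optional space after the colon
        if value.startswith(' '):
            value = value[1:]
        if name == 'data' and 'data' in fields:
            # data split over several lines is joined with newlines
            fields['data'] += '\n' + value
        else:
            fields[name] = value
    return fields

# The sample message from above, written out as raw text
raw_message = (
    'event: message\n'
    'id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"timestamp":1532031066001},'
    '{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}]\n'
    'data: {"event": "data", "is": "here"}\n'
)

fields = parse_sse_message(raw_message)
payload = json.loads(fields['data'])    # the data field holds a JSON document
positions = json.loads(fields['id'])    # the id field holds a JSON list

print(fields['event'])
print(payload)
print([p['topic'] for p in positions])
```

Note that both the id and the data fields are JSON strings, so each can be decoded with json.loads once it has been separated out.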

We can print the first five messages that arrive after connecting like this:

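from sseclient import SSEClient as EventSource

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
event_count = 1
for event in EventSource(url):
    print("Stream message: " + str(event_count))
    print(event.event)  # the event type, e.g. "message"
    print(event.data)   # the payload, normally a JSON string
    event_count += 1
    # stop after five messages; otherwise the loop keeps reading the stream
    if event_count > 5:
        break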

If you run this (once you’ve installed sseclient), you should see five messages printed as they arrive.
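
The counter-and-break pattern can also be written with itertools.islice from the standard library, which takes a fixed number of items from an iterator that may never end. This is only a sketch: fake_stream and take are hypothetical names of my own, and fake_stream stands in for the real stream so the example runs without a network connection.

```python
from itertools import count, islice

def fake_stream():
    """Stand-in for EventSource(url): yields messages forever."""
    for n in count(1):
        yield 'message {}'.format(n)

def take(stream, limit):
    """Return the first `limit` items from a possibly endless iterator."""
    return list(islice(stream, limit))

first_five = take(fake_stream(), 5)
for number, message in enumerate(first_five, start=1):
    print("Stream message: " + str(number))
    print(message)
```

Since EventSource(url) is iterated with a plain for loop above, passing it to take in place of fake_stream() should work the same way, though I have only sketched it against the stand-in here.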

Parsing streamed data

We can now extend to the full example in the Wikimedia documentation. I’ve modified it below to stop after five matching messages.

This full example parses each JSON message as it arrives and, for changes to one wiki (here commonswiki, Wikimedia Commons), prints the user and the page title.

import json
from sseclient import SSEClient as EventSource

url = 'https://stream.wikimedia.org/v2/stream/recentchange'

wiki = 'commonswiki'  # only report changes from this wiki
event_count = 1
for event in EventSource(url):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            # skip messages whose data is not valid JSON
            continue
        if change['wiki'] == wiki:
            print('{user} edited {title}'.format(**change))
            event_count += 1
            # stop once five matching changes have been printed
            if event_count > 5:
                break
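
If you want to experiment with the parsing step without connecting to the stream, one option is to pull it out into a small function and feed it strings directly. The sketch below does that; describe_change is a hypothetical helper of my own, and the sample payloads are made up, containing only the wiki, user and title fields that the example above relies on.

```python
import json

def describe_change(data, wiki):
    """Return '<user> edited <title>' for a change on `wiki`, else None."""
    try:
        change = json.loads(data)
    except ValueError:
        return None  # not valid JSON, so there is nothing to report
    if not isinstance(change, dict) or change.get('wiki') != wiki:
        return None  # a change on some other wiki
    return '{user} edited {title}'.format(**change)

# Made-up sample payloads (not real stream output)
samples = [
    '{"wiki": "commonswiki", "user": "ExampleUser", "title": "File:Example.jpg"}',
    '{"wiki": "enwiki", "user": "AnotherUser", "title": "Example"}',
    '',
]

for data in samples:
    line = describe_change(data, 'commonswiki')
    if line is not None:
        print(line)
```

Keeping the parsing separate from the loop that reads the stream makes it easy to check the filtering logic on a handful of known inputs before pointing it at live data.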