Parse WordPress Post Export With Python

By | February 6, 2021

Within WordPress it is possible to export all historic posts as an XML file. This XML file is a little unwieldy, but it can be parsed with Python. This post shows how you can use the Python feedparser library to easily access the elements of the export.

When you export, WordPress is keen to emphasise that the file is not intended as a backup, so be warned.

WordPress Post Export

One option for handling this exported XML file is to manually parse it to access the elements you are interested in.
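As a rough illustration of the manual route, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes each post is an <item> element whose body sits in a <content:encoded> child, as in RSS. The sample XML and the function name extract_posts are made up for this example, and a real export contains many more elements.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used for the full post body (<content:encoded>).
NAMESPACES = {"content": "http://purl.org/rss/1.0/modules/content/"}

# A tiny stand-in for an export file (hypothetical sample data).
SAMPLE_EXPORT = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <content:encoded><![CDATA[<p>Hello world</p>]]></content:encoded>
    </item>
  </channel>
</rss>
"""


def extract_posts(xml_text):
    """Return a list of (title, body) tuples, one per <item> element."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext(
            "content:encoded", default="", namespaces=NAMESPACES
        )
        posts.append((title, body))
    return posts


posts = extract_posts(SAMPLE_EXPORT)
print(posts)
```

Even in this small case you need to know the namespace for each prefixed element, which is part of what makes the manual route fiddly.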

As it happens, however, the WordPress export XML is close enough to an RSS feed that we can use the Python package feedparser.

Parse With Feedparser

Feedparser can parse URLs or local files. Once you have exported your WordPress posts, you can parse the resulting XML file with this Python code:

import feedparser

# Path to the exported XML file; adjust it to match your own export.
data = feedparser.parse('./data/deparkes.WordPress.2021-01-23.xml')

Once you have loaded the parsed data, you can explore it following the principles described on this page.

As a quick example, here is how you could extend that script to access the actual content of each post.

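posts = []

# The post body is the first 'content' element; skip entries that lack one.
for entry in data['entries']:
    if entry.get('content'):
        posts.append(entry['content'][0]['value'])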

In this case you may also want to clean up the HTML from within the posts, such as with this approach:
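One possible way to do this, offered only as a sketch and not necessarily the approach referred to above, is a small tag stripper built on the standard library's html.parser. The names TagStripper and strip_html are hypothetical, chosen for this example. It keeps all text nodes, so the contents of any script or style elements would come through as well.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect the text between tags and discard the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html_text):
    """Return html_text with its markup removed."""
    stripper = TagStripper()
    stripper.feed(html_text)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.chunks)


print(strip_html("<p>Hello <strong>world</strong> &amp; friends</p>"))
```

You could apply this to every post with something like [strip_html(post) for post in posts].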