If you need to pull all the content from a blog via its RSS feed and dump it into Excel, how do you do it?
If the blog has been archived by Google Reader, it’s pretty easy to do. No-one seems to have written a guide to this online, so I thought I’d do it.
- Use this special URL:
Replace [RSS] with the URL of the RSS feed, [N] with the number of posts wanted (max. 1000)
N=1000 tells the API to provide the previous 1000 posts.
The resulting page can then be saved as an XML document, prior to importing it into Excel.
If you want more than 1000 posts, you have to modify the URL somewhat. First, save the XML file from the previous step. Open it up in a text editor, and look for a tag that looks like this:
<code>gr:continuation CO6a-c7Z_awC /gr:continuation </code>
The tag is located right in the beginning of the file, before the content of the blog starts. If you can’t see this tag, that means that either there aren’t 1000+ posts, or Google doesn’t have an archive of the earlier ones. Google only archives posts if someone starts following it in Google Reader, or uses Feedburner to create a feed. Most popular blogs are fully archived, but smaller ones (like this one) only have a few hundred entries archived.
- Once you have the Continuation String (the text inside the gr:continuation tag), you have to craft a different URL:
Replace [RSS] with the URL of the RSS feed, [N] with the number of posts wanted (max. 1000), and [C] with the Continuation String.
- Using this method, you can retrieve 1000 posts at a time, till you reach the end of the archive. Keep saving the files as XML’s and then you can combine them later in Excel to create one large file with all the data.
Note: Excel 2011 (on the Mac) won’t work here. It opens XML as a flat text file. Excel on Windows creates an XML schema, and this allows for easy manipulation of the data. Once the schema is created in Windows, the resulting file can be saved as an XLS and then opened in Excel 2011.