Crawling Some Feeds

Apr 29, 2015
Mark

I recently started writing an RSS crawler after the announcement that the RPGBA was shutting down. My goals were rather simple: 1) collate information for personal use, and 2) apply some level of natural language processing for tagging. Sounded good on paper when I first sketched it out.

And now onto the problem list.

#1: RSS Feeds are a PITA

Quite simply, RSS has standards. Multiple standards. So what works for RSS doesn't necessarily work for Atom-based output, and both RSS and Atom have variants. Some fields you'd like to see simply aren't present in one format or the other. The vast majority of the differences are "nice to have" rather than critical, so I just adjusted the code.
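As a rough illustration (not the exact code I run), Python's feedparser library smooths over most of the RSS/Atom differences as long as you treat every field as optional:

```python
import feedparser  # handles RSS 0.9x/1.0/2.0 and Atom behind one interface

def parse_entries(feed_url):
    """Pull entries from an RSS or Atom feed, treating every field as optional."""
    feed = feedparser.parse(feed_url)
    entries = []
    for entry in feed.entries:
        entries.append({
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            # RSS uses <description>, Atom uses <summary>/<content>;
            # feedparser maps these onto "summary", which can still be missing.
            "summary": entry.get("summary", ""),
            # Dates may arrive as published (RSS/Atom) or only as updated (Atom).
            "published": entry.get("published_parsed") or entry.get("updated_parsed"),
            # Categories/tags are optional in both formats.
            "tags": [t.get("term", "") for t in entry.get("tags", [])],
        })
    return entries
```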

#2: Contextual Analysis

The post summary generally isn't long enough for reasonable context extraction, so I used a web API to extract the full text of articles from the blogs I was curating. No surprise, it has problems extracting reasonable text from many sites, some due to malformed HTML, others to different issues. I'd often get the site's word cloud of tags, categories, etc. Even worse, I'd pick up the "Blogs I Read" section.

Grabbing the full list of tags/categories causes major problems for contextual categorization: it floods the detector with a large list of keywords that aren't actually tied to the specific post.
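For illustration only (my pipeline actually calls out to a web API), a local approach with BeautifulSoup would strip the obvious sidebar chrome before grabbing the text. The selector list below is a guess at common blog theme markup:

```python
from bs4 import BeautifulSoup

# Guesses at the markup common blog themes use for sidebars, tag clouds,
# and blogrolls; the exact selectors vary from site to site.
BOILERPLATE_SELECTORS = ["nav", "aside", "footer", "#sidebar",
                         ".widget", ".tagcloud", ".blogroll"]

def extract_article_text(html):
    """Drop obvious navigation/sidebar chrome, then pull the article text."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in BOILERPLATE_SELECTORS:
        for node in soup.select(selector):
            node.decompose()
    # Prefer the <article> element when the theme provides one, else fall back.
    article = soup.find("article") or soup.body or soup
    return article.get_text(separator=" ", strip=True)
```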

Unfortunately, my few test sites didn't have the text extraction problem, so I collated a whole lot of nonsensical data over the course of a few weeks. When I sat down to build an initial taxonomy, the problem was obvious.

Solving the problem wasn't easy. First, I looked at sentence length to weed out overly short sentences. Some of the extraction returns huge swaths of text; for that, I looked first at overall sentence length and then at words per sentence. In the end, it's simply not tractable to solve perfectly. Too many writers use complicated sentences, compounds piled upon compounds. The solution is better but can never be precise.
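A minimal sketch of that filter follows; the thresholds are illustrative, since all I really tracked was overall sentence length and words per sentence:

```python
import re

# Illustrative cutoffs, tuned by eyeballing the output rather than anything rigorous.
MIN_CHARS, MAX_CHARS = 20, 600
MIN_WORDS, MAX_WORDS = 4, 60

def usable_sentences(text):
    """Keep sentences that are neither fragments nor run-on walls of text."""
    # Naive split on terminal punctuation; good enough for a coarse filter.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [
        s.strip()
        for s in sentences
        if MIN_CHARS <= len(s) <= MAX_CHARS
        and MIN_WORDS <= len(s.split()) <= MAX_WORDS
    ]
```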

#3: Keyword Extraction

For keyword extraction, I again turned to a natural language processing API. As you might expect, if the target text contains non-essential information, it returns garbage. Worse, it ties the results heavily to the cloud of tags that, as mentioned in #2, aren't relevant to the post.

The target information I wanted is obviously web site data, and the text extraction engine quite often returns HTML fragments. Chunks of HTML muck up the keyword extractor. Originally, I did some keyword cleanup, but it was insufficient.

I eventually wrote a normalization routine that attempts to make capitalization sane across the keywords. It also strips out commonly seen HTML fragments.
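A rough sketch of what that routine does; the acronym handling and the set of fragments it strips are simplified here for illustration:

```python
import html
import re

# Matches stray tags and character entities that leak through extraction.
HTML_FRAGMENT = re.compile(r"</?\w+[^>]*>|&[a-zA-Z]+;|&#\d+;")

def normalize_keyword(keyword):
    """Strip HTML debris from one keyword and make its capitalization sane."""
    keyword = html.unescape(keyword)
    keyword = HTML_FRAGMENT.sub(" ", keyword)
    keyword = re.sub(r"\s+", " ", keyword).strip()
    # Lowercase everything except tokens that look like acronyms ("RSS", "GM").
    words = [w if w.isupper() and len(w) > 1 else w.lower()
             for w in keyword.split()]
    return " ".join(words)

def normalize_keywords(keywords):
    """Normalize a keyword list, dropping duplicates and emptied entries."""
    seen, cleaned = set(), []
    for raw in keywords:
        kw = normalize_keyword(raw)
        if kw and kw not in seen:
            seen.add(kw)
            cleaned.append(kw)
    return cleaned
```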

Wrap Up

In the end, I tossed out about 6 weeks of aggregation. The garbage text I allowed to flow through the process overwhelmed the useful results. I've now started to accumulate data once again. It's better but far from perfect. The goal is not to have perfect inputs but rather to extract some knowledge from the noise.

P.S.

Blog writers… some of you use some seriously long sentences. Shorter is better. Break up those massively compound sentences. The idea will be more clearly expressed. Just my $0.02.

