Apparently the ROTWORLD NPC generator broke during an upgrade some time in the past. The generator was fixed earlier today and is once again functional.
Once more unto the breach… uh, no, once more onto bug fixes. The prior changes caused some posts to fail to be tweeted. Due to the eradication of HTML tags, a null value was occasionally returned into the tag list, which I failed to take into account.
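The fix amounts to filtering null entries out of the tag list before the tweet is built. A minimal sketch of the idea, with hypothetical function names (the actual code is not shown in the post):

```python
import re

def extract_tag(html_fragment):
    """Strip HTML tags from a fragment and return the remaining text,
    or None when nothing usable is left (the case the bug missed)."""
    text = re.sub(r"<[^>]+>", "", html_fragment).strip()
    return text or None

def clean_tag_list(fragments):
    # Guard against None entries so downstream tweet formatting
    # never receives a null tag.
    return [t for t in (extract_tag(f) for f in fragments) if t is not None]

print(clean_tag_list(["<b>dungeon</b>", "<img src='x'>", "magic"]))
# → ['dungeon', 'magic']
```

A fragment that is pure markup (like the `<img>` above) reduces to an empty string, which is exactly the null-ish value that slipped through before.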
I finally made time to do a few updates to the RPGExplorer process yesterday. The work resulted in three key improvements: 1) reduction of HTML fragments and HTML keywords appearing as tags; 2) some additional intelligence in the Twitter output stream; and 3) updates to the output on the RPGExplorer site.
The HTML handling was primarily grunt work: cleaning up the text extraction method and the resulting keywords. The text extraction engine often receives malformed HTML fragments that cause keywords from the HTML standard to be flagged as important tags. I improved the text extraction process and shored up the areas where it failed. Additionally, I added analysis of the resulting tags to remove any HTML keywords. Not sexy, but badly needed.
Once the tag information was cleaner, I added some pseudo-intelligent hashtag output when writing tweets. The code simply checks for the presence of a subset of the overall tags and applies appropriate hashtags if space is available in the tweet. To simplify the output, I also used the TinyURL shortener service, which is rather meta since Twitter automatically shortens URLs as well but doesn’t make that information available prior to posting the tweet. The shorter URL length makes the resulting tweet far easier to read.
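The "apply hashtags if space is available" logic can be sketched as a simple length-budget loop. The tag-to-hashtag map, function names, and the 140-character limit are my assumptions for illustration, not the post's actual code:

```python
TWEET_LIMIT = 140  # assumed: the classic Twitter limit of the era

# Hypothetical mapping from a known subset of tags to hashtags.
HASHTAG_MAP = {
    "dnd": "#DnD",
    "osr": "#OSR",
    "pathfinder": "#Pathfinder",
    "rpg": "#RPG",
}

def build_tweet(title, short_url, tags, limit=TWEET_LIMIT):
    """Start with title + shortened URL, then append hashtags
    for recognized tags as long as they still fit."""
    tweet = f"{title} {short_url}"
    for tag in tags:
        hashtag = HASHTAG_MAP.get(tag.lower())
        if hashtag and len(tweet) + 1 + len(hashtag) <= limit:
            tweet += " " + hashtag
    return tweet
```

Shortening the URL first (via TinyURL) frees up more of the budget for hashtags, which is the point of doing it before composing the tweet.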
Finally, I removed author index pages from the RPGExplorer site. The author information was not uniquely identifying authors who share the same name, so the output intermixed them. As such, it wasn’t useful information. Removing the pages also reduces the size of the overall site, keeping it from outgrowing the limited disk space allocation on the hosting site.
I still have a significant TODO list for the overall process, including building a useful tag taxonomy and building an application to apply periodic online training to the tagging process to coalesce tags into a smaller, more useful set. Those improvements will wait for another day, as I want to do some creative stuff for an Old West campaign setting.
I have been tinkering with natural language processing on and off for over six months. My target was always the tabletop role-playing community. The NLP work has not delivered the initial dividends I was hoping for. Half the effort was accumulating and polling feeds, and that alone has issues. Trying to extract useful information is far harder.
As a byproduct of that work, I rolled out RPGExplorer. It currently contains a rolling 2-3 week window of feed articles linking back to the origin site along with my attempts at categorization/classification/tagging. I’m still poking at the NLP side plus tweaking the theme of the RPGE output.
I’m still uncertain where the end product lies. The current state was initially just a visualization tool for me to determine how well the NLP was doing. I’m going to tweak that side more thoroughly as time allows in between other projects I have in the pipeline.
Thus far, I don’t believe the NLP is up to the end goal of automated classification, but the results are still useful for my purposes and perhaps others. The site also tweets every article it processes at @RPGExplorer. I was going to integrate a G+ page for the same purposes, but that API is largely restricted and nearly impossible to use as a hobbyist. Facebook might be more accessible, but I have little interest in opening that particular box of worms quite yet.
I recently started an RSS crawler after the announcement that the RPGBA was shutting down. My goals were rather simple: 1) collate information for personal use, and 2) apply some level of natural language processing for tagging. Sounded good on paper when I first sketched it out.
And now onto the problem list.
#1: RSS Feeds are a PITA
Quite simply, RSS has standards. Multiple standards. What works for RSS doesn’t necessarily work for an Atom-based feed, and both RSS and Atom have variants. Some fields you’d like to see just aren’t present in one versus the other. The vast majority of the differences are “nice to have” rather than critical, so I just adjusted the code.
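The core adjustment is branching on which dialect you received. A minimal sketch with the standard library, covering just RSS 2.0's `<item>`/`<description>` versus Atom's namespaced `<entry>`/`<summary>` (real feeds have many more variants than shown here):

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def parse_feed(xml_text):
    """Return (title, summary) pairs from either RSS 2.0 or Atom XML."""
    root = ET.fromstring(xml_text)
    items = []
    if root.tag == "rss":  # RSS 2.0: <rss><channel><item>…
        for item in root.iter("item"):
            items.append((item.findtext("title", default=""),
                          item.findtext("description", default="")))
    elif root.tag == ATOM_NS + "feed":  # Atom: namespaced <entry> elements
        for entry in root.iter(ATOM_NS + "entry"):
            items.append((entry.findtext(ATOM_NS + "title", default=""),
                          entry.findtext(ATOM_NS + "summary", default="")))
    return items
```

In practice a library such as `feedparser` normalizes most of these differences for you, but the "nice to have" fields still go missing feed by feed.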
#2: Contextual Analysis
The post summary generally isn’t long enough for reasonable context extraction, so I used a web API to extract the full text of articles on the blogs I was curating. No surprise, it has problems extracting reasonable text from many sites, some due to malformed HTML, others for different reasons. I’d often get the site’s word cloud of tags, categories, etc. Even worse, I’d pick up the “Blogs I Read” section.
Grabbing the full list of tags/categories has major issues for contextual categorization. It floods the detector with a large list of keywords that aren’t actually tied to the specific post.
Unfortunately, my few test sites didn’t have the text extraction problem so I collated a whole lot of nonsensical data over the course of a few weeks. When I sat down to build an initial taxonomy, the problem was obvious.
Solving the problem wasn’t easy. First, I looked at sentence length to avoid overly short sentences. Some of the extractions return huge swaths of text; for those, I looked first at overall sentence length and then at words per sentence. In the end, it’s simply not tractable to solve perfectly. Too many writers use complicated sentences, compounds upon compounds. The solution is better but can never be precise.
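The heuristic above can be sketched as a word-count filter per sentence. The thresholds and function name are my own illustrative assumptions, not values from the post:

```python
import re

MIN_WORDS = 4    # assumed: below this, likely a tag/word-cloud fragment
MAX_WORDS = 60   # assumed: above this, likely run-on extraction noise

def plausible_sentences(text):
    """Keep sentences whose word counts fall in a sane range; a crude
    filter for word-cloud, blogroll, and run-on extraction residue."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences
            if MIN_WORDS <= len(s.split()) <= MAX_WORDS]
```

The imprecision the post describes lives in the split itself: a heavily compound sentence counts as one enormous "sentence" and gets thrown out even when it carries real content.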
#3: Keyword Extraction
For keyword extraction, I again turned to a natural language processing API. As you might expect, if the target text contains non-essential information, it returns garbage. Worse, as mentioned in #2, it ties the results heavily to a cloud of tags that aren’t actually relevant.
The target information I wanted is obviously web site data, and the text extraction engine quite often returns HTML fragments. Chunks of HTML muck up the keyword extractor. Originally, I did some keyword cleanup, but it was insufficient.
I eventually wrote a normalization routine that attempts to make capitalization sane across the keywords. It also strips out often seen HTML fragments.
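A normalization routine of that shape might look like the sketch below. The fragment list, regexes, and function name are assumptions for illustration; the post doesn't show the actual implementation:

```python
import re

# Hypothetical list of HTML leftovers that kept leaking into keywords.
HTML_FRAGMENTS = {"nbsp", "amp", "div", "span", "href", "http", "www"}

def normalize_keywords(keywords):
    """Strip entity/tag residue, drop keywords that still look like
    HTML, and apply consistent capitalization so duplicates collapse."""
    cleaned = set()
    for kw in keywords:
        kw = re.sub(r"&[a-z]+;|<[^>]*>", " ", kw)   # entities and tags
        kw = re.sub(r"\s+", " ", kw).strip().lower()
        if not kw or any(tok in HTML_FRAGMENTS for tok in kw.split()):
            continue
        cleaned.add(kw.title())                      # sane capitalization
    return sorted(cleaned)
```

Title-casing everything is a blunt instrument (it mangles acronyms like "NPC"), but it makes `old school` and `OLD SCHOOL` collapse into one tag, which is the point.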
In the end, I tossed out about six weeks of aggregation. The garbage text I allowed to flow through the process overwhelmed the useful results. I’ve now started to accumulate data once again. It’s better but far from perfect. The goal is not to have perfect inputs but rather to extract some knowledge from the noise.
Blog writers… some of you use some seriously long sentences. Shorter is better. Break up those massively compound sentences. The idea will be more clearly expressed. Just my $0.02.
Well, sometimes I have edge case bugs that are hard to track down. The failure to generate female names via the Medieval Names Generator was not one of those. I’m certain it worked at one point, but I must have broken it somewhere I have yet to find in the change log history.
I tweaked the interface slightly, moving it to a more robust paradigm used in many of my other generators, which fixed the issue. Unfortunately, it had been broken for many months.
My thanks to the anonymous individual who dropped me feedback to let me know it was busted. My apologies to those who tried to generate female names only to get male results.
Bugs. I code them daily.
I fixed a minor issue in the Labyrinth Lord treasure generation system. Specifically, the issue was not capping the level for druids (from the AEC) at a maximum of 14; the other spell casting classes have spell progressions to level 20. Due to the lack of a cap, the spell book creation process occasionally failed for druids.
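The fix is a simple clamp before indexing the spell progression table. A sketch of the idea, with a hypothetical class table and function name (the generator's real code is not shown):

```python
# Hypothetical class-to-cap table: Labyrinth Lord AEC druids top out
# at level 14, while the other casting classes progress to 20.
CASTER_LEVEL_CAP = {
    "cleric": 20,
    "magic-user": 20,
    "illusionist": 20,
    "druid": 14,
}

def effective_caster_level(char_class, level):
    """Clamp a rolled level to the class's spell-progression table
    so spell book generation never indexes past the table's end."""
    return min(level, CASTER_LEVEL_CAP.get(char_class, 20))
```

Without the clamp, a druid rolled at level 17 would look up a progression row that doesn't exist, which is exactly the intermittent failure described above.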