I finally made time to do a few updates to the RPGExplorer process yesterday. The work resulted in three key improvements: 1) a reduction of HTML fragments and HTML keywords showing up as tags; 2) some additional intelligence applied to the Twitter output stream; and 3) updates to the output on the RPGExplorer site.
The HTML handling was primarily grunt work: cleaning up the text extraction method and the resulting keywords. The text extraction engine often receives malformed HTML fragments that cause keywords from the HTML standard to be flagged as important tags. I improved the text extraction process and shored up the areas where it failed. Additionally, I added analysis of the resulting tags to remove any HTML keywords. Not sexy but badly needed.
Once the tag information was cleaner, I added some pseudo-intelligent hashtag output when writing tweets. The code simply checks for the presence of a subset of the overall tags and applies the appropriate hashtags if space is available in the tweet. To simplify the output, I also used the TinyURL shortener service, which is very meta since Twitter automatically shortens URLs as well but doesn’t make the shortened link available prior to posting the tweet. The shorter URL makes the resulting tweet far easier to read.
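For the curious, the hashtag step can be sketched roughly like this. It is a minimal illustration, not the actual RPGExplorer code: the `HASHTAG_MAP` contents, the 140-character limit, and the `build_tweet` name are all assumptions made for the example.

```python
# Hypothetical mapping from extracted tags to hashtags; the real subset
# used by RPGExplorer is not listed in the post.
HASHTAG_MAP = {
    "pathfinder": "#Pathfinder",
    "osr": "#OSR",
    "maps": "#rpgmaps",
}

TWEET_LIMIT = 140  # assumed tweet length limit


def build_tweet(title, short_url, tags, limit=TWEET_LIMIT):
    """Append a hashtag for each recognized tag while the tweet still fits."""
    tweet = f"{title} {short_url}"
    for tag in tags:
        hashtag = HASHTAG_MAP.get(tag.lower())
        # Skip tags with no mapping and hashtags that are already present.
        if hashtag is None or hashtag in tweet.split():
            continue
        candidate = f"{tweet} {hashtag}"
        # Only keep the hashtag if the whole tweet stays within the limit.
        if len(candidate) <= limit:
            tweet = candidate
    return tweet
```

Calling `build_tweet("New dungeon maps", "https://example.com/x", ["Maps", "dice", "OSR"])` appends hashtags for the two recognized tags and ignores "dice". A hashtag that would push the tweet over the limit is skipped, and shorter ones later in the list still get a chance.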
Finally, I removed author index pages from the RPGExplorer site. The author data did not distinguish between authors who share the same name, so the output intermixed them. As such, it wasn’t useful information. Removing the pages also reduces the size of the overall site, keeping it from outgrowing the limited disk space allocation on the hosting site.
I still have a significant TODO list for the overall process, including building a useful tag taxonomy and building an application to apply periodic online training to the tagging process to coalesce tags into a smaller, more useful set. Those improvements will wait until another day as I want to do some creative stuff for an Old West campaign setting.
I have been tinkering with natural language processing for over six months on and off. My target was always the table top role playing community. The NLP stuff has not delivered the initial dividends I was hoping for. Half the effort was accumulating and polling feeds. That alone has issues. Trying to extract useful information is far harder.
As a byproduct of that work, I rolled out RPGExplorer. It currently contains a rolling 2-3 week window of feed articles linking back to the origin site along with my attempts at categorization/classification/tagging. I’m still poking at the NLP side plus tweaking the theme of the RPGE output.
I’m still uncertain where the end product lies. The current state was initially just a visualization tool for me to determine how well the NLP was doing. I’m going to tweak that side more thoroughly as time allows in between other projects I have in the pipeline.
Thus far, I don’t believe the NLP is up to the end goal of automated classification but the results are still useful for my purposes and perhaps for others. The site also tweets every article it processes at @RPGExplorer. I was going to integrate a G+ page for the same purposes but that API is largely restricted and nearly impossible to use as a hobbyist. Facebook might be more accessible but I have little interest in opening that particular box of worms quite yet.
Accurate maps of towns and cities in the Old West are somewhat difficult to find. Many of the historical archives contain bird’s eye maps of significant sites. However, fewer sites contain actual maps of the towns. As I was searching around, I came across several plat maps from the Sanborn Company. Apparently, Sanborn was a mapping company that started shortly after the end of the Civil War. Its primary product was fire risk maps.
Sanborn operated from 1867-2007 with maps from over 12,000 locations. My particular interest is the era from 1860-1890 but others may be interested in the maps from more modern eras. Many of the older maps have passed into the public domain and are archived by several national and state organizations. Check the external links section on the Sanborn Maps Wikipedia article for a good list of organizations.
Sorry folks, I’m not voting for you or anyone else in the ENnie competition. It’s not because I don’t like your blog, supplement, or website. Nay, it’s because I don’t believe the ENnies are a valid indicator of the tabletop gaming community.
First and foremost, any award site that requires self nomination is broken. If the supplement or whatever was useful, it would not require self nomination. It would already be known, useful and in the hands of game masters everywhere. I suspect most of them already are.
Secondly, send me free stuff … just equates to getting free stuff. No matter the intent. It’s fundamentally jacked up in my mind. Why should anyone need to send physical products in an electronic age for anything other than a physical product category?
Don’t forget, ENWorld, origin of the ENnies, is predominantly a fan site for the latest and greatest edition of either a WotC product or a Paizo product. That’s fine. Functionally, they produce more product than anyone else. Quite often, it is of great quality overall. Certainly, it has good production qualities.
I will not vote in a rigged popularity contest. I think the awards are complete crap. If I were a publisher, I might have a different opinion. I am not.
Good luck to the small publishers.
I was digging through a couple of boxes and came across my copy of Gamma World 1st Edition. Sadly, it is not the original boxed set, just a copy of the rules. I believe I picked it up in the 1990’s on a shopping spree. Leafing through the book, I find it remains a very playable game but not one I was ever interested in running back at the height of my gaming.
Gamma World, like many other games of the TSR era, has been through many permutations. At least 7 editions have been published to date with varying degrees of acceptance. The 1st edition was released in 1978 and penned by James M. Ward and Gary Jaquet.
Gamma World features one of the first post-apocalyptic frameworks in the RPG world. The book references the year 2471 as the start of the experience while referring back to events from 2309-2322, the Shadow Years. The Shadow Years depict a time of violent social unrest that led to the destruction of civilization as it was known. It isn’t hard to imagine how the late 1960’s and early 1970’s could have influenced that portrait of the world. The exact downfall of the world is not stated. Hints of nuclear and bio-warfare are present but no specifics are given in the roughly one-page introduction of the game setting.
I recently started an RSS crawler after the announcement that the RPGBA was shutting down. My goals were rather simple: 1) collate information for personal use; and 2) apply some level of natural language processing for tagging. Sounded good on paper when I first sketched it out.
And now onto the problem list.
#1: RSS Feeds are a PITA
Quite simply, RSS has standards. Multiple standards. So what works for RSS doesn’t necessarily work for an ATOM based output. Both RSS and ATOM have variants. Some fields you’d like to see just are not present in one versus the other. The vast majority of the differences are “nice to have” rather than critical. So I just adjusted the code.
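To give a flavor of what "adjusting the code" means, here is a minimal sketch that maps both RSS 2.0 items and Atom entries onto the same three fields using only the Python standard library. The `parse_entries` name and the choice of fields are assumptions for illustration, not my crawler's actual code, and RSS 1.0 (the namespaced RDF variant) is deliberately not handled.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def parse_entries(xml_text):
    """Return a list of {title, link, summary} dicts from RSS 2.0 or Atom XML."""
    root = ET.fromstring(xml_text)
    entries = []
    if root.tag == f"{ATOM_NS}feed":
        # Atom: entries are namespaced and the link lives in an href attribute.
        for entry in root.findall(f"{ATOM_NS}entry"):
            link = entry.find(f"{ATOM_NS}link")
            entries.append({
                "title": entry.findtext(f"{ATOM_NS}title", default=""),
                "link": link.get("href", "") if link is not None else "",
                "summary": entry.findtext(f"{ATOM_NS}summary", default=""),
            })
    else:
        # RSS 2.0: items are not namespaced and the link is element text.
        for item in root.iter("item"):
            entries.append({
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "summary": item.findtext("description", default=""),
            })
    return entries
```

Missing fields simply come back as empty strings, which matches the "nice to have rather than critical" nature of most of the differences.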
#2: Contextual Analysis
To achieve reasonable context extraction, the post summary generally doesn’t have sufficient length. Thus, I used a web API to extract the full text of articles on the blogs I was curating. No surprise, it has problems extracting reasonable text from many sites. Some failures were due to malformed HTML, others to different issues. I’d often get the site’s word cloud of tags, categories, etc. Even worse, I’d pick up the “Blogs I Read” section.
Grabbing the full list of tags/categories has major issues for contextual categorization. It floods the detector with a large list of keywords that aren’t actually tied to the specific post.
Unfortunately, my few test sites didn’t have the text extraction problem so I collated a whole lot of nonsensical data over the course of a few weeks. When I sat down to build an initial taxonomy, the problem was obvious.
Solving the problem wasn’t easy. First, I took a look at sentence length to avoid overly short sentences. Some of the extraction returns huge swaths of text. For that, I looked first at overall sentence length and then at words per sentence. In the end, it’s simply not tractable to solve. Too many writers use complicated sentences, compounded upon compounds. The solution is better but can never be precise.
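The filtering idea looks roughly like the sketch below. The thresholds (`min_words`, `max_words`, `max_chars`) and the naive regex sentence splitter are assumptions picked for the example; the values I actually use differ and took a fair amount of fiddling.

```python
import re


def filter_sentences(text, min_words=4, max_words=60, max_chars=400):
    """Keep sentences that are neither too short nor suspiciously long."""
    # Naive split on whitespace that follows ., ! or ? (an approximation).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        word_count = len(sentence.split())
        # Very short "sentences" are usually menu or widget text; very long
        # ones are usually tag clouds or blogrolls with no punctuation.
        if min_words <= word_count <= max_words and len(sentence) <= max_chars:
            kept.append(sentence)
    return kept
```

A genuinely long compound sentence gets thrown out along with the tag clouds, which is exactly why this can only ever be approximate.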
#3: Keyword Extraction
For keyword extraction, I again turned to a natural language processing API. As you might expect, if the target text contains non-essential information, it returns garbage. Worse, it ties the post to a huge cloud of tags that aren’t relevant, as mentioned in #2.
The target information I wanted is obviously web site data. The text extraction engine quite often returns HTML fragments. Chunks of HTML muck up the keyword extractor. Originally, I did some keyword cleanup but it was insufficient.
I eventually wrote a normalization routine that attempts to make capitalization sane across the keywords. It also strips out often seen HTML fragments.
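A stripped-down sketch of that kind of routine follows. The `HTML_WORDS` stop list, the regex, and the capitalization rule (title-case everything except all-caps acronyms) are assumptions for illustration rather than my exact code.

```python
import re

# Hypothetical stop list of HTML leftovers; the real list is longer.
HTML_WORDS = {"div", "span", "href", "nbsp", "img", "src", "br"}

# Matches whole tags such as <div> and entities such as &nbsp;
TAG_RE = re.compile(r"<[^>]*>|&[a-z]+;", re.IGNORECASE)


def normalize_keywords(keywords):
    """Strip HTML debris, normalize capitalization, and drop duplicates."""
    seen = set()
    result = []
    for keyword in keywords:
        cleaned = TAG_RE.sub(" ", keyword)
        words = [w for w in cleaned.split() if w.lower() not in HTML_WORDS]
        if not words:
            continue  # nothing left once the HTML debris is removed
        # Leave all-caps acronyms alone; capitalize everything else.
        normalized = " ".join(w if w.isupper() else w.capitalize() for w in words)
        key = normalized.lower()
        if key not in seen:
            seen.add(key)
            result.append(normalized)
    return result
```

The first spelling seen wins when two keywords differ only by case, so "gamma world" and "GAMMA WORLD" collapse into a single "Gamma World" entry.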
In the end, I tossed out about 6 weeks of aggregation. The garbage text I allowed to flow through the process overwhelmed the useful results. I’ve now started to accumulate data once again. It’s better but far from perfect. The goal is not to have perfect inputs but rather to extract some knowledge from the noise.
Blog writers… some of you use some seriously long sentences. Shorter is better. Break up those massively compound sentences. The idea will be more clearly expressed. Just my $0.02.
Well, sometimes I have edge case bugs that are hard to track down. The failure to generate female names via the Medieval Names Generator was not one of those. I’m certain it worked at one point but I must have broken it somewhere along the way, in a change I have yet to find in the change log history.
I tweaked the interface slightly to use a robust paradigm used in many of my other generators to fix the issue. Unfortunately, it had been broken for many months.
My thanks to the anonymous individual who dropped me feedback to let me know it was busted. My apologies to those who tried to generate female names only to get male results.
Bugs. I code them daily.