TunkRank Improvements

Posted: 17 February 2010 in Uncategorized
Tags: , , , , , , , , ,

Over the past few weeks, I’ve been working on a number of improvements to TunkRank that I will be rolling out tonight. First, I’ve secured a server to host it on, rather than my old Dell laptop, so reliability should improve and TunkRank is no longer at the mercy of dynamic DNS problems. Also, my cable company is less likely to hunt me down. TunkRank has gotten increased attention over the past few weeks, including from Chris Dixon, CEO of the wonderful website Hunch:

Twitter could fix the whole follower obsession by highlighting a more meaningful metric like TunkRank.

Awesome! So with this new version, there are a few changes that will immediately impact you, the end-user. I’ll go into the ones that affect you the most first, followed by some technical points of interest for those who care. Then I’ll conclude with a couple of hints at the future.

Changes to TunkRank

First and foremost, I have changed the main score that is reported. Previously I was using a percentile in the range 1-100. This drew a lot of objections and created confusion, partially because I consider the 100th percentile to be the “top-tier” of users, while standardized testing often reports the 99th percentile to mean you performed better than 99% of the population. Also, most people who care about their scores enough to use TunkRank fall in the 95-100 percentile range, which makes fine-grained comparisons difficult. Neal Richter even posted some suggestions for improving it on his blog (quite a while ago, now).

I took a page out of Neal’s book with the log scores, but I also put it in a range where the most influential Twitter user (let’s call her MAX) always has a score of 100. Your TunkRank Score™ is the ratio of the log of your raw score to the log of MAX’s raw score, scaled to 100. So formulas aside, this means your TunkRank score is directly comparable to other users’ scores and is always in perspective of the maximum influence exerted by any user in the Twitterverse. Incidentally, a difference of about seven TunkRank score points means the higher-scoring user is roughly twice as influential.
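For the curious, here is a minimal sketch of that normalization in Ruby. The function name and the raw-score values are made up for illustration; the real raw scores come from the TunkRank crawl, not from this snippet.

```ruby
# Hypothetical sketch of the log-scaled score: the ratio of the log of your
# raw score to the log of MAX's raw score, scaled to 100.
def tunkrank_score(raw, max_raw)
  100.0 * Math.log(raw) / Math.log(max_raw)
end

MAX_RAW = 20_000.0  # invented value for MAX's raw score
tunkrank_score(MAX_RAW, MAX_RAW)  # => 100.0, since MAX always scores 100
```

With a MAX raw score around 20,000 (again, an invented number), doubling a raw score adds about seven points, which is where the "seven points means twice as influential" rule of thumb comes from.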

Accessing the API has also changed slightly, and I apologize to anyone actually using it at the moment. Basically, I have changed the API calls to conform more closely to the URLs used on the web side, and I’m returning more information with each call. TunkRank also supports XML responses in addition to JSON. You can find all of the documentation here.
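Consuming a JSON response is a one-liner with Ruby’s standard library. The field names in this sketch are purely illustrative, not the documented schema; consult the API documentation for the real format.

```ruby
require 'json'

# A hypothetical JSON payload with made-up field names, standing in for a
# real API response body.
body = '{"screen_name": "tunkrank", "score": 42.0, "rank": 1234}'
user = JSON.parse(body)
user['score']  # => 42.0
```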

Some Technical Notes

As part of the move, I’ve decided to transition from Merb to Rails. My original decision to use Merb was partially a learning exercise, but Merb also appealed to me because it is lightweight. However, I often ran into roadblocks because some useful plugin wasn’t supported (or I couldn’t figure out how to make it work in the limited time I had). Sometimes the documentation for Merb was very good, and sometimes it was absent altogether. Rails, on the other hand, has a substantial amount of documentation, and people are always blogging about the best way to do things, which makes life as a developer much easier. Rails is my day job, so I knew I could transition quickly and easily.

I also migrated from MySQL to PostgreSQL. The main reason is that I love PostgreSQL — plain and simple. They both have their advantages, but MySQL gives me a sense of uneasiness I don’t have with PostgreSQL. I’ve managed to achieve some nice speed improvements as part of the redesign, though that is not to say that the same speed improvements wouldn’t have been possible with MySQL.

I’ve also adopted Resque as my background job-processing library. It is backed by Redis, an advanced key-value store that you can think of as a “data structures server.” The important thing for me is that Resque is fast, has a kick-ass web interface, and integrating with Rails is brain-dead easy.
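To show how minimal the Resque job convention is, here is a sketch of what such a job looks like: a plain class with a `@queue` name and a class-level `perform` method. The job name `RefreshScore` is invented for illustration; with the resque gem loaded, you would enqueue it with `Resque.enqueue(RefreshScore, user_id)`, and a worker would later call `RefreshScore.perform(user_id)` in the background.

```ruby
# Sketch of a Resque-style job class (RefreshScore is a made-up name).
class RefreshScore
  # Resque reads this instance variable to decide which queue the job goes on.
  @queue = :scores

  # Resque workers call this class method with the arguments passed to
  # Resque.enqueue. A real job would recompute the user's score here.
  def self.perform(user_id)
    "recomputed score for user #{user_id}"
  end
end
```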

The Road Ahead

I wrote before about the road ahead for TunkRank, and I have mostly held to it. I have many more ideas I want to expand on, including topic-sensitive influence rankings. I like the ideas in the recent WSDM paper (pdf) by Weng et al., but I have a few new ideas I’m eager to try out. TunkRank scores may also be integrated into Tickery in the near future, thanks to some discussions with Terry Jones of FluidDB. I’m excited!

At the Atlanta Semantic Web Meetup tonight, Vishy Dasari gave us a quick description and demo of a new search engine called Semantifi.  It is purportedly a search engine for the deep web, meaning the parts of the web that traditional search engines do not index because the content is dynamic.  It is in the very early stages, but the site is open for people to play with and add data to via “Apps.”  These apps are something like agents that respond to queries, returning results to a marshaling process that decides which App gets the right to answer.  Results are ranked by some method I wasn’t able to ascertain, but it reminded me of how Amy Iris works.  These apps seem to form the backbone of the Semantifi system, and the company is crowdsourcing their creation.  You can create a very simple app to return answers over your own data set in a few short minutes.

Perhaps more interesting is that it offers a natural language interface in addition to the standard query interface we’re all used to.  Given the small amount of data currently available, I couldn’t really determine how well this interface performs.  It is based on a cognitive theory by John Hawks (sp?) that apparently states we think in terms of patterns.  That’s very general, and I haven’t been able to chase down the reference; I also forgot to ask Vishy for more info at the meetup.  If someone can clear that up for me, I’d be grateful.  The only seemingly relevant John Hawks I could find is a paleoanthropologist, so I’m not sure.  Anyhow, these patterns are what Vishy says the system uses to interpret natural language input.  That may be a grandiose way of saying n-gram matching.

While Wolfram|Alpha is a computational knowledge engine™, Semantifi does not make that claim. Apps may compute certain things like mortgage values, but it’s not a general-purpose calculator.  However, Semantifi is looking at bringing in unstructured data from blogs and the like, which W|A ignores.  It remains to be seen what that will look like, though.  Also, users can contribute to Semantifi, while W|A is a black box.  In any case, they are making interesting claims, and I look forward to seeing how they play out with more data.

Note: All of my observations are based on notes and memories of tonight’s presentation, so if I made any mistakes please post corrections in the comments or email me.

Unintentional HCIR commercial

Posted: 7 November 2009 in Uncategorized
Tags: , , ,

This commercial just caught my eye and made me think about faceted search.


Posted: 31 October 2009 in Uncategorized
Tags: , ,

We decided to do a complicated pumpkin design this year and it turned out surprisingly well!  I present, the Daedalpumpkin:


The Daedalpumpkin




There are quite a few well-known libraries for doing various NLP tasks in Java and Python, such as the Stanford Parser (Java) and the Natural Language Toolkit (Python).  For Ruby, there are a few resources out there, but they are usually derivative or not as mature.  By derivative, I mean they are ports from other languages or extensions using code from another language.  And I’m responsible for two of them! :)

  • Treat – Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I’ve encountered so far for Ruby
    • Text extractors for various document formats
    • Chunkers, segmenters, tokenizers
    • LDA
    • much more – the list is big
  • Ruby Linguistics – this is one of the more ambitious projects, but is not as mature as NLTK
    • interface for WordNet
    • Link grammar parser
    • some inflection stuff
  • Stanford Core NLP – if you’ve gotten a headache trying to use the Java bridge, this is your answer
  • Stanford Parser interface – uses a Java bridge to access the Stanford Parser library
  • Mark Watson has a part of speech tagger [zip], a text categorizer [zip], and some text extraction utilities [zip], but I haven’t tried to use them yet
  • LDA Ruby Gem – Ruby port of David Blei’s lda-c library by yours truly
    • Uses Blei’s C code for the actual LDA, but I include some wrappers to make using it a bit easier
  • UEA Stemmer – Ruby port (again by yours truly) of a conservative stemmer based on Jenkins and Smith’s UEA Stemmer
  • Stemmer gem – Porter stemmer
  • Lingua Stemmer – another stemming library, Porter stemmer
  • Ruby WordNet – basically what’s included in Ruby Linguistics
  • Raspell – Ruby interface to Aspell spell checker

There are also a number of fledgling or orphaned projects out there purporting to be ports of or interfaces to various other libraries, like the Stanford POS Tagger and Named Entity Recognizer.  Ruby (straight Ruby, not just JRuby) can interface with just about any Java library using the Ruby Java Bridge (RJB).  RJB can be a pain, and I could only initialize it once per run (a second attempt never succeeds), so there are some limitations.  But using it, I was able to easily interface with the Stanford POS tagger.

So while there aren’t terribly many libraries for NLP tasks in Ruby, the ability to interface with Java directly widens the scope quite a bit.  You can also incorporate a C library using native extensions.

Naturally, if I missed anything, no matter how small, please let me know.

Update: Here is a great list of AI-related ruby libraries from Dustin Smith.

Books and movies

Posted: 16 August 2009 in Uncategorized
Tags: , , , ,

This post contains NO spoilers.

I saw The Time Traveler’s Wife with my wife today.  I had read the book about a year ago, and had been looking forward to the movie.  I wasn’t disappointed — I thought the movie was very moving and captured the spirit of the book, even if it didn’t capture everything.  It ignored some dynamics that the book elaborated on and some scenes and details were slightly different.

One thing I wondered while watching the movie was how much I liked it because I knew all the background from the book, and how much of the experience came from the movie itself.  If the former, then the movie wasn’t going to be that great an experience for someone who hadn’t read the book.  If the latter, then it was a damn good movie.  I don’t have the answer to that.

Another observation: it’s a cultural norm in our society to bash movies based on books, and yet we watch them so relentlessly that Hollywood feels compelled to turn every book that sells a few copies into one.  Douglas Adams once made the point that he changed the story of The Hitchhiker’s Guide to the Galaxy to match the medium he was writing for.  A story that plays well on the radio can take advantage of completely different things when it is translated to book or movie form.  I don’t have the exact quote, and searching for that kind of thing is damn near impossible on Google (let me know if you find it).

But that’s an observation I have long taken to heart when watching movies translated from books.  Obviously you can’t fit an entire book into two hours and still have a story that plays like anything worth watching.  You can’t capture the full power of every scene, every nuance, or every subtlety the way a book can.  That’s not what the silver screen does well.  What it does well (when it is done right) is make you feel in touch with the characters and the story.  Books do that too, but movies actually put the images before your eyes.

That said, I have never been able to bring myself to read a book based on a movie.  I just can’t do it.

Next total solar eclipse in Atlanta

Posted: 4 August 2009 in Uncategorized
Tags: , , ,

I already knew Wolfram|Alpha could do some cool astronomy calculations, like comparing the escape velocities of the Galilean moons.  A recent W|A blog post also pointed out that you can calculate the next lunar eclipse.  So I tried to see when the next solar eclipse would be for my area, and it came up with a partial solar eclipse in 2014.  Skip that and go to the next, and it turns out there’s going to be a decent one in 2017.  As a reminder, I sent an email to myself via FutureMe.  It’ll be interesting to see a) if I’m still using Gmail in 8 years, b) if FutureMe is still around sending emails, and c) if we can still see the sun.  Man, I love W|A.

Total solar eclipse in 2017
