Posts Tagged ‘information retrieval’

A Twitter friend (@communicating) tipped me off to the UEA-Lite Stemmer by Marie-Claire Jenkins and Dan J. Smith.  Stemmers are NLP tools that strip inflectional and derivational affixes from words.  In English, that usually means getting rid of the plural -s, progressive -ing, and preterite -ed.  Depending on the type of stemmer, that might also mean getting rid of derivational suffixes like -ful and -ness.  Sometimes it’s useful to be able to reduce words like consolation and console to the same root form: consol.  But sometimes that doesn’t make sense.  If you’re searching for video game consoles, you don’t want to find documents about consolation.  In this case, you need a conservative stemmer.
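To make the conservative/aggressive distinction concrete, here is a minimal sketch in Python.  The suffix lists and both function names are made up for illustration; they are not the UEA-Lite rules, which are far more extensive.

```python
# Hypothetical inflectional rules, tried in order; the first match wins.
# These are illustrative only and are NOT the UEA-Lite rule set.
INFLECTIONAL_RULES = [
    ("ies", "y"),    # ponies  -> pony
    ("sses", "ss"),  # classes -> class
    ("ss", "ss"),    # class   -> class (protects against the plain -s rule)
    ("ing", ""),     # walking -> walk
    ("ed", ""),      # walked  -> walk
    ("s", ""),       # consoles -> console
]

# Hypothetical derivational suffixes that only the aggressive stemmer strips.
DERIVATIONAL_SUFFIXES = ["ation", "ness", "ful"]


def conservative_stem(word):
    """Strip a few inflectional suffixes and leave everything else alone."""
    # Leave acronyms and capitalized words (likely proper nouns) untouched.
    if word[:1].isupper():
        return word
    for suffix, replacement in INFLECTIONAL_RULES:
        # Require at least three characters to remain before the suffix.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def aggressive_stem(word):
    """Additionally strip one derivational suffix and a trailing -e."""
    stem = conservative_stem(word)
    for suffix in DERIVATIONAL_SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            stem = stem[: len(stem) - len(suffix)]
            break
    if stem.endswith("e") and len(stem) > 3:
        stem = stem[:-1]
    return stem
```

With these toy rules, the aggressive version maps both consoles and consolation to consol, while the conservative version keeps console and consolation apart, which is what you want for the video game search.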

The UEA-Lite Stemmer is a rule-based, conservative stemmer that handles regular words, proper nouns and acronyms.  It was originally written in Perl and has since been ported to Java.  Since I usually code in Ruby these days, I thought it’d be nice to make it available to the Ruby community, so I ported it over last night.

The code is open source under the Apache 2 License and hosted on GitHub.  So please check out the code and let me know what you think.  Heck, you can even fork the project and make some improvements yourself if you want.

One direction I’d like to be able to go is to turn all of the rules into finite state transducers, which can be composed into a single large deterministic finite state transducer.  That would be a lot more efficient (and even fun!), but Ruby lacks a decent FST implementation.


This week has given me two new toys to play with, and you could probably say both were bought at the dollar store.  The first was Microsoft’s release of Rebranded Live, aka Bing.  Bing’s search results have been poor (for me), but not much poorer than Google’s.  Just enough poorer for me to see no reason to really switch, which is very bad for Microsoft.  There are neat little features, like pop-up feed links for blog posts and previews.  I like it, but it’s not much.  Where they shine is in image search, which incorporates similar image search already (Google still has theirs in Labs).  Google Similar Images knocked my socks off at first, but then it just seemed like it should be renamed Google Identical Images.  Not much diversity.  Bing got this part right.  The images are similar, not identical.  There is a diverse collection and the navigation is great.  Kudos, Live Labs, for that one.  Is it perfect?  Nope, but it’s better than what I was using.

The next toy was Google Squared, which inspired this tweet right after I tried it:

Google Squared.  You had me at hello.


Further playing around with it convinced me that this would have been a nice tool to have when I was doing ridiculous term papers in high school.  Term papers about crap I didn’t care about.  Basically random stuff.  G^2 is great for that, but really not very helpful otherwise.  It was pretty awesome finding out the number of victims of 30 different serial killers all at once, though.  As quality improves (assuming it does), this could be pretty useful.  Quality has to get there though.  About 90% of the time spent using it is trial and error, trying to find something that works.  I was able to add some sorting algorithms to a square, but couldn’t find a single column to add that actually had something in it (that wasn’t absurd).  Wolfram|Alpha is still the winner in the knowledge engine department, methinks.

Some Google Squared Results



There has been much ballyhoo in the blogosphere touting Google’s so-called foray into semantic search.  The blog post announcing the new feature doesn’t even mention the word semantics, but it does say it looks at associations and concepts related to your query.  I see no mention of tuples or anything of the sort and the suggested queries are the kind of thing that I would expect to come out of a background closer to document/query classification than semantic analysis.


Related search results for much ado about nothing

And the results are pretty meh.  Except for taming of the shrew, those results are no-brainers.  That’s query completion quality results.  Of course you can’t judge the whole system by one isolated example.

When PC World and a host of other pop tech media zines started toasting the entrance of Google to the semantic arena, I was excited to see some cool stuff.  Imagine my disappointment when I was underwhelmed not only by the quality of the results, but also by the lack of novelty.  How long has that feature been there?  Seems like I’ve seen it for ages.  Maybe it got a technological face-lift (I guess that would be a face-lift on the inside), but it looks about the same as I remember it.  Plus, its placement at the bottom of the results page relegates it to search engine hell.

In summary:  boring.  My complaints are first and foremost with those elements of the blagoblag who over-hyped this.  Secondly, I am complaining to Google for not being better.  I am feeling demanding today.

Daniel’s post on it is worth reading.

Since I started blogging almost a year and a half ago, I have been following many blogs. I managed to find some blogs dealing with computational linguistics and natural language processing, but they were few and far between. Since then, I’ve discovered quite a few NLP people that have entered the blagoblag. Here is a non-exhaustive list of the many that I follow.

Many of these bloggers post sporadically and even then only post about CL/NLP occasionally. I’ve tried to organize the list into those who post exclusively on CL/NLP (at least as far as I have followed them) and those who post sporadically on CL/NLP. I would fall into the latter, since I frequently blog about my dogs, regular computer science-y and programming stuff, and other rants. P.S. I group Information Retrieval in with CL/NLP here, but only the blogs I actually read. I’m sure there’s a bazillion I don’t.

If I’ve missed one+, please let me know. I’m always on the lookout. Ditto if you think I’ve miscategorized someone.  I’ve excluded a few that haven’t posted in a while.

I just finished reading about relevance-based language models for information retrieval (Lavrenko and Croft, 2001).  It’s an old paper, but some new stuff I was checking into relied on something else which relied on it — you know how the story goes.

In information retrieval, there are many retrieval models that have been used over the years.  Word on the street is that Google uses the vector space model, where each document is represented as a vector.  Each word is its own dimension and the magnitude along that dimension is some weighting based on the number of times the word appears in that document.  A new query is converted into a vector in this space, and how well a document matches the query is measured by how close the two vectors are (cosine similarity is a common choice).  This glosses over a lot of details, but that’s the general idea.
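Here is a minimal sketch of that idea in Python.  It assumes raw term counts as the weighting and cosine similarity as the closeness measure; both are my choices for illustration, not a claim about what any real search engine does, and the function names are made up.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, document) pairs, best match first."""
    query_vector = tf_vector(query)
    scored = [(cosine_similarity(query_vector, tf_vector(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank("game console", docs)` on a small list of strings puts documents that share more query terms first.  Real systems would swap the raw counts for something like tf-idf and use an inverted index instead of scoring every document.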

Another technique is to use language modeling.  A language model is built for each document and then the distance between the language model for a query and the language model for each document is used to rank the most relevant documents.  Again, a multitude of details have been glossed over.  The language modeling approach does a great job, and seems to be more theoretically grounded than the vector space model.  However, the vector space model does really well and there are many optimizations that make it easy to compute for huge datasets.
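As a rough sketch of the language modeling approach, the Python below builds unigram models and ranks documents by KL divergence from the query model.  The unigram assumption, the linear interpolation with a collection model, and the weight `lam=0.5` are all assumptions I picked for illustration; the function names are hypothetical.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def smoothed_prob(word, doc_model, collection_model, lam=0.5):
    """Interpolate the document model with the collection model."""
    return lam * doc_model.get(word, 0.0) + (1 - lam) * collection_model.get(word, 0.0)


def kl_divergence(query_model, doc_model, collection_model, lam=0.5):
    """KL(query || smoothed document); smaller means a closer match."""
    total = 0.0
    for word, p_query in query_model.items():
        p_doc = smoothed_prob(word, doc_model, collection_model, lam)
        if p_doc == 0.0:
            # Word never occurs in the collection, so it cannot
            # discriminate between documents; skip it.
            continue
        total += p_query * math.log(p_query / p_doc)
    return total


def rank_by_kl(query, docs):
    """Return (divergence, document) pairs, closest language model first."""
    tokenized = [d.lower().split() for d in docs]
    collection_model = unigram_model([t for toks in tokenized for t in toks])
    query_model = unigram_model(query.lower().split())
    scored = [
        (kl_divergence(query_model, unigram_model(toks), collection_model), d)
        for toks, d in zip(tokenized, docs)
    ]
    return sorted(scored, key=lambda pair: pair[0])
```

The smoothing matters: without it, a document missing a single query word would get a probability of zero for that word and the divergence would blow up.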

One thing that retrieval models have tried to do is model the documents relevant to a query.  These are the documents you want to return when a person searches for something.  If you knew the exact set of these documents, your job would be done and information retrieval would be solved.  So, it’s not an easy task, and it is further complicated by the fact that not everybody agrees which are the relevant documents for a particular query.  In probabilistic retrieval models this was done mainly with clunky heuristics that weren’t theoretically sound.  What Lavrenko and Croft (2001) did was create a formal approach to estimating the relevance model without any training data.  Sounds sweet, right?

What it amounts to is something called pseudo-relevance feedback.  Relevance feedback is the case where results are refined for queries based on labeled training data.  We know some relevant documents for certain queries, so we can use that to improve results for new queries.  Pseudo-relevance feedback requires no labeled data, but instead finds a way to simulate having the relevant documents.  Lavrenko and Croft did this by approximating the probability that a word would appear in the set of relevant documents by calculating the probability that the word would co-occur with the query words.

The handy part is you don’t have to do any pesky parameter estimation.  We just have to compute a bunch of probabilities, do some smoothing, and then hold our collective breath.  Check out the paper for details. 
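For the curious, here is a small Python sketch of the co-occurrence estimate as I understand it: sum over documents the probability of the word times the probability of the query words, then normalize.  The uniform document prior, the interpolation smoothing, the weight `lam=0.5`, and the function name are my assumptions; the paper is the authority on the details.

```python
from collections import Counter, defaultdict


def relevance_model(query, docs, lam=0.5):
    """Estimate P(word | relevant) from co-occurrence with the query words.

    Assumes a non-empty collection, a uniform prior over documents, and
    document models smoothed by interpolation with the collection model.
    """
    tokenized = [d.lower().split() for d in docs]
    collection = Counter(t for toks in tokenized for t in toks)
    collection_total = sum(collection.values())
    query_terms = query.lower().split()

    def prob(word, counts, length):
        # Smoothed P(word | document).
        p_doc = counts[word] / length if length else 0.0
        p_collection = collection[word] / collection_total
        return lam * p_doc + (1 - lam) * p_collection

    joint = defaultdict(float)
    prior = 1.0 / len(docs)
    for toks in tokenized:
        counts = Counter(toks)
        length = len(toks)
        # Probability that this document's model generates the whole query.
        query_likelihood = 1.0
        for term in query_terms:
            query_likelihood *= prob(term, counts, length)
        # Accumulate the joint probability of each word and the query.
        for word in collection:
            joint[word] += prior * prob(word, counts, length) * query_likelihood

    # Normalize so the estimates sum to one over the vocabulary.
    total = sum(joint.values())
    return {word: value / total for word, value in joint.items()} if total else {}
```

Words that show up in documents where the query words are likely end up with high probability, even if they never appear in the query itself.  That is the simulated set of relevant documents at work.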


Lavrenko, V. and Croft, W. B. 2001. Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New Orleans, Louisiana, United States). SIGIR ’01. ACM, New York, NY, 120-127.

This is research I did a while ago and presented Monday to fulfill the requirements of my master’s degree.  The presentation only needed to be about 20 minutes, so it was a very short intro.  We have moved on since then, so when I say future work, I really mean future work.  The post is rather lengthy, so I have moved the main content below the jump.


If you follow news on the semantic web or new search engines, you may have heard of hakia. TechCrunch has done a small write-up about their new semantic search API. TechCrunch is brutally hard on startups who aren’t fully operational, so there is a lot of criticism in that article that I take with a grain of salt. I like seeing startups open their services with APIs and I think they deserve some benefit of the doubt. Maybe I’m looking at it the wrong way, though, and the fact that TechCrunch does make such a stink ensures the startup will correct the problem asap, rather than farting around for a while.