Posts Tagged ‘machine learning’

A while back I ported David Blei's lda-c code for performing Latent Dirichlet Allocation to Ruby.  Basically I just wrapped the C methods in a Ruby class, turned it into a gem, and called it a day.  The result was a bit ugly and unwieldy, like most research code.  A few months later, Todd Fisher came along and discovered a couple of bugs and memory leaks in the C code, for which I am very grateful.  I had been toying with the idea of improving the Ruby code, and embarked on a mission to do so.  The result is, I hope, a much cleaner gem that can be used right out of the box with little screwing around.

Unfortunately, I did something I'm ashamed of.  Ruby gems are notorious for breaking backwards compatibility, and I have done just that.  The good news is that your code will almost work, assuming you didn't dive too heavily into the Document and Corpus classes.  If you did, you will probably see a lot of breakage.  I hope the result is a more sensible implementation, though, so maybe you won't hate me.  Of course, I could be wrong and my implementation could still be crap.  If that's the case, please let me know what needs to be improved.

To install the gem:

gem sources -a http://gems.github.com
sudo gem install ealdent-lda-ruby

Enjoy!


R.I.P. Movie Rating Prediction


There is no longer any reason to bother researching new ways of predicting the ratings users will give to movies.  It's time to move on to more interesting things.  But seriously: given that the last few miles of the Netflix competition were hard-fought by combining hundreds of different algorithms, is there much value left in trying to improve recommender systems this way?

I expect that the Netflix Prize data set, if left open to the public, will still be useful for a number of machine learning tasks where the goal is not necessarily improving recommender systems.  So predicting movie ratings may never really be dead.  But I hope it will diminish as a research goal, and that the focus will start moving toward other aspects of recommender systems that are still greatly lacking, like building systems that facilitate discovery of new items.

Factoring in the temporal dimension was a big deal in the latter part of the competition.  Sometimes you're just in the mood for something gloomy.  Or something funny.  Or something ridiculous.  The same movie may totally turn you off a week later.  No machine (biological or mechanical) can predict these swings of emotion ahead of time, so why bother?  Flip that around and let's find ways of improving the search for items that match our mood at the time.

A system that interactively elicits your mood and guides you to matching items would be incredibly useful, don’t you think?


It looks like some of the top players in the Netflix Prize competition have teamed up and finally broken the 10% improvement barrier.  I know I'm a few days late on this, though not because I didn't see it happen.  I've been battling an ear infection all week and it has left me dizzy, in pain, and with no energy when I get home from work.  I hesitated before even posting anything about this, since there is little I can add at this point that hasn't already been said.  I'll just share a few thoughts and experiences for posterity and leave it at that.  I'm also going to eventually make the point that recommender systems are operating under a false assumption, if you read this all the way through. :)

I competed for the prize for a bit, trying out a few ideas with support vector machines and maximum margin matrix factorization [pdf] that never panned out.  We were getting about a 4% improvement over Cinematch, which put us way down the list.  Going further would have meant investing a lot of effort into implementing other algorithms, working out the ensemble, and so on, unless we came up with some novel algorithm that bridged the gap.  That didn't seem likely, so I stopped working on it shortly after leaving school.  I learned a lot about machine learning, matrix factorization, and scaling thanks to the competition, so it was hardly a net loss for me.
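For readers who haven't seen it, the basic factorization idea is easy to sketch.  Below is a minimal, illustrative Ruby version of plain matrix factorization trained with stochastic gradient descent; it is not the maximum margin matrix factorization code we actually used, and the class name and hyperparameters are made up for the example.

# Minimal sketch of rating prediction via matrix factorization with SGD.
# Illustrative only; names and hyperparameters are invented for this example.
class MatrixFactorizer
  def initialize(num_users, num_items, num_factors: 20, lrate: 0.005, reg: 0.02)
    @lrate = lrate
    @reg = reg
    # small random starting factors for each user and item
    @user_factors = Array.new(num_users) { Array.new(num_factors) { rand * 0.1 } }
    @item_factors = Array.new(num_items) { Array.new(num_factors) { rand * 0.1 } }
  end

  # predicted rating is the dot product of the user and item factor vectors
  def predict(user, item)
    @user_factors[user].zip(@item_factors[item]).map { |u, v| u * v }.sum
  end

  # ratings is an array of [user_index, item_index, rating] triples
  def train(ratings, epochs: 30)
    epochs.times do
      ratings.each do |user, item, rating|
        err = rating - predict(user, item)
        @user_factors[user].each_index do |f|
          u = @user_factors[user][f]
          v = @item_factors[item][f]
          # gradient step on squared error with L2 regularization on both factors
          @user_factors[user][f] += @lrate * (err * v - @reg * u)
          @item_factors[item][f] += @lrate * (err * u - @reg * v)
        end
      end
    end
  end
end

Usage is just mf = MatrixFactorizer.new(num_users, num_movies); mf.train(triples); mf.predict(user, movie).  As mentioned above, the winning teams blended hundreds of variations on ideas like this with other models to squeeze out the last fractions of a percent.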

The one thing I regret is that the prize encouraged me and my advisor to spend more effort on the competition than we should have, which in turn meant we didn't spend that time working on something tangibly productive for research.  Bluntly put, I think if we hadn't sunk so much time into the competition, we could have worked on a different research problem more likely to produce a paper.  The lack of published research on my CV was the main reason I didn't move on to get my PhD at CMU (at least, that's what I was told by those close to the decision).  Hindsight is 20/20, and at the time, the shining glory of winning a million bucks and fame was delicious.  It also seemed like we had ideas that "maybe kinda sorta" were going somewhere.  That turned out not to be the case, but when admissions committees look at research experience, negative results = no results.

Many people have lauded the competition for encouraging research in collaborative filtering and bringing public attention to the field.  I was one of those people.  Others have criticized it for not focusing more on what people actually care about when using recommender systems — getting something useful and having a good experience!  And yes, Daniel Lemire, I'm thinking of you. :)  But I'm now convinced that Daniel is right.  I remember reading in the literature that a roughly 10% improvement is about what it takes before users can actually notice a difference in a recommender system.  So maybe people will notice a slight improvement in the Netflix recommendations if these ideas are ever implemented.  Which is another problem — most of the techniques that led to winning the prize are so computationally expensive that they aren't really feasible in production.  Netflix recently released some improvements, and I didn't notice a damned thing.  They still recommended Daft Punk's Electroma to me, which was a mind-numbing screen-turd.  And I must have seen every good sci-fi movie ever made, because there are no more recommendations for me in that category.  I have trouble believing that.

The point of a recommender system really shouldn't be just to guess what I might happen to rate something at a given time.  The fact that introducing time makes such a big difference in the competition seems like a ginormous red flag to me.  Sure, I can look back in time and say "on day X, people liked movies about killing terrorists."  The qualifying set in the competition asked you to predict the rating a user gave a movie on a given date in the past.  Remember what I said about hindsight being 20/20?  How about you predict what I will rate a movie this coming weekend.  See the problem?

I will sound the HCIR trumpets and say that what recommender systems should really be looking at is improving exploration.  When I go looking for a movie to watch, or a pair of shoes to buy, I already know what I like in general.  Let me pick a starting point and then show me useful ways of narrowing down my search to the cool thing I really want.  Clerk dogs is a good first step on this path, though I think we're going to have to move away from curated knowledge before this is going to catch fire.

Maybe I have this all wrong.  Maybe we need to discard the notion of recommender systems, since they are operating under the wrong premise.  We don’t need a machine to recommend something it thinks we’ll like.  We need a machine that will help us discover something we’ll like.  We need to be making discovery engines.  (Replace recommender system with search engine in most of what I just said and you’ll find that I have really been sounding the HCIR trumpets.)


Since Ruby is my new favorite toy, I thought it would be fun to try my hand at C extensions.  I came across David Blei's C code for Latent Dirichlet Allocation and it looked simple enough to convert into a Ruby module.  Ruby makes it very easy to wrap a few C functions (which is good to know if you need a really fast implementation of something that gets called a lot).  Wrapping a whole C library is slightly harder, but not horribly so.  Probably most of my challenge was the fact that it's been so long since I wrote anything in C.

Since the code is open source, I decided to release the Ruby wrapper as a gem on GitHub.  I chose GitHub over RubyForge, because it uses Git and freakin’ rocks.  But GitHub is a story for another day.  Feel free to contribute and extend the project if you’re so inclined.

A basic usage example:

require 'lda'
# create an Lda object for training
lda = Lda::Lda.new
corpus = Lda::Corpus.new("data/data_file.dat")
lda.corpus = corpus
# run EM algorithm using random starting points
lda.em("random")
lda.load_vocabulary("data/vocab.txt")
# print the top 20 words per topic
lda.print_topics(20)

You can also download the gem from GitHub directly:

gem sources -a http://gems.github.com
sudo gem install ealdent-lda-ruby

You only need the first line if you haven’t added GitHub to your sources before.

Latent Dirichlet Allocation (LDA) is an unsupervised method for finding topics in a collection of documents.  It posits a set of possible topics, and for each document a mixture over a subset of them is selected.  That mixture represents the topics discussed in the document, and each word in the document is generated from it.  As a quick example, consider a short document that mixes the topics geology and astronomy:

The rover traveled many millions of miles through space to arrive at Mars. Once there, it collected soil samples and examined them to determine if liquid water had ever been present on the surface.

In this example, words like "rover," "space," and "Mars" would come from the astronomy topic, while words like "soil," "samples," and "surface" would come from geology.  LDA finds these latent topics in an unsupervised fashion using the EM algorithm.  EM is a two-step process for estimating the parameters of a statistical model.  The nice thing about it is that it's guaranteed to converge to a local maximum (not necessarily the global one!).  However, it can take a while to converge, depending on the size and nature of the data and model.  While I was in school, EM was one of the most confusing concepts, and I'm still not 100% on it, but it makes a lot more sense now than it did before.
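Written out, the generative story LDA assumes for a document of N words looks like this (following Blei et al., 2003, with Dirichlet prior α and per-topic word distributions β):

\theta \sim \mathrm{Dirichlet}(\alpha)                 % per-document topic mixture
z_n \mid \theta \sim \mathrm{Multinomial}(\theta)      % topic assignment for word n
w_n \mid z_n \sim \mathrm{Multinomial}(\beta_{z_n})    % word n drawn from its topic's word distribution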

In the context of LDA, EM is basically doing two things.  First, we make a guess about how the topics are distributed.  Next, we look at the actual words and re-estimate the model's probabilities based on those hypothesized topics.  Eventually we converge to a locally "best" set of topics.  These may not correspond to realistic topics, but they locally maximize the likelihood of the model.  Usually LDA does a pretty good job of finding explainable topics given a decent amount of data.
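In the variational version Blei et al. describe, those two steps correspond to an E-step that fits per-document variational parameters φ and γ, and an M-step that re-estimates the topic-word distributions β.  Roughly (Ψ is the digamma function; this is my paraphrase of the paper's updates, so check the original for the details):

% E-step, iterated per document until the variational parameters converge
\phi_{ni} \propto \beta_{i,\,w_n} \exp\!\big( \Psi(\gamma_i) - \Psi(\textstyle\sum_j \gamma_j) \big)
\gamma_i = \alpha_i + \textstyle\sum_{n=1}^{N} \phi_{ni}
% M-step, re-estimating each topic's word distribution over all M documents
\beta_{ij} \propto \textstyle\sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^{j}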

For more details about LDA, check out the paper by Blei et al. (2003).  LDA has been extended in a number of different directions since the original paper, so it's essential reading if you're doing any sort of topic modeling [citation needed].

References

D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," The Journal of Machine Learning Research, vol. 3, 2003, pp. 993-1022. [pdf]

In my previous post about sentiment polarity, I talked about results from Pang et al. (2002).  One of the conclusions in that paper was that the presence of sentiment words led to better classification results than the frequency of those words.  In the experiment in that post, I used tf-idf, a frequency-based measure.  A few days ago, when I woke up way too early, I ran some additional experiments using presence (binary) weights.  The result was a slight improvement over tf-idf:  86.1% versus 85.7%.  If we ignore document frequency and just use raw term frequency, the results were terrible:  about 76%.  So presence is much better than term frequency, but only slightly better than tf-idf.

Or is it?  Further experiments with tf-idf produced an accuracy of 86.8%.  All of this is based on 10-fold cross validation using the Pang and Lee (2004) data set, just so we're clear.  This seems to contradict their results.  Of course, I wasn't able to reproduce their results exactly, even though I am using the folds exactly as they described.  This may be due to a pre-processing step I am skipping (or doing extra).  They mention length-normalizing the vectors, which I don't usually bother with.  It's an oft-suggested thing to do with SVMs, but I have yet to see it actually help.

So I tried normalizing.  It hurt results for tf-idf, dropping it to 86.6%.  It made no difference for presence, which stayed at 86.1%.  No surprises there.
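To make the comparison concrete, here is a sketch of the three weighting schemes plus the length-normalization step, written as illustrative Ruby.  This is not my actual experiment code; the method and variable names are made up for the example.

# Sketch of the weighting schemes compared above.
# counts:   { term => raw count within one document }
# doc_freq: { term => number of documents containing the term }
def weight_features(counts, doc_freq, num_docs, scheme)
  counts.each_with_object({}) do |(term, tf), weights|
    weights[term] =
      case scheme
      when :presence then 1.0                                          # binary: the word occurs
      when :tf       then tf.to_f                                      # raw term frequency
      when :tfidf    then tf * Math.log(num_docs.to_f / doc_freq[term])
      end
  end
end

# Length normalization: scale a feature vector to unit Euclidean length.
def length_normalize(weights)
  norm = Math.sqrt(weights.values.map { |w| w * w }.sum)
  norm.zero? ? weights : weights.transform_values { |w| w / norm }
end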

My results contradict Pang et al. (2002) in that tf-idf (frequency-based) outperforms presence.  If I made a mistake, where was it?  I wish their source code were available.  I guess I could always ask.  There is usually some voodoo involved that isn't obvious (to me) from the paper.  This is a-whole-nother topic, one discussed with far more eloquence (pdf warning) by Ted Pedersen in the latest issue of Computational Linguistics.

References

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. “Thumbs Up? Sentiment Classification Using Machine Learning Techniques.” In Proceedings of the ACL 02 conference on Empirical Methods in Natural Language Processing – Volume 10, July 2002. [pdf]

Bo Pang and Lillian Lee. “A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts.” In Proceedings of the ACL, 2004. [pdf]

I've begun learning Ruby for my new job, a language that doesn't seem to have gotten much traction in the NLP community (at least not that I've heard).  I had been using Python for my NLP stuff (homework and projects) and Java for my recommender system stuff.  In retrospect, I could have used Python for the recommender stuff too, but I wasn't aware of some speed-ups, so I resorted to Java.  Of course, the recommender stuff isn't strictly NLP.  Ruby is just as well suited as Python and seems a lot better than Java for many tasks (though Java certainly has its place).  At the very least, a scripting language like Ruby or Python is great for prototyping.  It's easy to test new ideas quickly.

I was reading through Pang et al. (2002), which deals with classifying movie reviews as positive or negative.  They look at three machine learning approaches:  Naive Bayes, a maximum entropy classifier, and support vector machines.  This seemed like a good opportunity to try out my nascent Ruby skills, since it's the kind of crap I can roll together in Python in short order (and do all the time).  So I downloaded the data for the paper (actually the later data set from the 2004 paper).  There are 1000 positive and 1000 negative movie reviews.  The task is to train a classifier to determine whether a review expresses a positive opinion (the author liked the movie) or a negative opinion (the author did not like the movie).  I chose to use just SVMs, since they do best for this task according to the paper, they do really well for text categorization in general, and they are easy to download and use.

The results were quite nice.  Ruby turned out to be just as handy as Python at manipulating text and handling cross validation:  the two main "challenges" in implementing this paper.  I used tf-idf to weight the features and thresholded document frequency to discard words that didn't appear in at least three reviews.  The result was about 85.7% accuracy using the same cross validation setup described in their follow-up work (Pang and Lee, 2004).  In other words, the classifier could correctly guess whether a review was positive or negative nearly 86% of the time.
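For anyone curious what that pipeline looks like, here is a rough Ruby sketch of the feature-building side: the document-frequency threshold, tf-idf weighting, and splitting into ten folds.  It is illustrative only, not my actual scripts; the paths and fold assignment are placeholders, and the classifier call is left to an external SVM package.

# Sketch of the feature pipeline: df-thresholded vocabulary, tf-idf weights,
# and ten folds for cross validation.
docs = Dir.glob('review_polarity/txt_sentoken/{pos,neg}/*.txt').map do |path|
  { label:  path.include?('/pos/') ? 1 : -1,
    tokens: File.read(path).downcase.scan(/[a-z']+/) }
end

# count how many reviews each word appears in
doc_freq = Hash.new(0)
docs.each { |d| d[:tokens].uniq.each { |t| doc_freq[t] += 1 } }

# build tf-idf vectors over words appearing in at least three reviews
num_docs = docs.size.to_f
docs.each do |d|
  tf = Hash.new(0)
  d[:tokens].each { |t| tf[t] += 1 if doc_freq[t] >= 3 }
  d[:features] = tf.map { |t, f| [t, f * Math.log(num_docs / doc_freq[t])] }.to_h
end

# the real folds follow the cv numbering in the filenames; plain slicing stands in here
docs.each_slice((docs.size / 10.0).ceil).to_a.each do |test_fold|
  train_fold = docs - test_fold
  # train an SVM on train_fold's features and labels, evaluate accuracy on test_fold
end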

Pang et al (2002) discussed some of their errors and hypothesized that discourse analysis might improve results, since reviewers often use sarcasm.  There’s also the case where authors use a “thwarted expectations” narrative.  This offered me one of the few chuckles I’ve ever had while reading a research paper:

“I hate the Spice Girls. … [3 things the author hates about them] …  Why I saw this movie is a really, really, really long story, but I did and one would think I’d despise every minute of it.  But… Okay, I’m really ashamed of it, but I enjoyed it.  I mean, I admit it’s a really awful movie …the ninth floor of hell… The plot is such a mess that it’s terrible.  But I loved it.”

References

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.  “Thumbs Up?  Sentiment Classification Using Machine Learning Techniques.”  In Proceedings of the ACL 02 conference on Empirical Methods in Natural Language Processing – Volume 10, July 2002. [pdf]

Bo Pang and Lillian Lee.  “A Sentimental Education:  Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts.”  In Proceedings of the ACL, 2004. [pdf]