Posts Tagged ‘recommender systems’

Github just announced their own version of the Netflix Prize.  Instead of predicting movie ratings, Github wants you to suggest repositories for users to watch.  This is different from the Netflix Prize in a number of ways:

  1. a user watching a repo is similar to a user visiting a page from a search engine – both are implicit endorsements (we assume that doing so means the user actually likes the repo)
  2. we are predicting the likelihood of a user wanting to watch a repo (a binary event), rather than how much a user likes a movie
  3. the data set is a lot smaller, and sparsity is a LOT greater (the matrix is 0.006% filled vs. Netflix’s 1%)
  4. you get multiple tries!  they let you pick 10 repos that a user may watch, and as long as one of them matches, you get credit for it
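That scoring rule is essentially a hit rate over top-10 guesses. Here is a minimal sketch of how such an evaluation might be computed — the function name and data layout are my own illustrative assumptions, not Github’s actual scoring code:

```python
def score(predictions, actual):
    """Fraction of users for whom at least one of the (up to 10)
    guessed repos appears in the user's actual watched set."""
    hits = 0
    for user, guesses in predictions.items():
        # credit if any of the first 10 guesses matches a watched repo
        if set(guesses[:10]) & actual.get(user, set()):
            hits += 1
    return hits / len(predictions)
```

Under this reading, the leaderboard percentage is just this fraction over the held-out users.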

Already there have been many submissions.  The number one place is currently held by Daniel Haran with 46.9% guessed correctly.  Happy hunting, if you decide to compete.

The prizes are a bottle of Pappy van Winkle bourbon and a large Github account for life.  The bottle of Pappy is making me consider competing.

R.I.P. Movie Rating Prediction


There is no longer any reason to bother researching new ways of predicting the ratings users will give to movies.  It’s time to move on to more interesting things.  But seriously, given the fact that the last few miles of the Netflix competition were hard-fought by combining hundreds of different algorithms, is there much value in trying to improve recommender systems in this way, anymore?

I expect that the Netflix Prize data set, if left open to the public, will still be useful for a number of machine learning tasks where the goal is not necessarily improving recommender systems.  So predicting movie ratings may never be really dead.  But it is my hope that its role as a research goal will diminish and that the focus will move toward aspects of recommender systems that are still greatly lacking, like building systems that facilitate discovery of new items.

Factoring in the temporal dimension was a big deal in the latter part of the competition.  Sometimes you’re just in the mood for something gloomy.  Or something funny.  Or something ridiculous.  The same movie may totally turn you off a week later.  No machine (biological or mechanical) can predict these swings of emotions in the near future, so why bother?  Flip that around and let’s find ways of improving the search for items matching our mood at the time.

A system that interactively elicits your mood and guides you to matching items would be incredibly useful, don’t you think?


It looks like some of the top players in the Netflix Prize competition have teamed up and finally broken the 10% improvement barrier.  I know I’m a few days late on this, though not because I didn’t see when it happened.  I’ve been battling an ear infection all week, and it has left me dizzy, in pain, and with no energy when I get home from work.  I hesitated before even posting anything about this, since there is little I can add at this point that hasn’t already been said.  I’ll just share a few thoughts and experiences for posterity and leave it at that.  I’m also going to eventually make the point that recommender systems are operating under a false assumption, if you read this all the way through. :)

I competed for the prize for a bit, trying out a few ideas with support vector machines and maximum margin matrix factorization [pdf] that never panned out.  We were getting about a 4% improvement over Cinematch, which put us way down the list.  Going further would mean investing a lot of effort into implementing other algorithms, working out the ensemble, etc., unless we came up with some novel algorithm that bridged the gap.  That didn’t seem likely, so I stopped working on it just after leaving school.  I learned a lot about machine learning, matrix factorization, and scaling thanks to the competition, so it was hardly a net loss for me.

The one thing I regret is that the prize encouraged me and my advisor to spend more effort on the competition than we should have, which in turn meant we didn’t spend more time working on something tangibly productive for research.  Bluntly put, I think if we hadn’t wasted so much time on the competition, we could have worked on a different research problem more likely to produce a paper.  The lack of published research on my CV was the main reason I didn’t move on to get my PhD at CMU (at least, that’s what I was told by those close to the decision).  Hindsight is 20/20, and at the time, the shining glory of winning a million bucks and fame was delicious.  It also seemed like we had ideas that “maybe kinda sorta” were going somewhere.  That turned out to not be the case, but when admissions committees look at research experience, negative results = no results.

Many people have lauded the competition, saying it has encouraged research in collaborative filtering and brought public attention to the field.  I was one of those people.  Others have criticized it for not focusing more on what people actually care about when using recommender systems: getting something useful and having a good experience!  And yes, Daniel Lemire, I’m thinking of you. :)  But I’m convinced that Daniel is right.  I remember reading in the literature that a 10% improvement is about what’s needed for someone to actually notice a difference in recommender systems.  So maybe people will notice a slight improvement in Netflix’s recommendations if these ideas are ever implemented.  Which points to another problem: most of the stuff that led to winning the prize is so computationally expensive that it isn’t really feasible for production.  Netflix recently released some improvements, and I didn’t notice a damned thing.  They still recommended Daft Punk’s Electroma to me, which was a mind-numbing screen-turd.  And I must have seen every good sci-fi movie ever made, because there are no more recommendations for me in that category.  I have trouble believing that.

The point of a recommender system really shouldn’t be just to guess what I might happen to rate something at a given time.  The fact that introducing time makes such a big difference in improving performance in the competition seems like a ginormous red flag to me.  Sure I can look back in time and say “on day X, people liked movies about killing terrorists.”  The qualifying set in the competition asked you to predict the rating for a movie by a user on a given date in the past.  Remember what I said about hindsight being 20/20?  How about you predict what I will rate a movie this coming weekend.  See the problem?

I will sound the HCIR trumpets and say that what recommender systems should really be looking at is improving exploration.  When I go looking for a movie to watch, or a pair of shoes to buy, I already know what I like in general.  Let me pick a starting point and then show me useful ways of narrowing down my search to the cool thing I really want.  Clerk dogs is a good first step on this path, though I think we’re going to have to move away from curated knowledge before this is going to catch fire.

Maybe I have this all wrong.  Maybe we need to discard the notion of recommender systems, since they are operating under the wrong premise.  We don’t need a machine to recommend something it thinks we’ll like.  We need a machine that will help us discover something we’ll like.  We need to be making discovery engines.  (Replace recommender system with search engine in most of what I just said and you’ll find that I have really been sounding the HCIR trumpets.)


I happened on clerk dogs, a new movie recommender, the other day.  They are still in beta and are missing data in many key areas of film, but they are definitely worth checking out.  Like Pandora, clerk dogs uses human editors to classify movies along several dimensions.  Indeed, the founder, Stuart Skorman, calls it the movie genome project.  Of course, another movie recommender (also still in beta) is using that term.  Stuart goes on to say:

We have designed this innovative search engine for the movie buffs who have seen so many movies that they’re having a hard time finding new ones (or old ones) that they will really love. I hope you find hundreds of great movies!



This is a problem I’ve been noticing with Netflix lately.  I would be pretty sure I’ve seen every sci-fi movie worth seeing that has been released if all I had to go on was Netflix’s recommendations.  I gave clerk dogs a shot, starting with my favorite movie.  They seem to have done a decent job with classifying Brazil and a number of the similar movies they have listed are indeed similar in many ways to it.  When I first visited the site, they showed the similar movies on a grid and said whether it was “more dark”, “less disturbing”, “more violent”, and so on.  If that functionality still exists, I can’t find it.

However, you can “Mash it” to find movies that fit your mood.  Pick your base movie and mash it.  Then change the sliding scale to decide what sort of differences you are looking for.  Can you say kickass?

I applaud clerk dogs for a job well done.  I’ve already found a number of movies that Netflix was hiding from me.  I added them to my Netflix queue, though, so I guess Netflix is still benefiting.

This is research I did a while ago and presented Monday to fulfill the requirements of my Masters degree.  The presentation only needed to be about 20 minutes, so it was a very short intro.  We have moved on since then, so when I say future work, I really mean future work.  The post is rather lengthy, so I have moved the main content below the jump.


Peter Turney posted recently on the logic of attributional and relational similarity. Attributes are features or characteristics of a single entity. Relations describe some connection between two entities, such as a comparison. We’ll denote a relation between two entities A and B as A:B. A relational similarity between two groups A, B and C, D will be denoted as A:B::C:D. This is the standard SAT-style proportional analogy: A is to B as C is to D. An attributional similarity indicates that two entities share the same attribute (this could be to varying degrees, but in the boolean case, it’s either shared or it isn’t). An attributional similarity between A and B will be denoted as A~B. This is like saying \forall Z, A:Z::B:Z. I’m just giving a brief introduction here, but this is all covered in greater detail in Peter’s post, so I recommend reading it for more information.

This got me thinking about collaborative filtering (because, well, I’ve been thinking about it all the time for the past two years). Collaborative filtering exploits similarities between users to predict preferences for items the user has not seen. In the case of movie recommendations, like with Netflix, this means that users can recommend movies they have seen to similar users who have not seen those movies. There are many ways of doing this. At the heart of it, however, is this notion of relational and attributional similarity.

A: base user
B: movies rated by A
C: some set of other users
D: movies rated by C

We can’t just say that A:B::C:D, since A and C may be nothing like each other. If we constrain it to users with attributional similarity, then we arrive at the definition of collaborative filtering: A~C & A:B::C:D. Logically, it follows that B~D also holds. See Peter’s post for some properties of proportional analogies that make this more clear.
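To make that definition concrete, here is a minimal user-based collaborative filtering sketch: the cosine-similarity step plays the role of A~C, and the similarity-weighted average transfers ratings from D over to fill in B. The function name and the dense-matrix layout are illustrative assumptions of mine, not any particular system’s implementation:

```python
import numpy as np

def predict(ratings, user, item, k=5):
    """User-based CF sketch: predict ratings[user, item] from the k most
    similar users who rated the item.  `ratings` is a dense user x item
    array where 0 means unrated (fine for illustration, not for scale)."""
    sims = []
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        # cosine similarity over items both users rated -- the A~C part
        both = (ratings[user] > 0) & (ratings[other] > 0)
        if not both.any():
            continue
        a, b = ratings[user, both], ratings[other, both]
        sims.append((a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), other))
    top = sorted(sims, reverse=True)[:k]
    if not top:
        return 0.0
    # similarity-weighted average of the neighbours' ratings -- D filling in B
    return sum(s * ratings[o, item] for s, o in top) / sum(s for s, _ in top)
```

The choice of cosine similarity is arbitrary here; Pearson correlation or any other metric slots into the same A~C position.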

In the non-binary case, we can choose C to be a set of users whose similarity varies with A. Also, our measure of what exactly constitutes similarity can be any number of different metrics. From here, it seems pretty clear that the limit of collaborative filtering is bounded by the attributional similarity A~C. If (A~C) & (A = C) (complete similarity) then it follows that B = D or else A \neq C. If A \neq C then does it logically follow that B \neq D? I guess it depends on the similarity metric and how we are defining the differences in the sets of movies and the differences in the sets of users.

I wonder if there has been any work done in this area? I wasn’t able to find anything, but maybe I’m just not searching for the right thing. Is it even worth pursuing?

The standard way of doing human evaluations of machine translation (MT) quality for the past few years has been to have human judges grade each sentence of MT output against a reference translation on measures of adequacy and fluency.  Adequacy is the level at which the translation conveys the information contained in the original (source language) sentence.  Fluency is the level at which the translation conforms to the standards of the target language (in most cases, English).  The judges give each sentence a score for both in the range of 1-5, similar to a movie rating.   It became apparent early on that not even humans correlate well with each other.  One judge may be sparing with the number of 5’s he gives out, while another may give them freely.  The same problem crops up in recommender systems, which I have talked about in the past.

It matters how well judges can score MT output, because that is the evaluation standard by which automatic metrics for MT evaluation are judged.  The better an MT metric correlates with how human judges would rate sentences, the better.  This not only helps properly gauge the quality of one MT system over another, it drives improvements in MT systems.  If judges don’t correlate well with each other, how can we expect automatic methods to correlate well with them?  The standard practice now is to normalize the judges’ scores in order to help remove some of the bias in the way each judge uses the rating scale.
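Per-judge normalization is typically a z-score: subtract each judge’s mean and divide by their standard deviation, so a stingy judge’s 4 and a generous judge’s 5 land on comparable scales. A small sketch of the idea (my own illustration, not any evaluation campaign’s actual script):

```python
import statistics

def normalize(scores_by_judge):
    """Z-score each judge's ratings: subtract that judge's mean and
    divide by that judge's (population) standard deviation."""
    out = {}
    for judge, scores in scores_by_judge.items():
        mu = statistics.mean(scores)
        # guard against a judge who always gives the same score
        sd = statistics.pstdev(scores) or 1.0
        out[judge] = [(s - mu) / sd for s in scores]
    return out
```

After this transform, each judge’s scores are centered at zero, which removes the per-judge offset but of course cannot fix genuine disagreement about which sentences are better.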

Vilar et al. (2007) propose a new way of handling human assessments of MT quality:  binary system comparisons.  Instead of giving a rating on a scale of 1-5, they propose that judges compare the output from two MT systems and simply state which is better.  The definition of what constitutes “better” is left vague, but judges are instructed not to specifically look for adequacy or fluency.  By mixing up the sentences so that one judge is not judging the output of the same system (which could introduce additional bias), this method should simplify the task of evaluating MT quality while leading to better intercoder agreement.

The results were favorable, and the advantages of this method seem to outweigh the fact that it requires more comparisons than the previous method required ratings.  The total number of ratings for the previous method was two per sentence:  O(n), where n is the number of systems (the number of sentences is constant).  Binary system comparison requires more ratings because the systems have to be ordered:  O(log n!).  In most MT comparison campaigns the difference is negligible, but it becomes increasingly pronounced as n increases.
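Those counts are easy to check directly: the old method needs a fixed number of ratings per system, while fully ordering n systems by pairwise comparison needs on the order of log2(n!) comparisons. A quick back-of-the-envelope sketch (assuming, as my own simplification, two ratings per system for the old method):

```python
import math

def ratings_needed(n):
    # old method: adequacy + fluency, i.e. two ratings per system
    return 2 * n

def comparisons_needed(n):
    # comparisons to fully order n systems: ceil(log2(n!))
    return math.ceil(math.log2(math.factorial(n)))

for n in (5, 10, 50):
    print(n, ratings_needed(n), comparisons_needed(n))
```

For 5 systems the comparison count is actually smaller (7 vs. 10 ratings); by 50 systems it has more than doubled (215 vs. 100), which is the crossover the paragraph above alludes to.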

What would be interesting to me is a movie recommendation system that asks you a similar question:  which do you like better?  Of course, this means more work for you.  The standard approaches for collaborative filtering would have to change.  For example, doing singular value decomposition on a matrix of ratings would no longer be possible when all you have are comparisons between movies.  Also, people will still disagree with themselves (in theory).  You might say National Treasure was better than Star Trek VI, which was better than Indiana Jones and the Last Crusade, which was better than National Treasure.  You’d have to find some way to deal with cycles like this (ignoring them is one way).
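Spotting such inconsistencies amounts to cycle detection on a directed graph of “X beat Y” edges. A small sketch of one way to do it, with a depth-first search (the function and data shapes are hypothetical choices of mine):

```python
def has_cycle(prefs):
    """prefs: iterable of (winner, loser) pairs.  Returns True if the
    'better than' relation they induce contains a cycle."""
    graph = {}
    for winner, loser in prefs:
        graph.setdefault(winner, set()).add(loser)
        graph.setdefault(loser, set())
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current DFS path / done
    color = {node: WHITE for node in graph}

    def dfs(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge to the current path: cycle
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and dfs(node) for node in graph)
```

What to do once a cycle is found (drop an edge, keep majority votes, ignore it) is the harder modeling question; this only tells you the problem exists.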


Vilar, D., G. Leusch, H. Ney, and R. E. Banchs. 2007. Human Evaluation of Machine Translation Through Binary System Comparisons. In Proceedings of the Second Workshop on Statistical Machine Translation. 96-103. [pdf]