Posts Tagged ‘netflix prize’

Github just announced their own version of the Netflix Prize.  Instead of predicting movie ratings, Github wants you to suggest repositories for users to watch.  This is different from the Netflix Prize in a number of ways:

  1. a user watching a repo is similar to a user visiting a page from a search engine – they are implicit endorsements (we assume that doing so means the user actually likes the repo)
  2. we are predicting the likelihood of a user wanting to watch a repo (binary event), rather than how much a user likes a movie
  3. the data set is a lot smaller, and sparsity is a LOT greater (the matrix is 0.006% filled vs. Netflix 1% filled)
  4. you get multiple tries!  they let you pick 10 repos that user may watch and as long as one of them matches, you get credit for it

Already there have been many submissions.  The number one place is currently held by Daniel Haran with 46.9% guessed correctly.  Happy hunting, if you decide to compete.

The prizes are a bottle of Pappy van Winkle bourbon and a large Github account for life.  The bottle of Pappy is making me consider competing.

R.I.P. Movie Rating Prediction

R.I.P. Movie Rating Prediction

There is no longer any reason to bother researching new ways of predicting the ratings users will give to movies.  It’s time to move on to more interesting things.  But seriously, given the fact that the last few miles of the Netflix competition were hard-fought by combining hundreds of different algorithms, is there much value in trying to improve recommender systems in this way, anymore?

I expect that the Netflix Prize data set, if left open to the public, will still be useful for a number of machine learning tasks where the goal is not necessarily improving recommender systems.  So predicting movie ratings may never be really dead.  But it is my hope that that as a goal for research will diminish and the focus will start moving towards other aspects of recommender systems still greatly lacking.  Like building systems that facilitate discovery of new items.

Factoring in the temporal dimension was a big deal in the latter part of the competition.  Sometimes you’re just in the mood for something gloomy.  Or something funny.  Or something ridiculous.  The same movie may totally turn you off a week later.  No machine (biological or mechanical) can predict these swings of emotions in the near future, so why bother?  Flip that around and let’s find ways of improving the search for items matching our mood at the time.

A system that interactively elicits your mood and guides you to matching items would be incredibly useful, don’t you think?

Image representing Netflix as depicted in Crun...
Image via CrunchBase

It looks like some of the top players in the Netflix Prize competition have teamed up and finally broke the 10% improvement barrier.  I know I’m a few days late on this, though not because I didn’t see when it happened.  I’ve been battling an ear infection all week and it has left me dizzy, in pain, and with no energy when I get home from work.  I hesitated before even posting anything about this, since there is little I can add at this point that hasn’t already been said. I’ll just share a few thoughts and experiences for posterity and leave it at that.  I’m also going to eventually make the point that recommender systems are operating under a false assumption, if you read this all the way through. :)

I competed for the prize for a bit, trying out a few ideas with support vector machines and maximum margin matrix factorization [pdf] that never panned out.  We were getting about a 4% improvement over Cinematch, which put us way down the list.  Going further would mean investing a lot of effort into implementing other algorithms, working out the ensemble, etc., unless we came up with some novel algorithm that bridged the gap.  That didn’t seem likely, so I stopped working on it just after leaving school.  I learned a lot about machine learning, matrix factorization, and scaling thanks to the competition, so it was hardly a net loss for me.

The one thing I regret is that the prize encouraged me and my advisor to spend more effort on the competition than we should have, which in turn meant we didn’t spend more time working on something tangibly productive for research.  Bluntly put, I think if we hadn’t wasted so much time on the competition, we could have worked on a different research problem more likely to produce a paper.  The lack of published research on my CV was the main reason I didn’t move on to get my PhD at CMU (at least, that’s what I was told by those close to the decision).  Hindsight is 20/20, and at the time, the shining glory of winning a million bucks and fame was delicious.  It also seemed like we had ideas that “maybe kinda sorta” were going somewhere.  That turned out to not be the case, but when admissions committees look at research experience, negative results = no results.

Many people have lauded the competition by saying that it has encouraged research in collaborative filtering and brought public attention to the field.  I was one of those people.  Others have criticized it for not focusing more on what people actually care about when using recommender systems — getting something useful and having a good experience!  And yes, Daniel Lemire, I’m thinking of you. :)  But I’m convinced that Daniel is right.  I remember reading in the literature that a 10% improvement is about what’s needed for someone to actually be able to notice a difference in recommender systems.  So maybe people will notice a slight improvement in the Netflix recommendations if these ideas are ever implemented.  Which is another problem — most of the stuff that led to winning the prize is so computationally expensive, it’s not really feasible for production.  Netflix recently released some improvements, and I didn’t notice a damned thing.  They still recommended me Daft Punk’s Electroma, which was a mind-numbing screen-turd.  And I must have seen every good sci-fi movie ever made, because there are no more recommendations for me in that category.  I have trouble believing that.

The point of a recommender system really shouldn’t be just to guess what I might happen to rate something at a given time.  The fact that introducing time makes such a big difference in improving performance in the competition seems like a ginormous red flag to me.  Sure I can look back in time and say “on day X, people liked movies about killing terrorists.”  The qualifying set in the competition asked you to predict the rating for a movie by a user on a given date in the past.  Remember what I said about hindsight being 20/20?  How about you predict what I will rate a movie this coming weekend.  See the problem?

I will sound the HCIR trumpets and say that what recommender systems should really be looking at is improving exploration.  When I go looking for a movie to a watch, or a pair of shoes to buy, I already know what I like in general.  Let me pick a starting point and then show me useful ways of narrowing down my search to the cool thing I really want.  Clerk dogs is a good first step on this path, though I think we’re going to have to move away from curated knowledge before this is going to catch fire.

Maybe I have this all wrong.  Maybe we need to discard the notion of recommender systems, since they are operating under the wrong premise.  We don’t need a machine to recommend something it thinks we’ll like.  We need a machine that will help us discover something we’ll like.  We need to be making discovery engines.  (Replace recommender system with search engine in most of what I just said and you’ll find that I have really been sounding the HCIR trumpets.)

Reblog this post [with Zemanta]

Peter Turney posted recently on the logic of attributional and relational similarity. Attributes are features or characteristics of a single entity. Relations describe some connection between two entities, such as a comparison. We’ll denote a relation between two entities A and B as A:B. A relational similarity between two groups A, B and C,D will be denoted as A:B::C:D. This is the standard SAT-style proportional analogy: A is to B as C is to D. An attributional similarity indicates that two entities share the same attribute (this could be to varying degrees, but in the boolean case, it’s either shared or it isn’t). An attributional similarity between A and B will be denoted as A~B. This is like saying \forall Z, A:Z::B:Z. I’m just giving a brief introduction here, but this is all in Peter’s post to greater detail, so I recommend reading that for more information.

This got me thinking about collaborative filtering (because, well, I’ve been thinking about it all the time for the past two years). Collaborative filtering exploits similarities between users to predict preferences for items the user has not seen. In the case of movie recommendations, like with Netflix, this means that users can recommend movies they have seen to similar users who have not seen those movies. There are many ways of doing this. At the heart of it, however, is this notion of relational and attributional similarity.

A: base user
B: movies rated by A
C: some set of other users
D: movies rated by C

We can’t just say that A:B::C:D, since A and C may be nothing like each other. If we constrain it to users with attributional similarity, then we arrive at the definition of collaborative filtering: A~C & A:B::C:D. Logically, it follows that B~D also holds. See Peter’s post for some properties of proportional analogies that make this more clear.

In the non-binary case, we can choose C to be a set of users whose similarity varies with A. Also, our measure of what exactly constitutes similarity can be any number of different metrics. From here, it seems pretty clear that the limit of collaborative filtering is bounded by the attributional similarity A~C. If (A~C) & (A = C) (complete similarity) then it follows that B = D or else A \neq C. If A \neq C then does it logically follow that B \neq D? I guess it depends on the similarity metric and how we are defining the differences in the sets of movies and the differences in the sets of users.

I wonder if there has been any work done in this area? I wasn’t able to find anything, but maybe I’m just not searching for the right thing. Is it even worth pursuing?

The second workshop on large scale recommender systems will be at SIGKDD in Las Vegas this year.  One of the main topics is the Netflix competition and Jim Bennett of Netflix is one of the co-chairs, so there should be some interesting stuff in that area.  Plus all the other cool stuff going on with recommender systems (yes, there is other work being done on them!).  Another of the co-chairs is Daniel Lemire, whose blog never disappoints.

The paper deadline for the workshop is May 30th.

I’ve been messing around with recommender systems for the past year and a half, but not using the kNN (k-Nearest Neighbors) algorithm. However, my current homework assignment for my Information Retrieval class is to implement kNN for a subset of the Netflix Prize data. The data we are working with is about 800k ratings, which is slightly smaller than the MovieLens dataset, which was the previous dataset of choice for research on movie recommender systems. The entire Netflix data set dwarfs the MovieLens set by a factor of 100, so it is quickly replacing MovieLens in papers. The Netflix data is much sparser than the MovieLens, which changes things, as well.

kNN is a fairly simple machine learning algorithm to implement. On a dataset the size of Netflix, it’s still easy to do stupid things that cause it to take forever. Recommender systems typically match users to movies on a scale, which is the user’s rating for that item. In the case of Netflix, the scale is 1 (hate) to 5 (love). For the Netflix Prize, the goal is to guess user’s ratings on a hidden set as correctly as possible (according to root mean squared error (RMSE)). One way of approaching the problem is to create a user-items matrix where the rows correspond to a user, the columns to an item (movie) and the value in each cell is the user’s rating for that item. If the user has not rated the item, it is assigned a zero. Now, we can split this matrix up into vectors, where each row vector represents a user. kNN seeks to find similar users to other users (or similar movies to other movies) according to some metric over these vectors. I won’t bother going into the metrics in detail, but they include cosine similarity, Euclidean distance, and Pearson correlation. The ratings from the chosen k users are combined (either by simply averaging or using some weighted average) to form the prediction for a movie the user has not yet rated.

So on the test dataset for this assignment, I built three models that had the following RMSE scores:

Model 1 0.9831
Model 2 1.0371
Model 3 0.9768

Just guessing the average rating for each movie gives an RMSE of 1.0 in this case, so Models 1 and 3 improve over the baseline, while Model 2 does worse. The best performing teams in the Netflix prize use ensemble methods to combine various models. The simple way to do this is just with a linear combination. So given models {m1, m2, m3} and weights {w1, w2, w3}, the ensemble prediction would be w1m1 + w2m2 + w3m3 (where w1+w2+w3=1.0). This was the first time I had tried ensembles with recommender systems as well, so imagine my surprise when I hit 0.9469 RMSE with my best choice of w’s. Of course, this is nowhere near the number needed to actually claim the prize, but it was a nice demonstration of the power of ensemble methods. I recommend checking out the proceedings of last year’s KDD Cup if you’re interested.

I am a fan of good beer. In this post I am going to talk about my ideas for how to improve websites that offer ratings for different varieties of beer, and how I think recommender systems would improve their service.

Why I care

Whenever I’m asked what kind of beer I like, I experience a moment of awkwardness. Because I don’t just like good beer, I hate bad mediocre beer. Usually the person asking is a beer noob and I don’t want to sound too snobby by throwing beer jargon at them they probably won’t understand. So I say something along the lines of “I like the more expensive stuff, like from small breweries or imports.” The response is usually something about Sam Adams. I used to enjoy Sam Adams Boston Lager, but I can barely stomach it anymore. There are a couple Sam Adams brews that aren’t half bad, but the Boston Lager no longer cuts it for me. (more…)

Greg Linden and Daniel Lemire have both written a little about the Netflix Prize and whether the systems that are doing the best are really worth anything. The KorBell system that recently won the Progress Prize consists of 107 different parts in an ensemble system (Note: the team of Bob Bell and Yehuda Koren at AT&T goes by BellKor and KorBell on the Netflix leaderboard). The paper is interesting for two reasons: the ensemble method being used and the fact that only about 3 or 4 of those components are doing the heavy lifting. Actually, I have no idea whether the actual ensemble algorithm they use would be especially interesting to anyone else, but as I have no experience with ensembles in this context, it was interesting to me. (more…)