Posts Tagged ‘research’

It looks like some of the top players in the Netflix Prize competition have teamed up and finally broken the 10% improvement barrier.  I know I’m a few days late on this, though not because I didn’t see when it happened.  I’ve been battling an ear infection all week and it has left me dizzy, in pain, and with no energy when I get home from work.  I hesitated before even posting anything about this, since there is little I can add at this point that hasn’t already been said. I’ll just share a few thoughts and experiences for posterity and leave it at that.  I’m also going to eventually make the point that recommender systems are operating under a false assumption, if you read this all the way through. :)

I competed for the prize for a bit, trying out a few ideas with support vector machines and maximum margin matrix factorization [pdf] that never panned out.  We were getting about a 4% improvement over Cinematch, which put us way down the list.  Going further would mean investing a lot of effort into implementing other algorithms, working out the ensemble, etc., unless we came up with some novel algorithm that bridged the gap.  That didn’t seem likely, so I stopped working on it just after leaving school.  I learned a lot about machine learning, matrix factorization, and scaling thanks to the competition, so it was hardly a net loss for me.

The one thing I regret is that the prize encouraged me and my advisor to spend more effort on the competition than we should have, which in turn meant we didn’t spend more time working on something tangibly productive for research.  Bluntly put, I think if we hadn’t wasted so much time on the competition, we could have worked on a different research problem more likely to produce a paper.  The lack of published research on my CV was the main reason I didn’t move on to get my PhD at CMU (at least, that’s what I was told by those close to the decision).  Hindsight is 20/20, and at the time, the shining glory of winning a million bucks and fame was delicious.  It also seemed like we had ideas that “maybe kinda sorta” were going somewhere.  That turned out to not be the case, but when admissions committees look at research experience, negative results = no results.

Many people have lauded the competition by saying that it has encouraged research in collaborative filtering and brought public attention to the field.  I was one of those people.  Others have criticized it for not focusing more on what people actually care about when using recommender systems — getting something useful and having a good experience!  And yes, Daniel Lemire, I’m thinking of you. :)  But I’m convinced that Daniel is right.  I remember reading in the literature that a 10% improvement is about what’s needed for someone to actually be able to notice a difference in recommender systems.  So maybe people will notice a slight improvement in the Netflix recommendations if these ideas are ever implemented.  Which is another problem — most of the stuff that led to winning the prize is so computationally expensive, it’s not really feasible for production.  Netflix recently released some improvements, and I didn’t notice a damned thing.  They still recommended me Daft Punk’s Electroma, which was a mind-numbing screen-turd.  And I must have seen every good sci-fi movie ever made, because there are no more recommendations for me in that category.  I have trouble believing that.

The point of a recommender system really shouldn’t be just to guess what I might happen to rate something at a given time.  The fact that introducing time makes such a big difference in improving performance in the competition seems like a ginormous red flag to me.  Sure I can look back in time and say “on day X, people liked movies about killing terrorists.”  The qualifying set in the competition asked you to predict the rating for a movie by a user on a given date in the past.  Remember what I said about hindsight being 20/20?  How about you predict what I will rate a movie this coming weekend.  See the problem?

I will sound the HCIR trumpets and say that what recommender systems should really be looking at is improving exploration.  When I go looking for a movie to watch, or a pair of shoes to buy, I already know what I like in general.  Let me pick a starting point and then show me useful ways of narrowing down my search to the cool thing I really want.  Clerk dogs is a good first step on this path, though I think we’re going to have to move away from curated knowledge before this is going to catch fire.

Maybe I have this all wrong.  Maybe we need to discard the notion of recommender systems, since they are operating under the wrong premise.  We don’t need a machine to recommend something it thinks we’ll like.  We need a machine that will help us discover something we’ll like.  We need to be making discovery engines.  (Replace recommender system with search engine in most of what I just said and you’ll find that I have really been sounding the HCIR trumpets.)

The papers are out for WWW2009 (and have been for a bit), but I’ve only just gotten a chance to start looking at them. First of all, kudos to the ePrints people for improving the presentation of conference proceedings. This is a lot easier than having to do a Google Scholar search for each paper and hoping I find something, like I have to do with some conferences.

WWW2009 Madrid

There are a lot of very interesting ones, and here are a few that bubbled to the top of my reading list:

Data Mining Track

Semantic/Data Web

Social Networks and Web 2.0

Luis von Ahn has an insightful post lamenting the fact that we are holding onto a paper-world philosophy of academic publishing in a digital age. He kicks out the fledgling idea that a “wiki, karma, and a voting method like reddit” hybrid might supplant our current method. I’m always a little confused by the reluctance to change publishing models in academia. Granted, I have never struggled to get tenure at a university, nor is it remotely likely that that will ever be something I do. But still, computer scientists, of all people, should be willing to change and adopt a more sensible model. It turns out we’re just people after all.

What might a wikarmeddit version of academic publishing look like? A good place to start might be Stack Overflow. They are a self-proclaimed combination of wiki, blog, reddit, forum, and have karma. Perfect, right?

The benefits of peer review by the herd are great, but not without pitfalls. First of all, you can be herd-reviewed by morons. Moron 1 might think everything Researcher A publishes is GOLD and gives the thumbs-up no matter how badly the research was done. Ditto on the flipside, with Moron 2 hating everything Researcher A does. I’m not really being fair. The number of real morons who bother with this sort of thing is probably low, but the number of non-experts is a different matter.

On the other hand, open sourcing the research results like this allows all sorts of insights that you wouldn’t see from peer review. First of all, has a reviewer ever tried implementing an algorithm described in a paper? If you are a reviewer who has — I salute you. I doubt it’s very common. But when I come across a paper that is interesting for a problem I’m working on, I do try to implement it. If it gives me fits, I either abandon the method or try to contact the authors. This is simplified in a StackOverflow academic review setting, where the herd is giving this sort of feedback to the authors as a part of the review process. You can see how this level of communication would be beneficial. Inane non-expert commenters will either be filtered out (if they are truly inane) or they will shed light on confusing parts of your research presentation, allowing dissemination of your research to an even wider audience. This last thing is often given lip-service by the scientific community, but rarely have I seen actual attempts to do so.

So the next question is do we reinvent the wheel? Stack Overflow already has a community of smart people in place. Why don’t we just start using it?  Maybe SO could include some functionality for more research oriented questions.  All research can be viewed as a set of questions.  Is this a good way of attacking this problem?  Is there a better way of doing it?  Is the methodology sound?  Isn’t my method the shiz?

Note: I’m fairly certain I’m not the first person making this call. I’m pretty sure I heard someone else recently make this point (maybe it was John Cook?) but I can’t find the reference.  Please comment.

Since I started blogging almost a year and a half ago, I have been following many blogs. I managed to find some blogs dealing with computational linguistics and natural language processing, but they were few and far between. Since then, I’ve discovered quite a few NLP people that have entered the blagoblag. Here is a non-exhaustive list of the many that I follow.

Many of these bloggers post sporadically and even then only post about CL/NLP occasionally. I’ve tried to organize the list into those who post exclusively on CL/NLP (at least as far as I have followed them) and those who post sporadically on CL/NLP. I would fall into the latter, since I frequently blog about my dogs, regular computer science-y and programming stuff, and other rants. P.S. I group Information Retrieval in with CL/NLP here, but only the blogs I actually read. I’m sure there’s a bazillion I don’t.

If I’ve missed one+, please let me know. I’m always on the lookout. Ditto if you think I’ve miscategorized someone.  I’ve excluded a few that haven’t posted in a while.

Git is a version control system that has been gaining in popularity recently.  If you have heard of or used Subversion or CVS, you are familiar with the basic principle of keeping track of changes by multiple users in a series of documents (source code, text files, etc).  One of the chief benefits of version control in software is that you can roll back in case the code has become corrupted.  It’s easy to see which changes were made where, and broken code can be fixed more easily than if you had no version control and had to reconstruct the working code from scratch.  Unlike Subversion and CVS, Git is a distributed version control system.  Each user has their own copy of the entire repository and history.  Branching and merging is much easier and it’s extremely simple to get started.  Plus, having used all three, Git is the most fun.

Academic settings impose different constraints on code base management.  The goal is usually less about code quality and more about exploring possibilities.  Academic code is often quite shitty, hacked together by some grad student(s), with dozens of false starts and changes in requirements.  Trying to recreate previous experiments is often very difficult unless the grad student made provisions for such rollbacks.  And if they have, it’s probably done in a way that seemed logical to the grad student at the time but is a nightmare for someone new to the project.  There are ways to avoid this, by placing more of an emphasis on software engineering, but sometimes projects are so small or short-lived that it doesn’t seem feasible to trouble with that at first.  And if you don’t even have a clear picture of where you are heading, it might not even be possible (though you are probably doomed to many problems in that case).

To help combat these issues, I will contend that every academic software project must use version control.  Git makes that easy and here’s why.

1.  Creating the first repository is a no-brainer.

To create a new repository you simply type:

git init

It’s so easy, you can use it for anything.  To clone someone else’s repository, just type:

git clone git://location.of.origin.repository

Cloning is very similar to checking out in Subversion and CVS, except that you can now work completely independently if you desire.  And you can tunnel it through ssh (substitute ssh:// for git:// above), if you’re worried about security.
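To make the "no-brainer" claim concrete, here is a minimal sketch of going from an empty directory to a first commit. The temporary directory, the file name, and the identity are all made up for illustration; in real life you would run git init inside your existing project.

```shell
#!/bin/sh
# Sketch: put a brand-new experiment directory under version control.
# Directory, file name, and identity are hypothetical placeholders.
set -e

project=$(mktemp -d)                        # stand-in for your project directory
cd "$project"

git init -q                                 # creates the hidden .git directory
git config user.name "Example Student"      # identity stored only in this repository
git config user.email "student@example.edu"

echo 'print("hello, experiment")' > run_experiment.py
git add run_experiment.py                   # stage the new file
git commit -q -m "Add first experiment script"

commits=$(git rev-list --count HEAD)        # how many commits exist so far
echo "commits recorded: $commits"
```

From here on, every change you commit is something you can get back to later.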

2.  You can still use it while off the grid.

In Subversion, creating the initial repository means needing some central place where all of the code goes.  If you are collaborating with several people, chances are this repository is not on your own machine so if you cannot access the network, you cannot access the repository.  With Git, you store the entire repository and history on your own machine so even if you are off the network, you can take advantage of all of the features of having version control.

3.  Branch your experiments.

Often the need arises to try out different approaches in academic coding.  Branching in Git is ridiculously simple:

git checkout -b new-branch-name

You can easily switch between multiple branches, merge branches, or discard them.  One approach might be to keep the main architecture stuff in your master branch (the original) and use branches for different parameters in experiments.  This will let you easily and logically separate functionality so that running an old experiment is just a matter of checking out the branch that pertained to it.  Update:  Thanks to Dustin  Sallings for the shorter version of checking out a new branch.
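The branch-per-experiment idea above can be sketched roughly like this. The branch name, the params.txt file, and the parameter values are invented for the example, and the script asks Git for the name of the initial branch rather than assuming it is called master.

```shell
#!/bin/sh
# Sketch: keep the baseline on the main branch, one branch per experiment.
# Branch name, file name, and parameter values are hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"
git config user.email "student@example.edu"
trunk=$(git symbolic-ref --short HEAD)      # initial branch ("master" in this post)

echo "learning_rate=0.1" > params.txt
git add params.txt
git commit -q -m "Baseline parameters"

git checkout -q -b exp-low-learning-rate    # create and switch to an experiment branch
echo "learning_rate=0.01" > params.txt
git commit -q -a -m "Try a lower learning rate"

git checkout -q "$trunk"                    # back to the baseline
baseline=$(cat params.txt)
git checkout -q exp-low-learning-rate       # rerunning the old experiment later
experiment=$(cat params.txt)
echo "baseline: $baseline / experiment: $experiment"
```

Each checkout swaps the working files to match that branch, so the experiment's settings travel with the branch instead of living in your memory.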

4.  Version control your paper.

Why use a shared folder or email to edit your paper?  You can easily create a Git repository to collaborate and merge changes.  You can quickly see who contributed what to a paper.  Dario Taraborelli wrote about this a few months ago, though his point was that you would need your collaborators to be familiar with a version control system and they usually aren’t.  I am arguing that they should be.  On a side note, another VCS, Bazaar, is listed as an alternative in the comments to Dario’s post.

5.  Convert into an open source project.

Sourceforge has been around for a while, but the UI is absolute garbage.  There is an even better solution out there:  GitHub.  GitHub is free for open source projects and offers some great visualizations for helping you track the life of your open source project.  Of course, there is Google Code, which is quite nice and easy to use.  It doesn’t support Git, just Subversion.  The drawback to using Google Code is that you have a lifetime max of 10 open source projects.  No such limit with GitHub.  Moving your Git repository to GitHub is also simple: add GitHub as a remote and push your project to it.
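As a rough sketch of what publishing an existing repository involves, the script below pushes to a local bare repository that stands in for the empty repository you would create on GitHub. The paths and names are made up; with GitHub, the remote would be whatever URL GitHub shows you for your new repository.

```shell
#!/bin/sh
# Sketch: publish an existing local repository to a hosted one.
# hosted.git is a local stand-in for an empty repository on GitHub.
set -e

work=$(mktemp -d)
git init -q --bare "$work/hosted.git"       # the "hosted" side (hypothetical)
git init -q "$work/project"                 # your existing research code
cd "$work/project"
git config user.name "Example Student"
git config user.email "student@example.edu"
trunk=$(git symbolic-ref --short HEAD)      # name of the initial branch

echo "My research code" > README
git add README
git commit -q -m "Initial commit"

git remote add origin "$work/hosted.git"    # with GitHub: the repository URL instead
git push -q -u origin "$trunk"              # upload history and track the remote branch

published=$(git ls-remote --heads origin | wc -l | tr -d ' ')
echo "branches published: $published"
```

After the first push, a plain git push is enough to send new commits to the same place.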

Why does this even matter?  Check out Ted Pedersen’s Empiricism is not a matter of faith [pdf] in the September issue of Computational Linguistics.  He contends that you should create academic software with the goal of releasing it.  This ensures the survivability of your project, increases the impact of your work, and allows reproducibility of your results.  Git makes that easier, n’est-ce pas?

6.  Keep track of your grad students.

Suspect your grad students are slacking?  Check the commit logs!  And now I prepare for hate mail from grad students.  However, I think that if I had this form of accountability, it would have made me more productive.  Of course, you don’t need Git for this, any version control system would do.  Of all the systems I’ve used, Git’s presentation of changes is the user-friendliest.
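For the curious advisor, one way to check the commit logs is a per-author summary. This is only a sketch: the authors and commit messages are invented, and the commits are empty to keep it short.

```shell
#!/bin/sh
# Sketch: count commits per author. Names and messages are hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Advisor"
git config user.email "advisor@example.edu"

# Empty commits keep the sketch short; real commits would carry code changes.
git commit -q --allow-empty --author="Alice <alice@example.edu>" -m "Add tokenizer"
git commit -q --allow-empty --author="Alice <alice@example.edu>" -m "Fix tokenizer bug"
git commit -q --allow-empty --author="Bob <bob@example.edu>" -m "Add evaluation script"

# Commits per author, busiest first. Naming HEAD explicitly matters in scripts:
# without a revision, shortlog reads from standard input when it is not a terminal.
summary=$(git shortlog -s -n HEAD)
echo "$summary"

alice_commits=$(git rev-list --count --author="Alice" HEAD)
echo "Alice: $alice_commits commits"
```

Commit counts are a crude measure, of course, so treat the summary as a conversation starter rather than a verdict.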

7.  Version control helps you write the paper.

When it comes time to write the paper, the version control logs can be used to provide a roadmap of what you have done.  Even though you probably have kept good notes, version control keeps a calendar of events that can add useful perspective (or fill in gaps when your notes are inadequate).
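Here is a small sketch of pulling that calendar of events out of the history. The commit messages and the notes file are invented; the useful part is the git log call, which prints one line per commit, oldest first, with the date in front.

```shell
#!/bin/sh
# Sketch: turn the commit history into a dated roadmap for the paper.
# File name and commit messages are hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"
git config user.email "student@example.edu"

echo "bag-of-words features" > notes.txt
git add notes.txt
git commit -q -m "Add baseline with bag-of-words features"

echo "bigram features" >> notes.txt
git commit -q -a -m "Add bigram features"

# %ad is the author date (shortened by --date=short), %s is the subject line.
roadmap=$(git log --reverse --date=short --pretty=format:'%ad  %s')
echo "$roadmap"

entries=$(echo "$roadmap" | wc -l | tr -d ' ')
echo "roadmap entries: $entries"
```

The better your commit messages, the more of the "what we did" section this writes for you.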

8.  Git is faster and leaner than other version control systems.

Because you have the complete repository on your own system, most operations are much faster in Git.  Git reports an order of magnitude improvement in speed for some operations.  Git has a packed format they report uses less storage in most circumstances, as well.  Git has been reported to be almost three times more space efficient than Bazaar, another distributed version control system mentioned above.  Git also features an easy binary search (git bisect) for locating the commit that introduced a bug.
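The binary search deserves a quick illustration. In this sketch a fake bug (the word BROKEN in a file) sneaks in at the third of five commits, and git bisect run finds it using a one-line test. The file name, commit messages, and the test itself are made up; in practice the test would be your own script that exits non-zero when the bug is present.

```shell
#!/bin/sh
# Sketch: let git bisect find the commit that introduced a bug.
# model.txt and the BROKEN marker are hypothetical stand-ins for real code.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Student"
git config user.email "student@example.edu"

# Five commits; the "bug" appears at step 3 and stays.
for i in 1 2 3 4 5; do
    if [ "$i" -ge 3 ]; then
        echo "BROKEN step $i" > model.txt
    else
        echo "ok step $i" > model.txt
    fi
    git add model.txt
    git commit -q -m "Step $i"
done

good=$(git rev-list --max-parents=0 HEAD)   # the very first commit, known to work

git bisect start HEAD "$good" > /dev/null   # bad commit first, then good commit
# The test command exits 0 (good) when BROKEN is absent and 1 (bad) when present.
git bisect run sh -c '! grep -q BROKEN model.txt' > /dev/null

culprit=$(git log -1 --pretty=format:%s refs/bisect/bad)
git bisect reset > /dev/null 2>&1           # return to where we started
echo "first bad commit: $culprit"           # prints "first bad commit: Step 3"
```

With five commits this saves little, but over a few hundred commits the search needs only a handful of test runs.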

9.  Version control makes it easier to bring new team members up to speed.

Speaking from experience, having a record of commits (and well documented commits) makes it easier to come up to speed on an existing project.  This applies not only to academic coding but to any coding endeavor.  Good documentation doesn’t hurt either.

10.  Save yourself some headaches.

I think you’ll minimize headaches if you use Git.  If not Git, at least use some version control system.  A lot of the things I listed above are covered by most version control systems, but Git combines regular advantages of version control in a way that is very friendly to non-linear coding situations.  Git also makes it a cinch to move your code into an open source project that can have a significant impact on your career as a researcher.  And Git is so easy to use, you have to ask yourself, why not?

This is research I did a while ago and presented Monday to fulfill the requirements of my Masters degree.  The presentation only needed to be about 20 minutes, so it was a very short intro.  We have moved on since then, so when I say future work, I really mean future work.  The post is rather lengthy, so I have moved the main content below the jump.


Today is the official opening day of GWAP: Games with a Purpose. This is one of two research projects I have been working on for the past few months, though my involvement with GWAP so far has only been in the form of attending meetings, minor testing, and offering my sage gaming advice (and by sage, I mean the herb). GWAP is the next phase in Luis von Ahn’s human computation project. If you visit and play some games, not only will you be rewarded with a good time, but you’ll be helping science! Science needs you. To play games. Now.

The Idea

Artificial intelligence has come a long way, but humans are still far better than computers at simple, everyday tasks. We can quickly pick out the key points in a photo, we know what words mean and how they are related, we can identify various elements in a piece of music, etc. All of these things are still very difficult for computers. So why not funnel some of the gazillion hours we waste on solitaire into something useful? Luis has already launched a couple websites that let people play games while solving these problems. Perhaps you’ve noticed the link to Google Image Labeler on Google Image Search? That idea came from his ESP game (which is now on GWAP).

The Motivation

What researchers need to help them develop better algorithms for computers to do these tasks is data. The more data the better. Statistical machine translation has improved quite a bit over the past few years, in large part due to an increased amount of data. This is the reason why languages that are spoken by few people (even those spoken by as few as several million) still don’t have machine translation tools: there is just not enough data. More data means more food for these algorithms which means better results. And if results don’t improve, then we have learned something else.

The Solution

Multiple billions of hours are spent each year on computer games. If even a small fraction of that time were spent performing some task that computers aren’t yet able to do, we could increase the size of the data sets available to researchers enormously. Luis puts this all a lot better than I can, and fortunately, you can watch him on YouTube (below).

So, check it out already.