Posts Tagged ‘github’

GitHub just announced its own version of the Netflix Prize.  Instead of predicting movie ratings, GitHub wants you to suggest repositories for users to watch.  This differs from the Netflix Prize in a number of ways:

  1. a user watching a repo is similar to a user visiting a page from a search engine – both are implicit endorsements (we assume that watching means the user actually likes the repo)
  2. we are predicting the likelihood of a user wanting to watch a repo (a binary event), rather than how much a user likes a movie
  3. the data set is a lot smaller, and sparsity is a LOT greater (the matrix is 0.006% filled vs. 1% for Netflix)
  4. you get multiple tries!  they let you pick 10 repos that a user may watch and as long as one of them matches, you get credit for it
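
The scoring rule in point 4 is easy to sketch in Ruby.  This is only my reading of it, assuming one held-out repo per user; the method name `hit_rate` and the hash layouts are made up for illustration, and the contest's actual scoring script may well differ.

```ruby
# Hypothetical scorer: a user counts as a hit when any of their (up to 10)
# guessed repos matches the held-out repo that user actually watches.
def hit_rate(guesses, held_out, max_guesses = 10)
  return 0.0 if held_out.empty?

  hits = held_out.count do |user, repo|
    guesses.fetch(user, []).first(max_guesses).include?(repo)
  end
  hits.to_f / held_out.size
end

# Made-up data: user id => guessed repo ids, and user id => held-out repo id.
guesses  = { 1 => [10, 11, 12], 2 => [20, 21], 3 => [] }
held_out = { 1 => 12, 2 => 99, 3 => 30 }
puts hit_rate(guesses, held_out) # one hit out of three users
```

Because only one match per user is needed, spreading the ten guesses across different kinds of candidates costs nothing.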

Already there have been many submissions.  The number one place is currently held by Daniel Haran with 46.9% guessed correctly.  Happy hunting, if you decide to compete.

The prizes are a bottle of Pappy Van Winkle bourbon and a large GitHub account for life.  The bottle of Pappy is making me consider competing.


A twitter friend (@communicating) tipped me off to the UEA-Lite Stemmer by Marie-Claire Jenkins and Dan J. Smith.  Stemmers are NLP tools that get rid of inflectional and derivational affixes from words.  In English, that usually means getting rid of the plural -s, progressive -ing, and preterite -ed.  Depending on the type of stemmer, that might also mean getting rid of derivational suffixes like -ful and -ness.  Sometimes it’s useful to be able to reduce words like consolation and console to the same root form: consol.  But sometimes that doesn’t make sense.  If you’re searching for video game consoles, you don’t want to find documents about consolation.  In this case, you need a conservative stemmer.
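
To make the idea of a conservative stemmer concrete, here is a toy sketch in Ruby.  The rules below are my own illustrative assumptions and are NOT the UEA-Lite rule set; the name `toy_stem` is hypothetical.  It strips a few inflectional endings and leaves derivational suffixes like -ation alone, so console and consolation stay distinct.

```ruby
# Toy conservative stemmer: strips only a few inflectional suffixes.
# These rules are illustrative assumptions, NOT the UEA-Lite rules.
TOY_RULES = [
  [/ies\z/, "y"],               # ponies   -> pony
  [/([^s])s\z/, '\1'],          # consoles -> console (leaves "glass" alone)
  [/([a-z]{3,})ing\z/, '\1'],   # walking  -> walk
  [/([a-z]{3,})ed\z/, '\1']     # walked   -> walk
].freeze

def toy_stem(word)
  w = word.downcase
  # Apply the first rule that matches, then stop.
  TOY_RULES.each do |pattern, replacement|
    return w.sub(pattern, replacement) if w.match?(pattern)
  end
  w # no rule matched: leave the word untouched
end

%w[consoles consolation walking walked].each do |word|
  puts "#{word} -> #{toy_stem(word)}"
end
```

A real stemmer needs many more rules and exceptions than this; the point is only that leaving a word alone is a legitimate outcome.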

The UEA-Lite Stemmer is a rule-based, conservative stemmer that handles regular words, proper nouns and acronyms.  It was originally written in Perl and has since been ported to Java.  Since I usually code in Ruby these days, I thought it’d be nice to make it available to the Ruby community, so I ported it over last night.

The code is open source under the Apache 2 License and hosted on GitHub.  So please check out the code and let me know what you think.  Heck, you can even fork the project and make some improvements yourself if you want.

One direction I’d like to be able to go is to turn all of the rules into finite state transducers, which can be composed into a single large deterministic finite state transducer.  That would be a lot more efficient (and even fun!), but Ruby lacks a decent FST implementation.


Jekyll and Code

Posted: 8 January 2009 in Uncategorized

Tom Preston-Werner, aka mojombo, rocks.  When GitHub announced GitHub Pages recently, they pointed to a new blog engine, Jekyll.  Jekyll generates the blog as a set of static pages — no database reads, no PHP, just fast HTML.  I was instantly drawn to it, and since I’ve been itching to switch blog engines, I damn near moved this blog.  It would be hosted on GitHub, for free.  And it would be backed up using my favorite version control system.  I would have complete access to all of my content.  If WordPress went belly up, I would lose all of my content.  That bothers me.

Jekyll is still in its infancy.  But for two things, I would switch right now.  First, support for tags is incomplete, so pages on my blog such as http://mendicantbug.com/category/computational-linguistics/ would no longer be supported under Jekyll.  That would play hell with my Google traffic.  I’m willing to make that sacrifice since most of that traffic is from people who don’t care about the main topics I’m interested in.  Second, and this is the killer, Jekyll does not support comments.  Yet.  The good news is, it can be forked and someone may implement comments.  I hope so, but the static nature of Jekyll means handling comments is not very straightforward.  I can imagine how it might be done, so we’ll see.  I suppose I could do it myself, but my plate is so full right now I’m having a hard time getting what I need to get done done.

So what I’m doing instead, for now, is hosting my code there.  Jekyll has code highlighting built-in using Liquid.  Handy!  I put up the source for my post on Bandwidth simulation.  I’ll be adding more soon, which I’ll make note of, if for some reason you’re actually interested.

Git is a version control system that has been gaining in popularity recently.  If you have heard of or used Subversion or CVS, you are familiar with the basic principle of keeping track of changes by multiple users in a series of documents (source code, text files, etc).  One of the chief benefits of version control in software is that you can roll back in case the code has become corrupted.  It’s easy to see which changes were made where, and broken code can be fixed more easily than if you had no version control and had to reconstruct the working code from scratch.  Unlike Subversion and CVS, Git is a distributed version control system.  Each user has their own copy of the entire repository and history.  Branching and merging are much easier and it’s extremely simple to get started.  Plus, having used all three, I find Git the most fun.

Academic settings impose different constraints on code base management.  The goal is usually less about code quality and more about exploring possibilities.  Academic code is often quite shitty, hacked together by some grad student(s), with dozens of false starts and changes in requirements.  Trying to recreate previous experiments is often very difficult unless the grad student made provisions for such rollbacks.  And if they have, it’s probably done in a way that seemed logical to the grad student at the time but is a nightmare for someone new to the project.  There are ways to avoid this, by placing more of an emphasis on software engineering, but sometimes projects are so small or short-lived that it doesn’t seem feasible to trouble with that at first.  And if you don’t even have a clear picture of where you are heading, it might not even be possible (though you are probably doomed to many problems in that case).

To help combat these issues, I will contend that every academic software project must use version control.  Git makes that easy and here’s why.

1.  Creating the first repository is a no-brainer.

To create a new repository you simply type:

git init

It’s so easy, you can use it for anything.  To clone someone else’s repository, just type:

git clone git://location.of.origin.repository

Cloning is very similar to checking out in Subversion and CVS, except that you can now work completely independently if you desire.  And you can tunnel it through ssh (substitute ssh:// for git:// above), if you’re worried about security.

2.  You can still use it while off the grid.

In Subversion, creating the initial repository means needing some central place where all of the code goes.  If you are collaborating with several people, chances are this repository is not on your own machine so if you cannot access the network, you cannot access the repository.  With Git, you store the entire repository and history on your own machine so even if you are off the network, you can take advantage of all of the features of having version control.

3.  Branch your experiments.

Often the need arises to try out different approaches in academic coding.  Branching in Git is ridiculously simple:

git checkout -b new-branch-name

You can easily switch between multiple branches, merge branches, or discard them.  One approach might be to keep the main architecture stuff in your master branch (the original) and use branches for different parameters in experiments.  This will let you easily and logically separate functionality so that running an old experiment is just a matter of checking out the branch that pertained to it.  Update:  Thanks to Dustin  Sallings for the shorter version of checking out a new branch.

4.  Version control your paper.

Why use a shared folder or email to edit your paper?  You can easily create a Git repository to collaborate and merge changes.  You can quickly see who contributed what to a paper.  Dario Taraborelli wrote about this a few months ago, though his point was that you would need your collaborators to be familiar with a version control system and they usually aren’t.  I am arguing that they should be.  On a side note, another VCS, Bazaar, is listed as an alternative in the comments to Dario’s post.

5.  Convert into an open source project.

SourceForge has been around for a while, but the UI is absolute garbage.  There is an even better solution out there:  GitHub.  GitHub is free for open source projects and offers some great visualizations for helping you track the life of your open source project.  Of course, there is Google Code, which is quite nice and easy to use.  It doesn’t support Git, just Subversion.  The drawback to using Google Code is that you have a lifetime max of 10 open source projects.  No such limit with GitHub.  Moving your Git repository to GitHub is also a simple matter of pushing your project to GitHub.

Why does this even matter?  Check out Ted Pedersen’s Empiricism is not a matter of faith [pdf] in the September issue of Computational Linguistics.  He contends that you should create academic software with the goal of releasing it.  This ensures the survivability of your project, increases the impact of your work, and allows reproducibility of your results.  Git makes that easier, n’est-ce pas?

6.  Keep track of your grad students.

Suspect your grad students are slacking?  Check the commit logs!  And now I prepare for hate mail from grad students.  However, I think that if I had this form of accountability, it would have made me more productive.  Of course, you don’t need Git for this; any version control system would do.  Of all the systems I’ve used, Git’s presentation of changes is the most user-friendly.

7.  Version control helps you write the paper.

When it comes time to write the paper, the version control logs can be used to provide a roadmap of what you have done.  Even though you probably have kept good notes, version control keeps a calendar of events that can add useful perspective (or fill in gaps when your notes are inadequate).

8.  Git is faster and leaner than other version control systems.

Because you have the complete repository on your own system, most operations are much faster in Git.  Git reports an order of magnitude improvement in speed for some operations.  Git’s packed storage format is also reported to use less space in most circumstances.  Git has been reported to be almost three times more space efficient than Bazaar, another distributed version control system mentioned above.  Git also features an easy binary search for locating the commit that introduced a bug.
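
The idea behind that binary search is simple enough to sketch in Ruby.  This is an illustration of the principle only, assuming a linear history where the oldest commit is known good and the newest is known bad; it is not how Git itself is implemented, and `first_bad_commit` is a hypothetical name.

```ruby
# Idea behind a bisecting search: given commits ordered oldest to newest,
# where the first is known good and the last is known bad, find the first
# bad commit with O(log n) checks.  Illustrative only; not Git's code.
def first_bad_commit(commits)
  good = 0                  # index of a known-good commit
  bad = commits.size - 1    # index of a known-bad commit
  while bad - good > 1
    mid = (good + bad) / 2
    if yield(commits[mid])  # block returns true when the commit is bad
      bad = mid
    else
      good = mid
    end
  end
  commits[bad]
end

commits = %w[c1 c2 c3 c4 c5 c6 c7 c8]
broken_since = 5 # hypothetical: everything from c6 (index 5) on is broken
culprit = first_bad_commit(commits) { |c| commits.index(c) >= broken_since }
puts culprit
```

With eight commits this needs three checks instead of up to seven, and the gap grows quickly with longer histories.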

9.  Version control makes it easier to bring new team members up to speed.

Speaking from experience, having a record of commits (and well documented commits) makes it easier to come up to speed on an existing project.  This applies not only to academic coding but to any coding endeavor.  Good documentation doesn’t hurt either.

10.  Save yourself some headaches.

I think you’ll minimize headaches if you use Git.  If not Git, at least use some version control system.  A lot of the things I listed above are covered by most version control systems, but Git combines regular advantages of version control in a way that is very friendly to non-linear coding situations.  Git also makes it a cinch to move your code into an open source project that can have a significant impact on your career as a researcher.  And Git is so easy to use, you have to ask yourself, why not?

Fun with trees in Ruby

Posted: 20 November 2008 in Uncategorized

Like Java and unlike Python, Ruby does not support multiple inheritance.  Also there is no explicit way to create an interface.  One way Ruby lets you get around both problems is by allowing you to include a module in a class.  It’s not quite the same, but with the proper planning you can duplicate the functionality.  Of course, one question you should always ask yourself when trying to shoehorn something from one language into another is if you’re really going about it the right way.

One way of implementing a Java-like interface in Ruby is by creating a module containing the skeleton functions you want the implementing class to implement.

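module A
  # Skeleton method: implementing classes are expected to override it.
  def method1
    raise NotImplementedError, "method1 is not implemented"
  end
end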

class B
  include A
  def method1
    puts "now implemented"
  end
end

Presto, module A is basically a Java interface.  Of course, whether a method has been implemented is not checked until runtime when the method is actually called.  Also if you mix in implemented methods alongside the interface methods, you have something very like an abstract class (minus the compile-time checking).
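
Here is a rough sketch of the abstract-class flavor, with one skeleton method and one implemented method mixed in together.  The names `Shape`, `Square` and `Blob` are made up for illustration, and raising `NotImplementedError` is just one common convention.

```ruby
# Sketch of a module acting like an abstract class: one "abstract" method
# plus one concrete method built on top of it.  Names are hypothetical.
module Shape
  # "Abstract": implementing classes are expected to override this.
  def area
    raise NotImplementedError, "#{self.class} must implement area"
  end

  # Concrete: works for any class that supplies area.
  def describe
    "#{self.class.name} with area #{area}"
  end
end

class Square
  include Shape

  def initialize(side)
    @side = side
  end

  def area
    @side * @side
  end
end

class Blob
  include Shape # forgets to implement area
end

puts Square.new(3).describe

begin
  Blob.new.describe
rescue NotImplementedError => e
  # The omission only surfaces here, when the method is actually called.
  puts "caught at call time: #{e.message}"
end
```

Note that `NotImplementedError` is not a `StandardError`, so a bare `rescue` will not catch it; it has to be named explicitly as above.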

This came up when I was implementing a bunch of simple tree functions like finding the siblings of a node, finding the grandparent, the descendants, the leaves of a subtree, etc.  It seemed like these were things that should be implemented already.  And why not?  So I threw all of those methods into a module and made it like a Java abstract class.  All you have to implement is a method that returns the parent of the current node (or nil if there is none) and a method that returns an Array of the children of the current node.  Your class can pull children from a database, a file, something more complex — it doesn’t matter.  Just implement those two methods and drop in the SimpleTree module and problem solved.
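
To show the shape of the idea, here is a minimal sketch of such a mixin.  The module name `TreeHelpers` and its method names are assumptions made for this example and may not match what the gem actually provides; the only contract is that the host class supplies `parent` and `children`.

```ruby
# Hypothetical mixin in the spirit of the post: the host class supplies
# `parent` and `children`; everything else is derived from those two.
# Method names here are assumptions and may not match the actual gem.
module TreeHelpers
  def siblings
    return [] if parent.nil?
    parent.children.reject { |node| node.equal?(self) }
  end

  def grandparent
    parent && parent.parent
  end

  def descendants
    children.flat_map { |child| [child] + child.descendants }
  end

  def leaves
    children.empty? ? [self] : children.flat_map(&:leaves)
  end
end

# A simple in-memory node; children could just as well come from a database.
class Node
  include TreeHelpers
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
    parent.children << self if parent
  end
end

root = Node.new("root")
a = Node.new("a", root)
b = Node.new("b", root)
c = Node.new("c", a)
puts root.descendants.map(&:name).inspect
puts c.grandparent.name
```

Everything in the module is written against the two required methods only, which is what lets the backing storage vary freely.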

Since I’ve been having fun with gems, I made one for this and slapped it up on GitHub.  To get it, just type:

sudo gem install ealdent-simple-tree

Assuming that you have already done this at some point in the past:

gem sources -a http://gems.github.com

Feel free to extend it, modify it, contribute to it, etc. I’m using the BSD license, which is my current favorite.

Since Ruby is my new favorite toy, I thought it would be fun to try my hand at C extensions.  I came across David Blei’s C code for Latent Dirichlet Allocation and it looked simple enough to convert into a Ruby module.  Ruby makes it very easy to wrap some C functions (which is good to know if you need a really fast implementation of something that gets called a lot).  Wrapping a C library is slightly harder, but not horribly so.  Probably most of my challenge was the fact that it’s been so long since I wrote anything in C.

Since the code is open source, I decided to release the Ruby wrapper as a gem on GitHub.  I chose GitHub over RubyForge, because it uses Git and freakin’ rocks.  But GitHub is a story for another day.  Feel free to contribute and extend the project if you’re so inclined.

A basic usage example:

require 'lda'
# create an Lda object for training
lda = Lda::Lda.new
corpus = Lda::Corpus.new("data/data_file.dat")
lda.corpus = corpus
# run EM algorithm using random starting points
lda.em("random")
lda.load_vocabulary("data/vocab.txt")
# print the top 20 words per topic
lda.print_topics(20)

You can also download the gem from GitHub directly:

gem sources -a http://gems.github.com
sudo gem install ealdent-lda-ruby

You only need the first line if you haven’t added GitHub to your sources before.