Posts Tagged ‘latent dirichlet allocation’

A while back I ported David Blei's lda-c code for performing Latent Dirichlet Allocation to Ruby. Basically I just wrapped the C methods in a Ruby class, turned it into a gem, and called it a day. The result was a bit ugly and unwieldy, like most research code. A few months later, Todd Fisher came along and discovered a couple of bugs and memory leaks in the C code, for which I am very grateful. I had been toying with the idea of improving the Ruby code anyway, so I embarked on a mission to do so. The result is, I hope, a much cleaner gem that can be used right out of the box with little screwing around.

Unfortunately, I did something I'm ashamed of. Ruby gems are notorious for breaking backwards compatibility, and I have done just that. The good news is that your code should mostly still work, assuming you didn't dive too heavily into the Document and Corpus classes. If you did, you will probably see a lot of breakage. I hope the result is a more sensible implementation, though, so maybe you won't hate me. Of course, I could be wrong and my implementation is still crap. If that's the case, please let me know what needs to be improved.

To install the gem:

gem sources -a http://gems.github.com
sudo gem install ealdent-lda-ruby

Enjoy!


Since Ruby is my new favorite toy, I thought it would be fun to try my hand at C extensions. I came across David Blei's C code for Latent Dirichlet Allocation and it looked simple enough to convert into a Ruby module. Ruby makes it very easy to wrap a few C functions (which is good to know if you need a really fast implementation of something that gets called a lot). Wrapping a whole C library is slightly harder, but not horribly so. Most of my trouble probably came from the fact that it's been so long since I wrote anything in C.
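If you're curious what the glue looks like, the build half of a C extension is just a tiny Ruby script that uses mkmf to generate a Makefile. Here's a generic sketch (the extension name is made up for illustration, not the actual lda-ruby build script):

# extconf.rb -- generates a Makefile for the C sources in this directory
require 'mkmf'

# 'lda_ext' is just an illustrative name for the compiled shared library
create_makefile('lda_ext')

Running ruby extconf.rb && make compiles the C code into a shared library you can require from Ruby like any other file; the actual mapping of C functions to Ruby methods happens in the C source via Ruby's C API.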

Since the code is open source, I decided to release the Ruby wrapper as a gem on GitHub. I chose GitHub over RubyForge because it uses Git and freakin' rocks. But GitHub is a story for another day. Feel free to contribute and extend the project if you're so inclined.

A basic usage example:

require 'lda'
# create an Lda object for training
lda = Lda::Lda.new
corpus = Lda::Corpus.new("data/data_file.dat")
lda.corpus = corpus
# run EM algorithm using random starting points
lda.em("random")
lda.load_vocabulary("data/vocab.txt")
# print the top 20 words for each topic
lda.print_topics(20)
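
In case you're wondering what data_file.dat looks like, the corpus is in the same format Blei's original lda-c expects: each line is one document, starting with the number of unique terms in that document, followed by index:count pairs (the indices are zero-based positions in the vocabulary file). A made-up two-document corpus would look something like this:

3 0:2 4:1 7:3
2 1:1 4:2

And vocab.txt is just one word per line, in index order.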

You can also install the gem directly from GitHub:

gem sources -a http://gems.github.com
sudo gem install ealdent-lda-ruby

You only need the first line if you haven’t added GitHub to your sources before.

Latent Dirichlet Allocation (LDA) is an unsupervised method for finding topics in a collection of documents. It posits a set of possible topics, a subset of which is selected for each document. This mixture of topics represents what the document discusses, and each word in the document is generated from that mixture. As a quick example, suppose we had a short document with the topics geology and astronomy:

The rover traveled many millions of miles through space to arrive at Mars. Once there, it collected soil samples and examined them to determine if liquid water had ever been present on the surface.

In this case, words like "space" and "Mars" come from the astronomy topic, while "soil samples" and "surface" come from geology. LDA finds these latent topics in an unsupervised fashion using the EM algorithm. EM is a two-step process for estimating the parameters of a statistical model. The nice thing about it is that each iteration is guaranteed to never decrease the likelihood, so it converges to a local maximum (not necessarily the global one!). However, it can take a while to converge, depending on the size and nature of the data and the model. EM was one of the most confusing concepts when I was in school, and I'm still not 100% on it, but it makes a lot more sense now than it did before.

In the context of LDA, EM is basically doing two things. First, we come up with a guess about which topics generated which words, given the current model (the E-step). Then we re-estimate the model's probabilities based on those hypothesized topic assignments (the M-step), and repeat. Eventually we converge to a local "best" set of topics. These may not correspond to realistic topics, but they maximize the likelihood of the model (equivalently, they minimize the negative log likelihood). Usually LDA does a pretty good job of finding explainable topics given a decent amount of data.
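To make that a bit more concrete, here's a toy Ruby sketch of the generative story that EM is working backwards from. The topic mixture and the per-topic word distributions below are hand-picked for the Mars example above; in real LDA, those are exactly the things being estimated:

# Toy sketch of LDA's generative story (hand-picked numbers, not learned ones)
TOPICS = {
  :astronomy => { 'space' => 0.4, 'mars' => 0.4, 'rover' => 0.2 },
  :geology   => { 'soil' => 0.5, 'samples' => 0.3, 'surface' => 0.2 }
}

# Normally the per-document topic mixture is drawn from a Dirichlet;
# here we just fix one to keep things short.
theta = { :astronomy => 0.6, :geology => 0.4 }

# Draw one item from a { item => probability } hash.
def sample(dist)
  r = rand
  dist.each { |item, p| return item if (r -= p) <= 0 }
  dist.keys.last
end

# For each word: pick a topic from theta, then pick a word from that topic.
document = (1..10).map do
  topic = sample(theta)
  sample(TOPICS[topic])
end

puts document.join(' ')

Running it spits out a nonsense "document" whose words are drawn 60/40 from the two topics. LDA's job is to recover something like theta and TOPICS given nothing but a pile of such documents.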

For more details about LDA, check out the original paper by Blei et al. (2003). LDA has been extended in a number of different directions since then, so the paper is essential reading if you're doing any sort of topic modeling [citation needed].

References

D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, vol. 3, 2003, pp. 993–1022.