Posts Tagged ‘statistics’

I just published the simple-random Ruby gem, which is ported from C# code by John D. Cook.  You can view the source on GitHub or install the gem via RubyGems:

gem install simple-random

The gem allows you to sample from the following distributions:

  • Beta
  • Cauchy
  • Chi Square
  • Exponential
  • Gamma
  • Inverse Gamma
  • Laplace (double exponential)
  • Normal
  • Student t
  • Uniform
  • Weibull

Simple examples:

require 'rubygems'
require 'simple-random'

r = SimpleRandom.new
r.uniform # => 0.127064087195322
r.normal(5, 1) # => 5.71972152940515
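
For a feel of what sampling from one of these distributions involves, here is a minimal sketch that turns uniform draws into normal draws with the Box-Muller transform. It uses only Ruby's standard library. The `normal_sample` helper is a hypothetical name for illustration and is not part of the gem's API, and this is not necessarily how the gem implements `normal` internally.

```ruby
# Hypothetical helper, not part of the simple-random gem:
# builds one normal sample from two uniform samples (Box-Muller transform).
def normal_sample(mean = 0.0, std_dev = 1.0, rng = Random.new)
  # 1.0 - rand lies in (0, 1], which keeps Math.log away from zero.
  u1 = 1.0 - rng.rand
  u2 = rng.rand
  z = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math::PI * u2)
  mean + std_dev * z
end

rng = Random.new(42) # fixed seed so runs are repeatable
samples = Array.new(10_000) { normal_sample(5.0, 1.0, rng) }
sample_mean = samples.sum / samples.size
puts sample_mean.round(2)
```

With ten thousand draws the sample mean should land close to the requested mean of 5.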

Latent Dirichlet Allocation (LDA) is an unsupervised method of finding topics in a collection of documents.  It posits a set of possible topics from which a subset are selected for each document.  This selected mixture of topics represents the topics discussed in the document, and each word in the document is generated by this mixture.  As a quick example, if we had a short document with the topics geology and astronomy:

The rover traveled many millions of miles through space to arrive at Mars. Once there, it collected soil samples and examined them to determine if liquid water had ever been present on the surface.

In this case, the topic astronomy is represented in red and geology in green.  LDA finds these latent topics in an unsupervised fashion using the expectation-maximization (EM) algorithm.  EM is a two-step process for estimating parameters in a statistical model.  The nice thing about it is that it’s guaranteed to converge to a local maximum (not necessarily the global one!).  However, it can take a while to converge, depending on the size and nature of the data and model.  While I was in school, EM was one of the most confusing concepts, and I’m still not 100% on it, but it makes a lot more sense now than it did before.

In the context of LDA, EM is basically doing two things.  First, we come up with an idea about how the topics are distributed.  Next, we look at the actual words and re-estimate the probabilities in the model based on those hypothesized topics.  We repeat these two steps until we converge to a local “best” set of topics.  These may not correspond to realistic topics, but they (locally) maximize the log probability of the data under the model.  Usually LDA does a pretty good job of finding explainable topics given a decent amount of data.
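
To make the two-step loop concrete, here is a rough sketch of EM on a much simpler model than LDA: a mixture of unigrams, where each document is assumed to come from a single hidden topic. Full LDA lets each document mix several topics and uses a variational version of EM, so treat this only as an illustration of the guess-then-re-estimate idea. The toy documents and all variable names are made up for the example.

```ruby
# Toy EM for a mixture of unigrams (NOT full LDA): each document is assumed
# to come from one hidden topic. All names and data are illustrative.
docs = [
  %w[rover space mars orbit space],
  %w[mars space rover launch],
  %w[soil water rock soil mineral],
  %w[rock water soil surface],
  %w[mars soil water rover]
]
vocab = docs.flatten.uniq
num_topics = 2
rng = Random.new(1)

# Random starting point: topic weights and per-topic word probabilities.
topic_weight = Array.new(num_topics, 1.0 / num_topics)
word_prob = Array.new(num_topics) do
  raw = vocab.map { |w| [w, rng.rand + 0.1] }.to_h
  total = raw.values.sum
  raw.map { |w, v| [w, v / total] }.to_h
end

log_likelihoods = []
20.times do
  # E-step: how strongly does each topic explain each document?
  joint = docs.map do |doc|
    (0...num_topics).map do |k|
      doc.reduce(topic_weight[k]) { |p, w| p * word_prob[k][w] }
    end
  end
  # Log-likelihood of the data under the current parameters.
  log_likelihoods << joint.sum { |row| Math.log(row.sum) }
  resp = joint.map { |row| row.map { |p| p / row.sum } }

  # M-step: re-estimate the parameters from the soft assignments.
  topic_weight = (0...num_topics).map do |k|
    resp.sum { |r| r[k] } / docs.size
  end
  word_prob = (0...num_topics).map do |k|
    counts = Hash.new(0.0)
    docs.each_with_index do |doc, d|
      doc.each { |w| counts[w] += resp[d][k] }
    end
    total = counts.values.sum
    vocab.map { |w| [w, counts[w] / total] }.to_h
  end
end

puts log_likelihoods.first.round(3), log_likelihoods.last.round(3)
```

The recorded log-likelihood should never go down from one iteration to the next, which is the convergence guarantee in action. A different random seed can end up at a different local maximum.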

For more details about LDA, check out the paper by Blei et al. (2003).  LDA has been extended in a number of different directions since the original paper, so it’s essential reading if you’re doing any sort of topic modeling [citation needed].


D.M. Blei, A.Y. Ng, and M.I. Jordan, “Latent Dirichlet allocation,” The Journal of Machine Learning Research, vol. 3, 2003, pp. 993-1022.

Likings Rankness

Posted: 4 May 2008 in Uncategorized

Mayhaps you have used the Facebook app Likeness. It’s a fluff app, but has wide appeal since it does two things most people like: easy quizzes and comparisons with our friends. The graphic design that went into the app is a bit low-scale, but it gets the job done. If you haven’t used it, the concept is simple. You are presented with a quiz topic, like “What’s your addiction?” You are then presented with ten items that you must rank in the order specified by the question page (usually most to least favorite, or whatever). Once you have ranked the ten items, you are shown a screen that easily allows you to goof up and spam all your friends. But after that, it produces some sort of similarity score between you and all your friends who have taken it. I’ve never had a similarity below 46% and never one above 98%.

But it got me thinking, how exactly are they measuring this similarity?
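
One textbook way to compare two rankings of the same items is Spearman's rank correlation. The sketch below is only a guess at the kind of measure such an app could use. Nothing here is confirmed to be what Likeness does, and the mapping from correlation to a percentage is my own assumption. The item names are invented.

```ruby
# Spearman rank correlation for two tie-free rankings of the same items.
# Each ranking lists the items from most to least favorite.
def spearman(ranking_a, ranking_b)
  n = ranking_a.size
  position_b = ranking_b.each_with_index.to_h # item => position in ranking_b
  sum_sq = ranking_a.each_with_index.sum do |item, i|
    (i - position_b.fetch(item))**2
  end
  1.0 - (6.0 * sum_sq) / (n * (n**2 - 1))
end

# Hypothetical quiz answers.
mine   = %w[coffee chocolate tv internet shopping]
theirs = %w[chocolate coffee tv shopping internet]

rho = spearman(mine, theirs)
# Assumed mapping from [-1, 1] onto a 0-100% similarity score.
percent = (rho + 1) / 2 * 100
puts format("rho = %.2f, similarity = %.0f%%", rho, percent)
```

Identical rankings give a correlation of 1 and exactly reversed rankings give -1, so under this assumed mapping the score runs from 0% to 100%.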

Fun with charts

Posted: 19 December 2007 in Uncategorized

I just saw a post on Statistical Modeling dealing with some of the worst uses of statistical graphics this year. Be sure to check it out. I’d have to say I agree with that assessment. The case deals with two pictures of a road during the Crimean War. In the first picture, there is a road covered in cannonballs. In the second, the road is clear. Errol Morris challenged his readers to figure out which picture came first. The correct answer is the clear road.

Morris uses pie charts and bar graphs to display the reasons people gave for their decisions. While colorful, these graphs are also meaningless. So given the data, I z-normalized the “on” choices (cannonballs on the road first) and the “off” choices (clear road first), which means I rescaled each distribution to have mean 0 and standard deviation 1. I used the same bar graph setup (except horizontal this time). Since I normalized each distribution, the actual quantity of voters one way or the other no longer really makes a difference. I am just comparing the relative preference by one side or the other for a given reason. This assumes that there is some significance to a person not choosing a particular reason, which may be incorrect.
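
Here is a small sketch of that normalization step. The counts are invented for illustration and are not Morris's actual data. The `z_normalize` helper is my own name, and it uses the population standard deviation (dividing by n).

```ruby
# Z-normalize a list of counts: subtract the mean, then divide by the
# (population) standard deviation, giving mean 0 and standard deviation 1.
def z_normalize(values)
  mean = values.sum.to_f / values.size
  variance = values.sum { |v| (v - mean)**2 } / values.size
  std_dev = Math.sqrt(variance)
  values.map { |v| (v - mean) / std_dev }
end

# Hypothetical vote counts per reason (not Morris's actual data).
on_first  = { "shadows" => 40, "shelling" => 12, "ball position" => 55 }
off_first = { "shadows" => 15, "shelling" => 30, "ball position" => 50 }

on_z  = on_first.keys.zip(z_normalize(on_first.values)).to_h
off_z = off_first.keys.zip(z_normalize(off_first.values)).to_h

on_z.each_key do |reason|
  puts format("%-14s on: %+.2f  off: %+.2f", reason, on_z[reason], off_z[reason])
end
```

After this step the two groups are on the same scale, so a reason can be compared across groups even when one group has far more voters than the other.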


Errol Morris discusses data on people’s decisions about two photographs from the Crimean War.

So what I think my chart shows is that shadows are the worst feature to choose for correctly guessing which came first. People who focused on either the shelling or characteristics/artistic features were more likely to choose correctly.  The most confusing feature is the number and position of the balls.   Also confusing were practical concerns.  If I were going to train a support vector machine to classify images of this type, I would use the three features: shelling, characteristics/artistic and shadows.

So what do you think? Am I way off on trying to normalize these and make this kind of assessment? I am, after all, a statistics amateur.

The PISA (Program for International Student Assessment) test is administered to 15-year-olds in industrialized countries every three years. The 2006 results were just released and show that US students are ranked 17th out of 30 in science and 24th in math. About 1.3% of students reached the highest level on the test overall with New Zealand and Finland having the most star pupils at 3.9%. [source (Note: may require free registration)]

Perfect Major?

Posted: 18 November 2007 in Uncategorized

The intertubes are full of quizzes. Magazines like Cosmo have thrived on them for years. “Are you a good lover?” Websites like Tickle pretty much consist of nothing else (and I haven’t bothered beyond the odd quiz someone sends me). Tons of Facebook apps like Flixster (movies) and Harry Potter rely on them heavily. One of my Google Alerts is for linguistics and I saw some random 14-year-old dude’s blog post about his perfect major according to this quiz. My results are below the jump.

So of course everyone with sense knows these quizzes are pretty much random. However, they also collect a vast amount of data. What they don’t collect (usually) is actual information about the people who take their quizzes. Imagine if at the end of a quiz there was a question or two about the actual truth of the thing the quiz is predicting. What kind of lover are you? Well, just ask! If the answer is similar to the quiz result, you can gauge how well your quiz is classifying people. It may not produce scientifically valid results, but it does produce results that are better than nothing.
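
As a rough sketch of that idea, suppose each quiz result is stored next to the taker's own answer to the follow-up question. The records and field names below are hypothetical. The fraction of matches gives a crude accuracy figure, and tallying the (predicted, reported) pairs shows where the quiz goes wrong.

```ruby
# Hypothetical records: what the quiz predicted vs. what the taker reported.
responses = [
  { predicted: "linguistics", reported: "linguistics" },
  { predicted: "linguistics", reported: "engineering" },
  { predicted: "engineering", reported: "engineering" },
  { predicted: "art",         reported: "art" },
  { predicted: "art",         reported: "linguistics" }
]

# Fraction of takers whose self-report agrees with the quiz result.
matches = responses.count { |r| r[:predicted] == r[:reported] }
accuracy = matches.to_f / responses.size

# Count each (predicted, reported) pair to see which labels get confused.
confusion = Hash.new(0)
responses.each { |r| confusion[[r[:predicted], r[:reported]]] += 1 }

puts format("accuracy: %.0f%%", accuracy * 100)
confusion.each { |(pred, rep), n| puts "#{pred} -> #{rep}: #{n}" }
```

Self-reported answers are noisy, so this measures agreement with what people say about themselves rather than ground truth.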

Linguistic Issues in Language Technology (LiLT) is a new open-access journal in computational linguistics. The journal will focus on techniques that bring linguistics back into language technologies (LT). LT currently focuses a lot on statistical techniques and can sometimes ignore linguistic insight altogether, but the field is beginning to swing around from the purely statistical approach to one that takes linguistic insight into account and merges it with statistical methods.

Curious about what sort of credibility this journal would have, I browsed the editorial staff and found some pretty big hitters. Following are some of the names that stood out to me. Christopher Manning of Stanford wrote the textbook used in my Language and Statistics class. Kemal Oflazer was one of my previous professors, who was visiting CMU last year. He’s done a lot of work with finite state transducers for morphological analysis of Turkish, among other things. Mark Liberman and Aravind Joshi of the University of Pennsylvania are pretty well known and accomplished. Aravind Joshi came up with Tree Adjoining Grammar and both he and Martin Kay won the ACL Lifetime Achievement Award. Mark Steedman is the current president of the ACL (Association for Computational Linguistics). Jason Eisner has done a lot of work on applying statistics to linguistic approaches and advised one of my current professors, Noah Smith. Philip Resnik has done a lot with word alignment and statistical machine translation.