Posts Tagged ‘python’

There are quite a few well-known libraries for doing various NLP tasks in Java and Python, such as the Stanford Parser (Java) and the Natural Language Toolkit (Python).  For Ruby, there are a few resources out there, but they are usually derivative or not as mature.  By derivative, I mean they are ports from other languages or extensions using code from another language.  And I’m responsible for two of them! :)

  • Treat – Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I’ve encountered so far for Ruby
    • Text extractors for various document formats
    • Chunkers, segmenters, tokenizers
    • LDA
    • much more – the list is big
  • Ruby Linguistics – this is one of the more ambitious projects, but is not as mature as NLTK
    • interface for WordNet
    • Link grammar parser
    • some inflection stuff
  • Stanford Core NLP – if you’ve gotten a headache trying to use the Java bridge, this is your answer
  • Stanford Parser interface – uses a Java bridge to access the Stanford Parser library
  • Mark Watson has a part of speech tagger [zip], a text categorizer [zip], and some text extraction utilities [zip], but I haven’t tried to use them yet
  • LDA Ruby Gem – Ruby port of David Blei’s lda-c library by yours truly
    • Uses Blei’s C code for the actual LDA, but I include some wrappers to make using it a bit easier
  • UEA Stemmer – Ruby port (again by yours truly) of a conservative stemmer based on Jenkins and Smith’s UEA Stemmer
  • Stemmer gem – Porter stemmer
  • Lingua Stemmer – another stemming library, Porter stemmer
  • Ruby WordNet – basically what’s included in Ruby Linguistics
  • Raspell – Ruby interface to Aspell spell checker

There are also a number of fledgling or orphaned projects out there purporting to be ports of, or interfaces to, various other libraries like the Stanford POS Tagger and Named Entity Recognizer.  Ruby (straight Ruby, not just JRuby) can interface with just about any Java library using the Ruby Java Bridge (RJB).  RJB can be a pain, and I could only initialize it once per run (a second attempt never succeeds), so there are some limitations.  But using it, I was able to easily interface with the Stanford POS tagger.

So while there aren’t terribly many libraries for NLP tasks in Ruby, the ability to interface with Java directly widens the scope quite a bit.  You can also incorporate a C library using extensions.

Naturally, if I missed anything, no matter how small, please let me know.

Update: Here is a great list of AI-related ruby libraries from Dustin Smith.

John Cook just brought up the changeover from Scheme to Python in MIT’s beginning CS classes. I was exposed to Scheme very early in my programming career during my ill-fated quarter at the University of Chicago.  For some reason I can’t remember (it was 14 years ago), I registered late and couldn’t get into entry level CS classes.  So I enrolled in an AI class (against the advice of my undergrad advisor) without really knowing how to program. This was old school AI, not machine learning, so it wasn’t the maths that got me. The first programming assignment threw me completely for a loop: I had never seen Scheme before and didn’t know a thing about it. My world up to that point had consisted of Pascal and BASIC, with a smattering of assembly.  The logic behind the AI stuff made sense, but the logistics of getting Scheme to do what I wanted escaped me and I dropped the class.  Turns out that advisor was worth listening to!

Whenever something like this happens, you will see three groups of commenters emerge.  First are the I-don’t-care’s.  Actually, you don’t see them since they don’t give a crap.  The next are the fanboys.  They love the new language and are glad that MIT has discarded a dinosaur in favor of the language of Heaven.  And finally you have the sticks in the mud who lament the death of computer science because a whole generation will grow up retarded thanks to not learning programming just the way they did.  Obviously, these are exaggerated — I say it to shock the mind.

Cognitive psychology would have me believe that by drawing stark lines and exaggerating the situation, I will actually cause people to align themselves more closely with the stereotypes I laid out.  The logical alternative would be to view it as a joke, take a step back, and examine your own reaction.  Why do people get so worked up about this?  Why do I get so worked up about people getting so worked up?  :P

Maybe I’m getting crotchety in my old age.

I just spent the day with a couple of friends at the Google App Engine Hackathon in Atlanta.  We got to see Google Atlanta – or the public part of it anyway.  We weren’t permitted in the cafeteria or in the actual office area, which would have required signing non-disclosure agreements.  The office was about what I expected — the Google colors were in abundance, there were giant bouncing balls, and free drinks! (non-alcoholic)

We spent the day in a fairly hot conference room hacking away on a variety of projects.  We set up teams beforehand to work on projects that people proposed and I chose to work on a variation of a computing puzzles site, dubbed LangWar.  The idea is fairly simple in the early stages:  people submit programming puzzles and other people post their solutions in code form.  You can vote on which questions you like and which answers you like (or dislike).  You can also leave comments on questions and answers.  The result of the ratings is that the best questions will be ranked higher, in a method similar to Reddit, and the best answers will trickle to the top based on the votes of users.

This is very similar to Stack Overflow, but different in that it is intended to be more of a puzzle solving site that pits implementations in different programming languages against each other.  It’s sort of a battle royale of programming languages – thus the name, LangWar.  It’s more of an enhanced version of Project Euler, where people can vote on the questions and answers.
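To make the vote-based ordering a bit more concrete, here is a minimal sketch in python. It is not LangWar's code, and I am not claiming it is what Reddit does: it swaps in the lower bound of the Wilson score interval, one common way to rank things by up and down votes without letting a single lucky vote beat a long track record. The function names and the (label, ups, downs) tuples are made up for illustration.

```python
import math

def wilson_lower_bound(ups, downs, z=1.96):
    """Lower bound of the Wilson score interval for the share of up-votes.

    z=1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    n = ups + downs
    if n == 0:
        return 0.0  # no votes yet, so no evidence either way
    p = ups / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)

def rank_answers(answers):
    """answers: list of (label, ups, downs) tuples; best-supported first."""
    return sorted(answers,
                  key=lambda a: wilson_lower_bound(a[1], a[2]),
                  reverse=True)
```

With this rule an answer with ten up-votes and no down-votes outranks one with a single up-vote, even though both are "100% liked", because the score rewards the amount of evidence as well as the ratio.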

In any case, it was a great chance to get my hands dirty in Google App Engine, meet some Atlanta python coders, and have fun.  It’ll be interesting to see where LangWar goes from here, if it does go anywhere.

I’ve begun learning ruby for my new job, a language that doesn’t seem to have really gotten any traction in the NLP community (at least not that I’ve heard).  I had been using python for my NLP stuff (homework and projects) and Java for my recommender system stuff.  In retrospect, I could have used python for the recommender stuff, but I wasn’t aware of some speed-ups so resorted to Java.  Of course, the recommender stuff isn’t strictly NLP.  Ruby is just as well suited as python and seems a lot better than Java for many tasks (though Java certainly has its place).  At the very least, a scripting language like ruby or python is great for prototyping.  It’s easy to test new ideas quickly.

I was reading through Pang et al. (2002), which deals with classifying movie reviews as positive or negative.  They look at three machine learning approaches:  Naive Bayes, Maximum Entropy classifier and Support Vector Machines.  This seemed like a good opportunity to try out my nascent ruby skills, since it’s the kind of crap I can roll together in python in short order (and do all the time).  So I downloaded the data for the paper (actually I downloaded the later data from the 2004 paper).  There are 1000 positive and 1000 negative movie reviews.  The task is to train a classifier to determine whether a review expresses a positive opinion (the author liked the movie) or a negative opinion (the author did not like the movie).  I chose to just use SVMs since they do best for this task according to the paper, they do really well for text categorization, and they are easy to download and use.

The results were quite nice.  Ruby turned out to be just as handy as python at manipulating text and dealing with cross-validation:  the two main “challenges” in implementing this paper.  I used tf-idf for weighting the features and thresholded document frequency to discard words that didn’t appear in at least three reviews.  The result was that I achieved about 85.7% accuracy using the same cross-validation setup described in their followup work (Pang and Lee, 2004).  In other words, the classifier could correctly guess the opinion orientation of reviews as positive or negative nearly 86% of the time.
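For anyone who wants to see the feature weighting spelled out, here is a rough python sketch of tf-idf with a document frequency cutoff. It is not the ruby code I actually ran. The function name, the whitespace-free token lists, and the particular tf-idf variant (raw count times log(N/df)) are assumptions for illustration; there are several variants and the choice can matter.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df=3):
    """docs: list of token lists. Returns (vocab, list of {term: weight}).

    Terms appearing in fewer than min_df documents are discarded.
    """
    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Apply the document frequency threshold.
    vocab = {term for term, count in df.items() if count >= min_df}

    n_docs = len(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(t for t in tokens if t in vocab)
        # Assumed variant: raw term count times log(N / df).
        vectors.append({t: c * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vocab, vectors
```

Each resulting dictionary is a sparse feature vector for one review, which is the kind of input most SVM packages expect once the terms are mapped to integer indices.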

Pang et al. (2002) discussed some of their errors and hypothesized that discourse analysis might improve results, since reviewers often use sarcasm.  There’s also the case where authors use a “thwarted expectations” narrative.  This offered me one of the few chuckles I’ve ever had while reading a research paper:

“I hate the Spice Girls. … [3 things the author hates about them] …  Why I saw this movie is a really, really, really long story, but I did and one would think I’d despise every minute of it.  But… Okay, I’m really ashamed of it, but I enjoyed it.  I mean, I admit it’s a really awful movie …the ninth floor of hell… The plot is such a mess that it’s terrible.  But I loved it.”


Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.  “Thumbs Up?  Sentiment Classification Using Machine Learning Techniques.”  In Proceedings of the ACL 02 conference on Empirical Methods in Natural Language Processing – Volume 10, July 2002. [pdf]

Bo Pang and Lillian Lee.  “A Sentimental Education:  Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts.”  In Proceedings of the ACL, 2004. [pdf]

A couple of days ago, I wrote a script that would tweet anything you plurked. Thanks to some code from Neville Newey (based on PHP code by Charl van Niekerk), the script I wrote has been updated to both plurk your tweets and tweet your plurks. This should work on both windows and linux machines. If you have access to a linux machine, I suggest setting up a cron job to take care of this. As I mentioned in the previous post, if you set up a cron job, be sure to change the path to plurkdb.dat to an absolute path. I have done the most testing on this with python 2.4 in linux.

This code is open source under the Creative Commons BSD license. Neville’s code appears to be under CC:Attribution 2.5 for South Africa, by what I could glean from his site. I have considered making this an open source project under Google Code but have yet to take it all the way. Google sets a lifetime limit of 10 projects, so I will continue to hoard those against future need. If you make modifications to the code, please let me know and I will probably post them here and in the code for future releases, so we all win.

Note that the command line parameters have changed: <twitter username> <twitter password> <plurk username> <plurk password>

And of course, as with all software, use at your own risk.

Tweet your plurks

Posted: 2 June 2008 in Uncategorized

If you want to use Plurk, but aren’t ready to leave Twitter, I wrote a little python script you can use to automatically mirror your plurks on Twitter. This will not work for response plurks, but your main plurks will be extracted and posted to your Twitter account with the prefix “plurking:” followed by your plurk.

The resulting tweet looks like this:

[Image: sample of what the script outputs in Twitter]

Download the script and set it up as a cron job (or you could execute it manually). It should work with python 2.4 and later. It stores a plurkdb.dat file (which you should probably assign an absolute path to, depending on the behavior of cron on your system). The script checks this file every time it runs to make sure that duplicate plurks aren’t being tweeted. You should pass the following parameters on the command line (or modify the script so they are hardcoded, if you want): <twitter username> <twitter password> <plurk username> <plurk password>. Update: see later post on updated plurk script.  And as with all software, use at your own risk.

Please let me know if you have any problems with it or see room for improvement. I hacked this out in a hurry, so …
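For the curious, the duplicate check boils down to something like the sketch below. This is not the script itself. It is written for a current python rather than 2.4, it assumes plurkdb.dat holds one plurk ID per line (the real file format may differ), and mirror_new and post are hypothetical names, with post standing in for whatever actually sends the tweet.

```python
import os

def load_seen(path):
    """Return the set of plurk IDs already mirrored (empty if no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}

def mirror_new(plurks, path, post):
    """Post each unseen (plurk_id, text) pair and record its ID in the file.

    post is any callable that takes the tweet text (hypothetical stand-in).
    Returns the list of IDs posted during this run.
    """
    seen = load_seen(path)
    posted = []
    with open(path, "a") as fh:
        for plurk_id, text in plurks:
            if plurk_id in seen:
                continue  # already tweeted on an earlier run
            post("plurking: " + text)
            fh.write(plurk_id + "\n")
            seen.add(plurk_id)
            posted.append(plurk_id)
    return posted
```

Passing an absolute path for the file matters under cron for exactly the reason mentioned above: a relative path is resolved against whatever working directory cron happens to use, so the script might never find the IDs it saved last time.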

So I decided to finally fart around with OpenCalais a little. There’s a nice video on the site that gives you an impression of what it is capable of, but it’s also like all videos about software: propaganda. Calais is basically Named Entity Recognition (NER) software that can be accessed via a web API. Whereas a regular NER system might recognize named entities like people, organizations, and places, Calais also recognizes relationships like corporate acquisitions. To be a little more clear if you aren’t familiar with NER, it is basically the task of identifying the proper nouns in a body of text. Named entities aren’t always proper nouns, but that is one starting point. Examples would be: John Hancock (Person), New York (Place), and Apple (Organization). Calais recognizes relationships, which means we get an extra layer of information: Acquisition(Microsoft, Yahoo!).

Calais is put out by Reuters, which has a long history of helping out the NLP and IR research communities with data sets. Being Reuters, the data sets are all newswire stuff, and Calais is produced in that spirit. Currently the relationships and named entities available reflect that bias, but the list is expanding and it is probably flexible enough for most domains. Their claim is that with each new release, there will be additional entities and relationships available. Also, the software is completely free for commercial and private use. For this, I give Reuters props.

OpenCalais uses SOAP or HTTP POST to issue requests and you can take a look at their tutorials for exactly how to use it. After some very shallow digging on the googles, I found an open source project called python-calais, which is basically just a script that wraps some text and sends it to the Calais service, then processes the output. The output is in RDF (Resource Description Framework), delivered here as an XML document that is not very friendly to the human eye but is nice and powerful otherwise. The python-calais script uses an RDF library for python, so you’ll need to download that if you don’t already have it.

Running it on my most popular post, you get the following output:

93B6642D-0D7C-37Ab-A92F-66Ebfef13C8D :: Recommender Systems (Industryterm)
0Dccb106-442A-3848-Bd0B-A388E73F4C8C :: Chris Sternal-Johnson (Person)
Aab0D16A-Ad5A-348A-A8Dc-58Cf59A1Bc15 :: Kristina Tikhova (Person)
42F476A0-2Fae-3F36-808D-803E4F620Ab0 :: Java (Technology)
6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)
4003D863-C7A6-3E6F-8E3C-0913Bf2F8242 :: National Aeronautics And Space Administration (Organization)
77D1Ceb3-9900-3Dd7-8351-F29408B21412 :: Carnegie Mellon University (Organization)
Ee58Ef4B-1C98-3F8B-Aff8-3Fd6E3D76A9E :: Wonderful Site (Industryterm)
8F12E551-A8F1-3705-866C-D44D1A6A54F4 :: Richard M. Hogg (Person)
Adee23De-B1B0-37Ad-9E20-1Fa8094F6D39 :: Steel (Industryterm)
0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)
2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)
136157D8-D62E-3C55-Ae67-3Ec182C2C703 :: Phil Barthram (Person)
B6A8Dbfa-Fd35-32Bb-9E05-A2811C480000 :: Mike Tan (Person)
Ed8B5Fe4-616A-36Ea-8C47-3Eea7C71Aee0 :: Ben Eastaugh (Person)
D3Bcba58-00Fc-33C5-9346-Dbf6A2441867 :: Machine Learning (Technology)
F17C3779-3810-3Ff9-A42D-75C3137F0F7F :: Modern English (Person)
38116E8D-F8B4-3D03-B0Ad-C9A24B888E61 :: Jason M. Adams (Person)
4386B07C-F6B8-3991-Af74-Ab11A951F0Ee :: David Petar Novakovic (Person)
Aa14303F-F9F0-31B8-Adff-3B9C68E0A9F1 :: Language Technologies Institute (Organization)
Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)

As it picks up everything on the page, there is a lot included there that isn’t related to the post about Old English translation. Also, it picks up some weird so-called industry terms like “steel.” If you filter out just the text (manually), the output is a little more sensible:

6C4Cd5D9-5866-35B5-81Ab-B8A5C1751A44 :: Pre-Processing Phase (Industryterm)
Ca1E4Eb7-7820-3862-8443-26E37B33E13F :: Machine Translation (Technology)
0Ace00C6-2B9F-32C2-8949-82A0F6C6B444 :: Xml (Technology)
2Ed2F085-1C63-324E-B518-60332388E273 :: Norman French (Person)
136157D8-D62E-3C55-Ae67-3Ec182C2C703 :: Phil Barthram (Person)
F17C3779-3810-3Ff9-A42D-75C3137F0F7F :: Modern English (Person)

(The codes are unique identifiers.) Unfortunately, some important terms are still missed, like Old English. So it appears Calais has some growing to do, but it’s off to a good start. Part of the problem might be that that blog post is out of domain. I imagine with time, it will continue to improve. We’ll see.
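If you want to slice the printed output by entity type rather than eyeballing it, a few lines of python will do. This sketch only parses the "identifier :: name (type)" lines shown above, not the underlying RDF, and the function names are my own invention rather than anything python-calais provides.

```python
import re

# One printed line looks like: "<identifier> :: <name> (<type>)"
LINE = re.compile(r"^(?P<id>\S+) :: (?P<name>.+) \((?P<type>[^()]+)\)$")

def parse_entities(text):
    """Return a list of (identifier, name, type) tuples from printed output."""
    entities = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:  # silently skip anything that is not an entity line
            entities.append((match.group("id"),
                             match.group("name"),
                             match.group("type")))
    return entities

def names_of_type(entities, wanted):
    """Return the names of all entities whose type equals wanted."""
    return [name for _id, name, etype in entities if etype == wanted]
```

Asking for the Person entries in the filtered output makes the misfires easy to spot, since "Norman French" and "Modern English" show up right next to Phil Barthram.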