Posts Tagged ‘java’

There are quite a few well-known libraries for doing various NLP tasks in Java and Python, such as the Stanford Parser (Java) and the Natural Language Toolkit (Python).  For Ruby, there are a few resources out there, but they are usually derivative or not as mature.  By derivative, I mean they are ports from other languages or extensions using code from another language.  And I’m responsible for two of them! :)

  • Treat – Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I’ve encountered so far for Ruby
    • Text extractors for various document formats
    • Chunkers, segmenters, tokenizers
    • LDA
    • much more – the list is big
  • Ruby Linguistics – this is one of the more ambitious projects, but is not as mature as NLTK
    • interface for WordNet
    • Link grammar parser
    • some inflection stuff
  • Stanford Core NLP – if you’ve gotten a headache trying to use the Java bridge, this is your answer
  • Stanford Parser interface – uses a Java bridge to access the Stanford Parser library
  • Mark Watson has a part of speech tagger [zip], a text categorizer [zip], and some text extraction utilities [zip], but I haven’t tried to use them yet
  • LDA Ruby Gem – Ruby port of David Blei’s lda-c library by yours truly
    • Uses Blei’s C code for the actual LDA, but I include some wrappers to make using it a bit easier
  • UEA Stemmer – Ruby port (again by yours truly) of a conservative stemmer based on Jenkins and Smith’s UEA Stemmer
  • Stemmer gem – Porter stemmer
  • Lingua Stemmer – another stemming library, Porter stemmer
  • Ruby WordNet – basically what’s included in Ruby Linguistics
  • Raspell – Ruby interface to Aspell spell checker

There are also a number of fledgling or orphaned projects out there purporting to be ports of or interfaces for various other libraries, like the Stanford POS Tagger and Named Entity Recognizer.  Ruby (straight Ruby, not just JRuby) can interface with just about any Java library using the Ruby Java Bridge (RJB).  RJB can be a pain, and I could only initialize it once per run (a second attempt never succeeds), so there are some limitations.  But using it, I was able to interface with the Stanford POS tagger fairly easily.

So while there aren’t terribly many libraries for NLP tasks in Ruby, the ability to interface with Java directly widens the scope quite a bit.  You can also incorporate a C library using native extensions.

Naturally, if I missed anything, no matter how small, please let me know.

Update: Here is a great list of AI-related Ruby libraries from Dustin Smith.

Java maps and sorting

Posted: 1 August 2009 in Uncategorized

I’m always a little annoyed that I have to implement sorting Map keys by their values myself in Java.  It seems like this should be part of the standard Collections library or something.  Maybe it is and I just haven’t seen it?  My solution (gist) is based on feedback from Josh in the comments on a previous post.  How does that look to you?
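For contrast, the same thing in Ruby is essentially a one-liner (my own aside, not the Java solution from the gist):

```ruby
# Sorting a hash's keys by their values, descending.
counts = { "java" => 3, "ruby" => 7, "python" => 5 }
sorted_keys = counts.sort_by { |_key, value| -value }.map(&:first)
# => ["ruby", "python", "java"]
```

This kind of thing is exactly why scripting languages feel so much lighter for everyday data munging.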

Fun with trees in Ruby

Posted: 20 November 2008 in Uncategorized

Like Java and unlike Python, Ruby does not support multiple inheritance.  There is also no explicit way to create an interface.  One way Ruby lets you get around both problems is by allowing you to include a module in a class.  It’s not quite the same, but with proper planning you can duplicate the functionality.  Of course, one question you should always ask yourself when trying to shoehorn something from one language into another is whether you’re really going about it the right way.

One way of implementing a Java-like interface in Ruby is to create a module containing the skeleton methods you want the including class to define.

module A
  def method1
    raise NotImplementedError, "method1 not implemented"
  end
end

class B
  include A
  def method1
    puts "now implemented"
  end
end
Presto, module A is basically a Java interface.  Of course, whether a method has been implemented is not checked until runtime when the method is actually called.  Also if you mix in implemented methods alongside the interface methods, you have something very like an abstract class (minus the compile-time checking).
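To make the abstract-class flavor concrete, here’s a small sketch (mine, not from any particular library) of a module that mixes an implemented method in alongside a skeleton one:

```ruby
# A module acting like an abstract class: one concrete method
# built on top of one "abstract" method.
module Greeter
  # "Abstract" method: including classes must override this.
  def name
    raise NotImplementedError, "including class must define #name"
  end

  # Concrete method implemented in terms of the abstract one.
  def greet
    "Hello, #{name}!"
  end
end

class World
  include Greeter
  def name
    "world"
  end
end
```

Calling `World.new.greet` works, while a class that includes Greeter without defining `name` raises NotImplementedError the first time `greet` is called, at runtime rather than compile time.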

This came up when I was implementing a bunch of simple tree functions, like finding the siblings of a node, the grandparent, the descendants, the leaves of a subtree, etc.  It seemed like these were things that should be implemented somewhere already.  And why not?  So I threw all of those methods into a module and made it work like a Java abstract class.  All you have to implement is a method that returns the parent of the current node (or nil if there is none) and a method that returns an Array of the children of the current node.  Your class can pull children from a database, a file, something more complex; it doesn’t matter.  Just implement those two methods, drop in the SimpleTree module, and problem solved.
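As a rough sketch of the idea (the method names here are my own guesses, not necessarily the gem’s actual API), such a module might look like this:

```ruby
# Hypothetical sketch of a tree mixin in the spirit of SimpleTree.
# Including classes must implement #parent (nil at the root) and
# #children (an Array); everything else is derived from those two.
module SimpleTreeSketch
  def root?
    parent.nil?
  end

  def siblings
    return [] if root?
    parent.children - [self]
  end

  def grandparent
    parent && parent.parent
  end

  # All nodes below this one, depth-first.
  def descendants
    children.flat_map { |c| [c] + c.descendants }
  end

  # Leaf nodes of the subtree rooted here.
  def leaves
    children.empty? ? [self] : children.flat_map(&:leaves)
  end
end

# A minimal in-memory node satisfying the two-method contract.
class Node
  include SimpleTreeSketch
  attr_reader :parent, :children

  def initialize(parent = nil)
    @parent = parent
    @children = []
    parent.children << self if parent
  end
end
```

The Node class here keeps everything in memory, but a class backed by a database or file would work just as well as long as it answers `parent` and `children`.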

Since I’ve been having fun with gems, I made one for this and slapped it up on GitHub.  To get it, just type:

sudo gem install ealdent-simple-tree

Assuming that you have already done this at some point in the past:

gem sources -a

Feel free to extend it, modify it, contribute to it, etc. I’m using the BSD license, which is my current favorite.

I’ve begun learning Ruby for my new job, a language that doesn’t seem to have gotten much traction in the NLP community (at least not that I’ve heard).  I had been using Python for my NLP stuff (homework and projects) and Java for my recommender system stuff.  In retrospect, I could have used Python for the recommender stuff too, but I wasn’t aware of some speed-ups, so I resorted to Java.  Of course, the recommender stuff isn’t strictly NLP.  Ruby is just as well suited as Python and seems a lot better than Java for many tasks (though Java certainly has its place).  At the very least, a scripting language like Ruby or Python is great for prototyping.  It’s easy to test new ideas quickly.

I was reading through Pang et al. (2002), which deals with classifying movie reviews as positive or negative.  They look at three machine learning approaches: Naive Bayes, a Maximum Entropy classifier, and Support Vector Machines.  This seemed like a good opportunity to try out my nascent Ruby skills, since it’s the kind of crap I can roll together in Python in short order (and do all the time).  So I downloaded the data for the paper (actually the later data from the 2004 paper).  There are 1000 positive and 1000 negative movie reviews.  The task is to train a classifier to determine whether a review expresses a positive opinion (the author liked the movie) or a negative one (the author did not).  I chose to just use SVMs, since they do best for this task according to the paper, they do really well for text categorization generally, and they are easy to download and use.

The results were quite nice.  Ruby turned out to be just as handy as Python at manipulating text and handling cross-validation: the two main “challenges” in implementing this paper.  I used tf-idf to weight the features and thresholded document frequency to discard words that didn’t appear in at least three reviews.  The result was that I achieved about 85.7% accuracy using the same cross-validation setup described in their follow-up work (Pang and Lee, 2004).  In other words, the classifier could correctly guess the opinion orientation of reviews as positive or negative nearly 86% of the time.
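For the curious, here’s roughly what tf-idf weighting with a document-frequency cutoff looks like in Ruby; this is a reconstruction of the general technique, not the actual code behind the numbers above:

```ruby
# Sketch of tf-idf feature weighting with a document-frequency
# threshold. Each document is an Array of tokens; the return value
# is an Array of { term => weight } hashes, one per document.
def tf_idf_vectors(docs, min_df: 3)
  # Document frequency: number of documents each term appears in.
  df = Hash.new(0)
  docs.each { |doc| doc.uniq.each { |term| df[term] += 1 } }

  # Discard terms below the document-frequency threshold.
  vocab = df.select { |_, count| count >= min_df }.keys

  n = docs.size.to_f
  docs.map do |doc|
    tf = doc.tally  # raw term counts within this document
    vocab.each_with_object({}) do |term, vec|
      vec[term] = tf[term] * Math.log(n / df[term]) if tf[term]
    end
  end
end
```

With the resulting sparse vectors in hand, each fold of the cross-validation just becomes a matter of writing them out in your SVM package’s input format.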

Pang et al (2002) discussed some of their errors and hypothesized that discourse analysis might improve results, since reviewers often use sarcasm.  There’s also the case where authors use a “thwarted expectations” narrative.  This offered me one of the few chuckles I’ve ever had while reading a research paper:

“I hate the Spice Girls. … [3 things the author hates about them] …  Why I saw this movie is a really, really, really long story, but I did and one would think I’d despise every minute of it.  But… Okay, I’m really ashamed of it, but I enjoyed it.  I mean, I admit it’s a really awful movie …the ninth floor of hell… The plot is such a mess that it’s terrible.  But I loved it.”


Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.  “Thumbs Up?  Sentiment Classification Using Machine Learning Techniques.”  In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP), July 2002. [pdf]

Bo Pang and Lillian Lee.  “A Sentimental Education:  Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts.”  In Proceedings of the ACL, 2004. [pdf]

So I am on the market after getting my master’s.  I’ve posted my resume to Dice, Monster, and a couple of others.  Monster gets the most unsolicited calls.  I’m finding that recruiters are an odd lot.  Some are pleasant, though to a man (or woman) they’ve never heard of NLP or computational linguistics and have no idea how to help me (with the exception of the one or two recruiters I’ve contacted for NLP jobs).  For the most part, they don’t seem to even read my resume.  Oh, you have Java skills?  How about this Java grunt job that only requires a bachelor’s degree?  Waste of time.  The best are the ones who contact me in broken English with a multitude of typos.  Yeah, right.

I have been told that with my CMU degree, I should be looking exclusively at the big corporations: Google, Microsoft, Amazon, Yahoo, etc.  If I do my time there, I can get a job anywhere and have a good career.  That’s true, I’m sure.  Something about startups is really attractive to me, though, so I’ve been looking at a lot of them.  But what if the only job I can get at a Googlosoftazonahoo is not NLP-related?  Everything is so rushed.  I have a September 1st exit date from CMU, and I want to be in the city of my chosen job by then.  Add lease problems to the mix.  The trouble is that my decision to abandon academia didn’t come at the right time, which would have been back in the winter.  I am, however, more confident than ever that it was the right decision.

Just what value is there in getting a degree in Computer Science (CS)? Are new graduates competent programmers? Is that the purpose of a CS degree? Should companies be spending money to train new hires out of college in the programming languages and practices that they use?

Robert Dewar is a professor emeritus of computer science at NYU, and he believes that the status of software engineers in America is in danger due to the general incompetence of new graduates.  The long and the short of it is that after the dot-com bubble burst and computer science enrollment at universities plummeted, schools restructured their programs to be more fun.  Essentially, they were dumbed down.  Specifically, the focus has shifted away from math and the theory of computation.  Students are not taught a wide range of programming practices but instead are trained to rely on large software libraries in a sort of “cookbook” approach.  That is, students can assemble a solution to a known problem (in Java), but they are woefully undertrained in the “more practical” programming skills needed to solve novel problems in the wild.


Recommended Reading

Posted: 23 December 2007 in Uncategorized

I think this should be required reading for any novice programmer and probably even more so for established programmers. Agree with him or not, I think you’ll agree that Steve Yegge has some interesting things to say. My favorite quote:

“Bigger is just something you have to live with in Java. Growth is a fact of life. Java is like a variant of the game of Tetris in which none of the pieces can fill gaps created by the other pieces, so all you can do is pile them up endlessly.”

This is especially interesting to me, as I just jumped on the IDE bandwagon.  I received a few interesting comments on that post that are worth reading.  A minor theme was that you just can’t handle a massive code base without some kind of IDE (Integrated Development Environment).  I have worked with a code base of about 20,000 lines of Java with no IDE, and there were certainly challenges.  I have also worked with a code base of over 100k lines of C (not ++), and that was a pain in the butt.  Massive changes took me days to complete and then weeks to debug.  Having an IDE would have made it easier, but it also would have made the code base much larger.  It is so easy to bloat up code with every kind of get/set method and constructor there is, but many of them are never used.  Is that a bad thing or just good future planning?  There is definitely a trade-off, and one that probably comes down on the side of bad thing more often than not.

In any case, it’s something I have to keep in mind as I go forward with my new project.