Posts Tagged ‘stemmers’

There are quite a few well-known libraries for doing various NLP tasks in Java and Python, such as the Stanford Parser (Java) and the Natural Language Toolkit (Python).  For Ruby, there are a few resources out there, but they are usually derivative or not as mature.  By derivative, I mean they are ports from other languages or extensions using code from another language.  And I’m responsible for two of them! :)

  • Treat – Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I’ve encountered so far for Ruby
    • Text extractors for various document formats
    • Chunkers, segmenters, tokenizers
    • LDA
    • much more – the list is big
  • Ruby Linguistics – this is one of the more ambitious projects, but is not as mature as NLTK
    • interface for WordNet
    • Link grammar parser
    • some inflection stuff
  • Stanford Core NLP – if you’ve gotten a headache trying to use the Java bridge, this is your answer
  • Stanford Parser interface – uses a Java bridge to access the Stanford Parser library
  • Mark Watson has a part-of-speech tagger [zip], a text categorizer [zip], and some text extraction utilities [zip], but I haven’t tried to use them yet
  • LDA Ruby Gem – Ruby port of David Blei’s lda-c library by yours truly
    • Uses Blei’s C code for the actual LDA, but I include some wrappers to make using it a bit easier
  • UEA Stemmer – Ruby port (again by yours truly) of a conservative stemmer based on Jenkins and Smith’s UEA Stemmer
  • Stemmer gem – Porter stemmer
  • Lingua Stemmer – another stemming library, Porter stemmer
  • Ruby WordNet – basically what’s included in Ruby Linguistics
  • Raspell – Ruby interface to Aspell spell checker

There are also a number of fledgling or orphaned projects out there purporting to be ports or interfaces for various other libraries like the Stanford POS Tagger and Named Entity Recognizer.  Ruby (straight Ruby, not just JRuby) can interface with just about any Java library using the Ruby Java Bridge (RJB).  RJB can be a pain, and I could only initialize it once per run (a second attempt never succeeds), so there are some limitations.  But using it, I was able to easily interface with the Stanford POS tagger.

So while there aren’t terribly many libraries for NLP tasks in Ruby, being able to interface with Java directly widens the scope quite a bit.  You can also incorporate a C library using extensions.

Naturally, if I missed anything, no matter how small, please let me know.

Update: Here is a great list of AI-related Ruby libraries from Dustin Smith.


A Twitter friend (@communicating) tipped me off to the UEA-Lite Stemmer by Marie-Claire Jenkins and Dan J. Smith.  Stemmers are NLP tools that strip inflectional and derivational affixes from words.  In English, that usually means removing the plural -s, progressive -ing, and preterite -ed.  Depending on the type of stemmer, it might also mean removing derivational suffixes like -ful and -ness.  Sometimes it’s useful to be able to reduce words like consolation and console to the same root form: consol.  But sometimes that doesn’t make sense.  If you’re searching for video game consoles, you don’t want to find documents about consolation.  In this case, you need a conservative stemmer.
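To make the console/consolation problem concrete, here is a rough sketch in Ruby.  It is a toy, not the UEA-Lite or Porter algorithm: the suffix lists, the minimum stem length, and the method names are all made up for illustration (treating a trailing -e as a suffix is a hack so that console reduces to consol).

```ruby
# Toy suffix lists for illustration only; real stemmers use far richer rules.
INFLECTIONAL = %w[ing ed s].freeze
# The trailing "e" is a hack so that "console" reduces to "consol".
DERIVATIONAL = %w[ation ness ful e].freeze

# Strip the first matching suffix, provided at least min_stem letters remain.
def strip_suffix(word, suffixes, min_stem = 3)
  suffixes.each do |suffix|
    next unless word.end_with?(suffix)

    stem = word[0...-suffix.length]
    return stem if stem.length >= min_stem
  end
  word
end

# Conservative: remove inflectional endings only.
def conservative_stem(word)
  strip_suffix(word.downcase, INFLECTIONAL)
end

# Aggressive: remove an inflectional ending, then one derivational ending.
def aggressive_stem(word)
  strip_suffix(conservative_stem(word), DERIVATIONAL)
end

%w[consoles consolation].each do |w|
  puts "#{w}: conservative=#{conservative_stem(w)} aggressive=#{aggressive_stem(w)}"
end
```

The aggressive version collapses both words to consol, while the conservative one keeps console and consolation apart, which is what you want for the video game search.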

The UEA-Lite Stemmer is a rule-based, conservative stemmer that handles regular words, proper nouns and acronyms.  It was originally written in Perl and has since been ported to Java.  Since I usually code in Ruby these days, I thought it’d be nice to make it available to the Ruby community, so I ported it over last night.
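For a feel of what “rule-based” means here, the sketch below tries an ordered list of pattern/replacement rules and applies the first one that matches, with guards that leave acronyms and capitalized words alone.  Everything in it is an assumption for illustration: the rules, the guards, the length cutoff, and the name rule_stem are mine and are not taken from the UEA-Lite rule set.  Treating any capitalized word as a proper noun is especially crude, since it also catches sentence-initial words.

```ruby
# Hypothetical rule table: [pattern, replacement], tried top to bottom.
# Illustrative rules only; this is NOT the UEA-Lite rule set.
TOY_RULES = [
  [/sses\z/, "ss"],     # classes  -> class
  [/ies\z/, "y"],       # parties  -> party
  [/ing\z/, ""],        # walking  -> walk
  [/ed\z/, ""],         # walked   -> walk
  [/(?<!s)s\z/, ""]     # consoles -> console, but "class" is left alone
].freeze

def rule_stem(word)
  return word if word.match?(/\A[A-Z]+\z/)       # acronym, e.g. NASA
  return word if word.match?(/\A[A-Z][a-z]+\z/)  # capitalized, assumed proper noun
  return word if word.length < 5                 # too short to stem safely

  TOY_RULES.each do |pattern, replacement|
    return word.sub(pattern, replacement) if word.match?(pattern)
  end
  word
end

%w[consoles consolation parties NASA Stanford].each do |w|
  puts "#{w} -> #{rule_stem(w)}"
end
```

Because there are no rules for derivational suffixes, consolation comes back unchanged, which is the conservative behaviour described above.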

The code is open source under the Apache License 2.0 and hosted on GitHub.  So please check out the code and let me know what you think.  Heck, you can even fork the project and make some improvements yourself if you want.

One direction I’d like to take is to turn all of the rules into finite state transducers, which can be composed into a single large deterministic finite state transducer.  That would be a lot more efficient (and even fun!), but Ruby lacks a decent FST implementation.
