Posts Tagged ‘software’

A Twitter friend (@communicating) tipped me off to the UEA-Lite Stemmer by Marie-Claire Jenkins and Dan J. Smith.  Stemmers are NLP tools that strip inflectional and derivational affixes from words.  In English, that usually means removing the plural -s, progressive -ing, and preterite -ed.  Depending on the type of stemmer, it might also mean removing derivational suffixes like -ful and -ness.  Sometimes it’s useful to be able to reduce words like consolation and console to the same root form: consol.  But sometimes that doesn’t make sense.  If you’re searching for video game consoles, you don’t want to find documents about consolation.  In that case, you need a conservative stemmer.
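To make the conservative/aggressive distinction concrete, here is a minimal Python sketch. The rule tables and function names are made up for illustration; they are not the UEA-Lite rules, which are far more extensive and careful.

```python
import re

# Hypothetical toy rules for illustration only; NOT the UEA-Lite rule set.
# Each rule is (pattern, replacement); the first matching rule wins.
INFLECTIONAL_RULES = [
    (re.compile(r"(\w{3,})ing$"), r"\1"),    # walking  -> walk
    (re.compile(r"(\w{3,})ed$"), r"\1"),     # walked   -> walk
    (re.compile(r"(\w{3,}[^s])s$"), r"\1"),  # consoles -> console (leaves "glass" alone)
]
DERIVATIONAL_RULES = [
    (re.compile(r"(\w{3,})ation$"), r"\1"),  # consolation -> consol
    (re.compile(r"(\w{3,})ness$"), r"\1"),   # darkness    -> dark
    (re.compile(r"(\w{3,})ful$"), r"\1"),    # hopeful     -> hope
    (re.compile(r"(\w{3,})e$"), r"\1"),      # console     -> consol
]


def apply_first(word, rules):
    """Apply the first rule whose pattern matches; otherwise return word unchanged."""
    for pattern, replacement in rules:
        stemmed, count = pattern.subn(replacement, word)
        if count:
            return stemmed
    return word


def conservative_stem(word):
    """Strip inflectional suffixes only."""
    return apply_first(word.lower(), INFLECTIONAL_RULES)


def aggressive_stem(word):
    """Strip inflectional suffixes, then one derivational suffix."""
    return apply_first(conservative_stem(word), DERIVATIONAL_RULES)
```

With these toy rules, the aggressive stemmer collapses consoles and consolation to the same stem, while the conservative one keeps them apart, which is what you want for the video game search.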

The UEA-Lite Stemmer is a rule-based, conservative stemmer that handles regular words, proper nouns and acronyms.  It was originally written in Perl and has since been ported to Java.  Since I usually code in Ruby these days, I thought it’d be nice to make it available to the Ruby community, so I ported it over last night.

The code is open source under the Apache License 2.0 and hosted on GitHub.  So please check out the code and let me know what you think.  Heck, you can even fork the project and make some improvements yourself if you want.

One direction I’d like to go is to turn all of the rules into finite-state transducers, which can be composed into a single large deterministic finite-state transducer.  That would be a lot more efficient (and even fun!), but Ruby lacks a decent FST implementation.


Tweet your plurks

Posted: 2 June 2008 in Uncategorized

If you want to use Plurk, but aren’t ready to leave Twitter, I wrote a little Python script you can use to automatically mirror your plurks on Twitter. This will not work for response plurks, but your main plurks will be extracted and posted to your Twitter account with the prefix “plurking:” followed by your plurk.

The resulting tweet looks like this:

[Image: sample of what the script outputs in Twitter]

Download the script and set it up as a cron job (or execute it manually). It should work with Python 2.4 and later. It stores its state in a plurkdb.dat file (which you should probably give an absolute path, depending on how cron behaves on your system). The script checks this file on every run so that duplicate plurks aren’t tweeted. Pass the following parameters on the command line (or hardcode them in the script, if you prefer): <twitter username> <twitter password> <plurk username> <plurk password>. Update: see later post on updated plurk script.  And as with all software, use at your own risk.
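For the curious, the duplicate check could look roughly like the sketch below. This is not the script’s actual code: the one-ID-per-line file format, the function names, and the (id, text) tuples are all assumptions made for illustration, and it is written for a current Python rather than 2.4.

```python
import os


def load_seen(path):
    """Return the set of plurk IDs already mirrored (empty if the file is missing)."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())


def save_seen(path, seen):
    """Write the seen IDs back, one per line (assumed format, not the real plurkdb.dat)."""
    with open(path, "w") as f:
        for plurk_id in sorted(seen):
            f.write(plurk_id + "\n")


def select_new(plurks, seen, prefix="plurking: "):
    """Given (id, text) pairs, return tweets for unseen plurks and record their IDs."""
    tweets = []
    for plurk_id, text in plurks:
        if plurk_id in seen:
            continue  # already tweeted on an earlier run
        tweets.append(prefix + text)
        seen.add(plurk_id)
    return tweets
```

Each run would call load_seen with an absolute path, post whatever select_new returns, and then call save_seen, so a cron job that starts in an unexpected working directory still finds the same file.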

Please let me know if you have any problems with it or see room for improvement. I hacked this out in a hurry, so …

OpenEphyra is a question answering (QA) system developed here at the Language Technologies Institute by Nico Schlaefer. He began his work at the University of Karlsruhe in Germany, but has since continued it at CMU and is currently a PhD student here. Since it is a home-grown language technologies package, I decided to check it out and play around. This is the first QA system I have used that wasn’t integrated in a search engine, so this isn’t exactly an expert review.

Getting started in Windows (or Linux or whatever) is pretty easy if you already have Apache Ant and Java installed. Ant isn’t necessary, but I recommend getting it if you don’t have it already. It’s just handy. First, download the OpenEphyra package from SourceForge. The download is about 59 MB; once it’s done, unpack it in whatever directory you want. Assuming you have Ant installed, all you have to do is type ant to build it, though you may want to issue ant clean first. I had to make one slight change to the build.xml file to get it to run. Line 55 read <jvmarg line="-server& #13;-Xms512m& #13;-Xmx1024m"/>, which I had to change to <jvmarg line="-server -Xms512m -Xmx1024m"/>. Easy enough. Then to run it, all you have to do is type ant OpenEphyra.

After taking a short bit to load up, it lets you enter questions on the command line. From what I can tell from the output, it begins by normalizing the question (removing morphology, getting rid of punctuation). Then it determines the type of answer it is looking for, like a person’s name or a place, and assigns certain properties to what it expects to find. Next it automatically creates a list of queries to send to the search engine(s). The documentation indicates that the AQUAINT, AQUAINT-2 and BLOG06 corpora are included (at least preprocessing is supported), but there are also searchers for Google, Wikipedia, Yahoo and several others. Indri is a search engine that supports structured queries, and from what I saw playing around, OpenEphyra auto-generates some structured queries. Once generated, the queries are sent to the various searchers, and the results are obtained and scored. Finally, if you’re lucky, you get an answer to your question.
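The first three stages described above (normalize, guess the answer type, generate queries) can be sketched in a few lines of Python. This is only a toy guess at the general shape; OpenEphyra itself is written in Java and is far more sophisticated, and every name, word list, and rule below is an assumption made for illustration.

```python
import re

# Hypothetical cue-to-type table; a real system uses much richer analysis.
ANSWER_TYPES = {
    "who": "PERSON",
    "where": "LOCATION",
    "when": "DATE",
    "how many": "NUMBER",
    "how much": "NUMBER",
}
QUESTION_WORDS = {"who", "what", "where", "when", "how", "many", "much"}
STOPWORDS = {"the", "a", "an", "of", "is", "was", "did", "do", "for", "can"}


def normalize(question):
    """Lowercase and strip punctuation (no real morphological analysis here)."""
    return re.sub(r"[^\w\s]", "", question.lower()).strip()


def answer_type(normalized):
    """Guess the expected answer type from the leading question cue."""
    for cue in sorted(ANSWER_TYPES, key=len, reverse=True):
        if normalized.startswith(cue + " "):
            return ANSWER_TYPES[cue]
    return "UNKNOWN"


def generate_queries(normalized):
    """Build a bag-of-keywords query and, when possible, a quoted phrase query."""
    keywords = [w for w in normalized.split()
                if w not in STOPWORDS and w not in QUESTION_WORDS]
    queries = [" ".join(keywords)]
    if len(keywords) > 1:
        queries.append('"' + " ".join(keywords) + '"')
    return queries
```

Sending the queries to the searchers and scoring the results is where the real work happens, and is not attempted here.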

Here are the results of screwing around with it for a few minutes:

  • Who created OpenEphyra?
    • no answer (sorry, Nico)
  • Who invented the cotton gin?
    • Eli Whitney
  • Who created man?
    • God
  • What is the capital of Mongolia?
    • Ulaanbaatar
  • Who invented the flux capacitor?
    • Doc Brown (awesome!)
  • Who is the author of the Mendicant Bug?
    • Zuckerberg — damn you, Facebook! :(
  • How much wood can a woodchuck chuck?
    • no answer (correct)
  • What is the atomic number of Curium?
    • 96 (also correct)
  • Who killed Lord Voldemort?
    • Harry (correct, but partial)
  • How many rings for elven kings?
    • 3021 (so, so very wrong)

Fun stuff! It’s not anywhere near perfect, but there are definite uses and the thing is ridiculously easy to install and use. Also, it’s in Java, so you can integrate it with your own system with very little effort. Depending on what sort of question you are looking for answers to, you get various levels of results. Factual questions about geography and people tend to do better than questions about numbers and fiction, as you might expect. Also, why-questions are not supported.

Another bonus is the project is open source, so if you’re into QA, you can help develop it.

Recommended Reading

Posted: 23 December 2007 in Uncategorized

I think this should be required reading for any novice programmer and probably even more so for established programmers. Agree with him or not, I think you’ll agree that Steve Yegge has some interesting things to say. My favorite quote:

“Bigger is just something you have to live with in Java. Growth is a fact of life. Java is like a variant of the game of Tetris in which none of the pieces can fill gaps created by the other pieces, so all you can do is pile them up endlessly.”

This is especially interesting to me as I just jumped on the IDE bandwagon.  I received a few interesting comments on that post that are worth reading.  A minor theme was the claim that you just can’t handle a massive code base without some kind of IDE (Integrated Development Environment).  I have worked with a code base of about 20,000 lines of Java with no IDE and there were certainly challenges.  I have also worked with a code base of over 100k lines of C (not ++) and that was a pain in the butt.  Massive changes took me days to complete and then weeks to debug.  Having an IDE would have made it easier, but it also would have made the code base much larger.  It is so easy to bloat up code with every kind of get/set method and constructor there is, but many of them are never used.  Is that a bad thing or just good future planning?  There is definitely a trade-off, and one that probably comes down on the side of bad thing more often than not.

In any case, it’s something I have to keep in mind as I go forward with my new project.


Posted: 1 December 2007 in Uncategorized

When I was around 12 or 13, I first got hold of my stepfather’s physics textbook. It was magic. The rules that governed the physical world were right there in the form of equations on a page. I was totally captivated. Newton’s laws of motion, gravity, angular momentum, and the theory of relativity. When I first learned about relativistic time dilation, it was life-changing. I resolved to become an astrophysicist. A lot of changes happened in my life that turned that dream into my current one. But, like all first loves, it never went away.

When I got my first computer, I had hopes of writing a program that would plot the positions of the stars as they were in space (3-D) versus how they appeared in the Earth’s sky (2-D). I achieved a little bit of success getting the vectors worked out from the distance, right ascension, declination and so on. I had no easy way of visualizing it though. Doing 3-D plots in BASIC back in 1990 wasn’t the easiest thing in the world. So that project died.
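For reference, the vector math boils down to a spherical-to-Cartesian conversion. The Python sketch below is a reconstruction of the standard formula, not my old BASIC code; the function name and the convention of right ascension in hours and declination in degrees are choices made for this example.

```python
import math


def star_to_cartesian(distance, ra_hours, dec_degrees):
    """Convert distance, right ascension and declination to (x, y, z).

    Assumed conventions: RA in hours (24 h = 360 degrees), declination in
    degrees, x toward RA 0 h on the celestial equator, z toward the north
    celestial pole. The result is in the same unit as distance.
    """
    ra = math.radians(ra_hours * 15.0)  # 1 hour of RA = 15 degrees
    dec = math.radians(dec_degrees)
    x = distance * math.cos(dec) * math.cos(ra)
    y = distance * math.cos(dec) * math.sin(ra)
    z = distance * math.sin(dec)
    return (x, y, z)
```

The hard part back then was never this conversion; it was projecting and drawing the resulting points in 3-D.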

Then, like a ghost, Celestia came to me last night. Wrapped up in her open source glory, I dared not even dream that she could perform what I had so long abandoned all hope of. But she did, my friend, she did. (My wife won’t like this imagery :))


More confirmation that Vista blows

Posted: 18 August 2007 in Uncategorized

Well, the now-former editor-in-chief of PC Magazine, that great citadel of Microsoft brown-nosing, is now swearing off Vista. You know something is wrong when Microsoft’s lackeys begin jumping ship. This should certainly convince anyone who wasn’t already convinced by the fact that most businesses are reluctant to switch to Vista or that China will use XP for computer systems relating to next year’s Olympics.

I love this quote from the Olympics tech guy at Lenovo:

“At the Olympics, we need the most reliable and stable systems.” (source)

That just says it all.

Full disclosure: I do not use nor do I plan to use Vista.
Note: I was informed the PC Mag link isn’t permanent, so I have linked to the Digg article instead.

Beautiful Bolide

Posted: 13 August 2007 in Uncategorized

As I mentioned in a previous post, I visited my family in Ohio this past weekend. The Perseid meteor shower peaked Sunday night/Monday morning, but the shower was going fairly strong Saturday night/Sunday morning. For the first time since the early ’90s, I got a chance to sit out beneath the stars in perfect weather with no moon to watch a meteor shower.


Here Be Patent Trolls

Posted: 7 August 2007 in Uncategorized

It appears that Facebook has just had another lawsuit filed against it, this time for patent infringement. Cross Atlantic Capital Partners (hereafter referred to as the Patent Trolls) claim infringement based on patent 6,519,629: “System for creating a community for users with common interests to interact in”. So it appears they are targeting the Groups functionality. I don’t pretend to know anything about patent law, but from what I have read in the patent, it does appear they have a case. However, I believe this is just one more example of how issuing patents for software is fundamentally flawed.