Posts Tagged ‘computational linguistics’

At the Atlanta Semantic Web Meetup tonight, Vishy Dasari gave us a quick description and demo of a new search engine called Semantifi.  They purportedly are a search engine for the deep web, meaning the web that is not indexed by traditional search engines because the content is dynamic.  They are just in the very early stages, but have opened the site for people to play with and add data to via “Apps.”  These apps are sort of like agents that respond to queries, returning results to some marshal process that decides which App will get the right to answer.  Results are ranked by some method I wasn’t able to ascertain, but it reminded me of how Amy Iris works.  These apps form the backbone of the Semantifi system, it seems, and they are crowdsourcing their creation.  You can create a very simple app to return answers on your own data set in a few short minutes.

Perhaps more interesting is that they use a natural language interface in addition to the standard query sort of interface we’re all used to.  Given the small amount of data currently available, I couldn’t really determine just how well this interface performs.  It is based on a cognitive theory by John Hawks (sp?) that apparently states we think in terms of patterns.  That’s very general and I haven’t been able to chase down that reference — and I forgot to ask Vishy for more info at the meetup.  If someone can clear that up for me, I’d be grateful.  The only seemingly relevant John Hawks I could find is a paleoanthropologist, so not sure.  Anyhow, these patterns are what Vishy says the system uses to interpret natural language input.  That may be a grandiose way of saying n-gram matching.
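To make the n-gram guess concrete, here is a minimal sketch of what such matching could look like. It is purely illustrative: the app names, trigger phrases, and scoring are invented for the example, and nothing here is known to reflect how Semantifi actually works.

```ruby
# Hypothetical sketch: route a query to an "app" by counting the word
# n-grams it shares with each app's trigger phrases. All names and data
# below are made up for illustration.

# Split text into lowercase word tokens.
def tokens(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# All word n-grams of size n, each as an array of tokens.
def ngrams(words, n)
  words.each_cons(n).to_a
end

# Number of distinct unigrams and bigrams shared by query and pattern.
def overlap_score(query, pattern)
  query_words = tokens(query)
  pattern_words = tokens(pattern)
  (1..2).sum { |n| (ngrams(query_words, n) & ngrams(pattern_words, n)).size }
end

# Pick the app whose best pattern shares the most n-grams with the query.
def best_app(query, apps)
  winner = apps.max_by do |_name, patterns|
    patterns.map { |pattern| overlap_score(query, pattern) }.max || 0
  end
  winner&.first
end

APPS = {
  "mortgage" => ["monthly mortgage payment", "mortgage rate for a loan"],
  "weather"  => ["weather forecast for a city", "temperature today"]
}.freeze

puts best_app("what is the mortgage payment on a 30 year loan", APPS)
```

A real system would need some threshold for "no app matches" and a smarter ranking than raw overlap counts, which is presumably where the marshal process comes in.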

While Wolfram|Alpha is a computational knowledge engine™, Semantifi does not make that claim. Apps may compute certain things like mortgage values, but it’s not a general purpose calculator.  However, Semantifi is looking at bringing in unstructured data from blogs and the like, that W|A ignores.  It remains to be seen what that will look like, though.  Also, users can contribute to Semantifi while W|A is a black box.  In any case, they are making interesting claims and I look forward to seeing how they play out with more data.

Note: All of my observations are based on notes and memories of tonight’s presentation, so if I made any mistakes please post corrections in the comments or email me.

There are quite a few well-known libraries for doing various NLP tasks in Java and Python, such as the Stanford Parser (Java) and the Natural Language Toolkit (Python).  For Ruby, there are a few resources out there, but they are usually derivative or not as mature.  By derivative, I mean they are ports from other languages or extensions using code from another language.  And I’m responsible for two of them! :)

  • Treat – Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I’ve encountered so far for Ruby
    • Text extractors for various document formats
    • Chunkers, segmenters, tokenizers
    • LDA
    • much more – the list is big
  • Ruby Linguistics – this is one of the more ambitious projects, but is not as mature as NLTK
    • interface for WordNet
    • Link grammar parser
    • some inflection stuff
  • Stanford Core NLP – if you’ve gotten a headache trying to use the Java bridge, this is your answer
  • Stanford Parser interface – uses a Java bridge to access the Stanford Parser library
  • Mark Watson has a part of speech tagger [zip], a text categorizer [zip], and some text extraction utilities [zip], but I haven’t tried to use them yet
  • LDA Ruby Gem – Ruby port of David Blei’s lda-c library by yours truly
    • Uses Blei’s C code for the actual LDA but I include some wrappers to make using it a bit easier
  • UEA Stemmer – Ruby port (again by yours truly) of a conservative stemmer based on Jenkins and Smith’s UEA Stemmer
  • Stemmer gem – Porter stemmer
  • Lingua Stemmer – another stemming library, Porter stemmer
  • Ruby WordNet – basically what’s included in Ruby Linguistics
  • Raspell – Ruby interface to Aspell spell checker

There are also a number of fledgling or orphaned projects out there purporting to be ports or interfaces for various other libraries like Stanford POS Tagger and Named Entity Recognizer.  Ruby (straight Ruby, not just JRuby) can interface just about any Java library using the Ruby Java Bridge (RJB).  RJB can be a pain, and I could only initialize it once per run (a second attempt never succeeds), so there are some limitations.  But using it, I was able to easily interface with the Stanford POS tagger.

So while there aren’t terribly many libraries for NLP tasks in Ruby, the ability to interface with Java directly widens the scope quite a bit.  You can also incorporate a C library using extensions.

Naturally, if I missed anything, no matter how small, please let me know.

Update: Here is a great list of AI-related Ruby libraries from Dustin Smith.

When Lazyfeed announced a limited round of beta invites on TechCrunch, I admit, I lusted after them.  Only 250?  I wanted to be one!  But alas, I was put on the waiting list.  It’s a decent marketing strategy for building up some hype.  When I finally did get my invite, I tried them out for about 5 minutes and fell prey to the distractions of the internet.  That was a bad sign, though.  Usually a new service can hold my attention for a little while longer.  So what happened?


Lazyfeed is a service that lets you enter topics, blogs, twitter, delicious and flickr accounts to form a live streaming lazyfeed.  You then get live updates in the form of your tags being updated.  Your main screen consists of a bunch of boxes with your topics and then things it guesses are related.

The hook

Lazyfeed’s marketing strategy succeeded again by giving me three invites to hand out to friends.  I offered them on Twitter, having only one person bite.  So here are the other two invites for the adventurous.  Get em while they’re hot.  If you manage to take one, please comment that you did so, so that I can at least know who you were and we can save someone else the wasted time.  I’m just throwing them into the ether like this because I don’t feel like pushing them on Twitter again.


The rub

Lazyfeed is a lovely service in terms of appearance and ajaxy goodness, but my initial impression is that it ends up being streaming information overload.  For one, the topic suggestion feature appears to be fairly naive.  Someone correct me if I’m wrong, but it looks a bit like document similarity for topics is done purely by one-for-one matching on tags.  Whatever the method, the result of their suggested topics (“Stuff for Lazy Jason”) is stuff like the following:

Lazyfeed sample results

Lazyfeed sample suggested topics

Granted, it’s a hard problem, but those results are pretty bad.  So as I started to write this post lambasting this service, I considered that maybe I was just seeing cold-start problems, and I was being unfair.  So I trained it with some additional feeds and topics that are straight-to-the-point of stuff I’m interested in, like sigir2009, topicmodeling, recommendersystems, etc.  Tags can contain no spaces, btw, which is why those don’t.  When I tried using dashes, like I often do on delicious, it gave no results.  I also removed some things that were too general or contained too many spurious results.
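For the curious, here is a rough sketch of what suggestion by one-for-one tag matching might look like. This is only my guess at the behavior, not Lazyfeed's actual method, and the items and tags are invented for the example.

```ruby
require "set"

# Hypothetical sketch of naive topic suggestion: suggest any tag that
# co-occurs on an item with a tag the user already follows, ranked by how
# often it co-occurs. Matching is exact string equality and nothing more.
def suggest_topics(followed, items, limit: 5)
  followed = followed.to_set
  counts = Hash.new(0)
  items.each do |tags|
    # Skip items that share no tag with the user's followed topics.
    next unless tags.any? { |tag| followed.include?(tag) }
    tags.each { |tag| counts[tag] += 1 unless followed.include?(tag) }
  end
  # Most frequent first; ties broken alphabetically for stable output.
  counts.sort_by { |tag, count| [-count, tag] }.first(limit).map(&:first)
end

# Made-up tagged items standing in for blog posts or bookmarks.
ITEMS = [
  %w[topicmodeling lda machinelearning],
  %w[topicmodeling lda python],
  %w[recommendersystems netflix],
  %w[python django],
  %w[topic-modeling gibbs]
].freeze

p suggest_topics(%w[topicmodeling recommendersystems], ITEMS)
```

Note that the item tagged topic-modeling contributes nothing, because exact matching treats it as a different tag from topicmodeling. That is the same kind of brittleness I ran into with dashes.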

The light

Things started improving here, and I actually began to understand what the point of Lazyfeed is.  My initial confusion was that “Stuff for Lazy Jason” is stuff that I would want to read right now.  Being lazy, I didn’t expect to have to do work to get those things.  But “Stuff for Lazy Jason” is a list of topics it thinks I might be interested in.  Saving any one of those puts it into my lazyfeed, which is in the bar on the left.

My lazyfeed topics

So now what happens is that occasionally it discovers something new related to my interests and it bumps that category to the top of the list and turns it bold again (grayed out topics have been read).  Most of my topics are low traffic, so add something like mariahcarey if you want to see this functionality in action.  Now we’re getting somewhere.  It has actually started being helpful and has found me some stuff that my Google alerts haven’t.  Which is weird, and is making me think I need to double check to make sure my Google alerts are working…

The end

My takeaway after using Lazyfeed for nigh on two hours is that it’s an interesting alternative (or even extension) to RSS, but one that still hasn’t crossed the bridge to the next stage in evolution.  The idea is solid.  Automatically discover stuff in the sea of human knowledge (or human idiocy) and serve it up fresh.  The implementation lacks robust topic detection, which it will unfortunately need if it is to be a useful stream of relevant information rather than another source of information overload.  Relevance is an ephemeral thing, given that your information needs change from day to day.  Lazyfeed makes it pretty easy to get rid of old topics and add new ones, even if some of their suggestions are still wonky.  It’s an interesting recommender system problem with a lot of potential.

A while back I ported David Blei’s lda-c code for performing Latent Dirichlet Allocation to Ruby.  Basically I just wrapped the C methods in a Ruby class, turned it into a gem, and called it a day.  The result was a bit ugly and unwieldy, like most research code.  A few months later, Todd Fisher came along and discovered a couple of bugs and memory leaks in the C code, for which I am very grateful.  I had been toying with the idea of improving the Ruby code, and embarked on a mission to do so.  The result is a hopefully much cleaner gem that can be used right out of the box with little screwing around.

Unfortunately, I did something I’m ashamed of.  Ruby gems are notorious for breaking backwards compatibility, and I have done just that.  The good news is, your code will almost work, assuming you didn’t start diving into the Document and Corpus classes too heavily.  If you did, then you will probably experience a lot of breakage.  The result, I hope, is a more sensible implementation, so maybe you won’t hate me.  Of course, I could be wrong and my implementation could still be crap.  If that’s the case, please let me know what needs to be improved.

To install the gem:

gem sources -a
sudo gem install ealdent-lda-ruby


A twitter friend (@communicating) tipped me off to the UEA-Lite Stemmer by Marie-Claire Jenkins and Dan J. Smith.  Stemmers are NLP tools that get rid of inflectional and derivational affixes from words.  In English, that usually means getting rid of the plural -s, progressive -ing, and preterite -ed.  Depending on the type of stemmer, that might also mean getting rid of derivational suffixes like -ful and -ness.  Sometimes it’s useful to be able to reduce words like consolation and console to the same root form: consol.  But sometimes that doesn’t make sense.  If you’re searching for video game consoles, you don’t want to find documents about consolation.  In this case, you need a conservative stemmer.
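Here is a toy sketch of the difference between a conservative and an aggressive stemmer. The suffix rules are invented for illustration and are not the UEA-Lite or Porter rule sets, which are far more careful.

```ruby
# Toy suffix-stripping stemmers. Each rule is [suffix pattern, replacement].
# These rules are made up for the example and will mangle plenty of words.

INFLECTIONAL = [
  [/ies\z/, "y"],  # ponies -> pony
  [/ing\z/, ""],   # walking -> walk
  [/ed\z/,  ""],   # walked -> walk
  [/s\z/,   ""]    # consoles -> console
].freeze

DERIVATIONAL = [
  [/ation\z/, ""], # consolation -> consol
  [/ness\z/,  ""], # kindness -> kind
  [/ful\z/,   ""], # hopeful -> hope
  [/e\z/,     ""]  # console -> consol
].freeze

# Apply the first rule that changes the word, unless the stem would be
# shorter than three characters.
def strip_suffix(word, rules)
  rules.each do |pattern, replacement|
    stem = word.sub(pattern, replacement)
    return stem if stem != word && stem.length >= 3
  end
  word
end

# Conservative: only strip inflectional endings.
def conservative_stem(word)
  strip_suffix(word.downcase, INFLECTIONAL)
end

# Aggressive: strip inflectional endings, then one derivational suffix.
def aggressive_stem(word)
  strip_suffix(conservative_stem(word), DERIVATIONAL)
end

%w[consoles consolation].each do |word|
  puts "#{word}: conservative=#{conservative_stem(word)} aggressive=#{aggressive_stem(word)}"
end
```

With these rules the aggressive stemmer collapses both consoles and consolation to consol, while the conservative one keeps console and consolation apart, which is what you want for the video game search.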

The UEA-Lite Stemmer is a rule-based, conservative stemmer that handles regular words, proper nouns and acronyms.  It was originally written in Perl, but had been ported to Java.  Since I usually code in Ruby these days, I thought it’d be nice to make it available to the Ruby community, so I ported it over last night.

The code is open source under the Apache 2 License and hosted on github.  So please check out the code and let me know what you think.  Heck, you can even fork the project and make some improvements yourself if you want.

One direction I’d like to be able to go is to turn all of the rules into finite state transducers, which can be composed into a single large deterministic finite state transducer.  That would be a lot more efficient (and even fun!), but Ruby lacks a decent FST implementation.

Perhaps you’ve heard of the latest brainchild of the Wunderkind Stephen Wolfram: Wolfram|Alpha.  Matthew Hurst nicknamed it Alphram today and I agree that’s a much better name.  Wolfram|Alpha (W|A henceforth) is not a search engine, it’s a knowledge engine.  It will compete with Google on a slice of traffic that Google really isn’t all that hot in for now: comparative question answering.  When you ask Google something like “How does the GDP of South Africa compare to China?” you hope you get back something relevant in the first few results (spoiler alert:  you don’t).  When you ask that of W|A, you get exactly what you’re looking for.  Beautiful.  W|A’s so-called natural language interface isn’t perfect, though.  You get a lot of flakiness from it until you start to recognize what works and what doesn’t.

Now let’s be honest.  How often do we search for that kind of thing?  Not very often.  I think that’s partly because Google is notoriously bad at it.  Once we start to get a handle on what W|A is capable of, I think people will start expecting more of their friendly neighborhood search giant.  Google claims to have a few tricks up its sleeves, but everything I’ve seen out of Google lately has been such a disappointment I am deeply skeptical.  The new trick is called Google Squared and it returns search results in a spreadsheet format, breaking down the various facets of the things you are searching for.  In the demo, it shows stuff like rollercoaster drop speeds, heights, etc when you search for roller coasters.  You can add to the square and do some pretty nifty stuff.  TechCrunch claims this will kill W|A.  I think the two could be complementary.  Based on the demo, I expect W|A will return results of a higher calibre, but will miss out on a lot of queries because the knowledge is just missing.  Google Squared appears to be doing something fuzzier and will return results that might be really bad.  So while W|A just says it doesn’t know, Google Squared will let you pick through the junk to find the gem.  Google Squared is expected to launch later this month in Google Labs.

Many have said that where W|A will really compete is against Wikipedia and I am inclined to agree.  There are plenty of things I go to Wikipedia for now that I probably will switch over to W|A for, like populations of countries, size of Neptune’s moons, and so on.  Wikipedia still wins for more in-depth knowledge on a topic.  W|A also does some pretty cool stuff when you search for the definition of a word (use a query like “word kitten”).  You learn that kitten comes from Classical Latin, and entered English about 700 years ago.  You can find out a similar thing (and go further in depth for the etymology at least) using the American Heritage dictionary, but W|A requires less digging.

And this brings me around to a key point with W|A.  It’s an awesome factoid answering service.  It does it well and it does it in a pretty way.  Stuff you can find in more depth elsewhere you can get quickly and easily, but only superficially via W|A.  There are links to more information, though, so you don’t lose much by relying on W|A — assuming it has knowledge about what you’re looking for.  You’re still going to be more likely to hit a brick wall with W|A.

And of course, since Wolfram developed Mathematica, W|A is backed by it.  Enter an equation and you get some really handy math info back.  Need to quickly know the derivative of a fairly complicated equation?  Presto.  Probably the most satisfying feeling I got today was from a query similar to “what is the area under x^4+3x^2+4 from 1 to 8?”  Let’s see you answer that, Google Squared.

Wolfram|Alpha sample results
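For reference, the integral in that query is easy to check by hand or numerically. The sketch below is just a sanity check, not anything W|A does internally: it evaluates the hand-derived antiderivative x^5/5 + x^3 + 4x at the bounds and compares it with a composite Simpson's rule estimate. The function names are my own.

```ruby
# Integrand from the query: x^4 + 3x^2 + 4.
def f(x)
  x**4 + 3 * x**2 + 4
end

# Antiderivative worked out by hand: x^5/5 + x^3 + 4x (exact for integers).
def antiderivative(x)
  Rational(x**5, 5) + x**3 + 4 * x
end

# Composite Simpson's rule over [a, b] with an even number n of subintervals.
def simpson(a, b, n = 100)
  raise ArgumentError, "n must be even" if n.odd?

  h = (b - a).to_f / n
  sum = f(a) + f(b)
  (1...n).each { |i| sum += (i.odd? ? 4 : 2) * f(a + i * h) }
  sum * h / 3
end

exact = antiderivative(8) - antiderivative(1)
puts "exact:   #{exact.to_f}"
puts "simpson: #{simpson(1, 8)}"
```

The exact value is 35462/5, or 7092.4, and the Simpson estimate agrees to within a small fraction of a unit.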

The papers are out for WWW2009 (and have been for a bit), but I’ve only just gotten a chance to start looking at them. First of all, kudos to the ePrints people for improving the presentation of conference proceedings. This is a lot easier than having to do a Google Scholar search for each paper and hoping I find something, like I have to do with some conferences.

WWW2009 Madrid

There are a lot of very interesting ones, and here are a few that bubbled to the top of my reading list:

Data Mining Track

Semantic/Data Web

Social Networks and Web 2.0

There has been much ballyhoo in the blogosphere touting Google’s so-called foray into semantic search.  The blog post announcing the new feature doesn’t even mention the word semantics, but it does say it looks at associations and concepts related to your query.  I see no mention of tuples or anything of the sort and the suggested queries are the kind of thing that I would expect to come out of a background closer to document/query classification than semantic analysis.

Related search results for much ado about nothing

And the results are pretty meh.  Except for taming of the shrew, those results are no-brainers.  That’s query completion quality results.  Of course you can’t judge the whole system by one isolated example.

When PC World and a host of other pop tech media zines started toasting the entrance of Google to the semantic arena, I was excited to see some cool stuff.  Imagine my disappointment when I was underwhelmed not only by the quality of the results, but also by the lack of novelty.  How long has that feature been there?  Seems like I’ve seen it for ages.  Maybe it got a technological face-lift (I guess that would be a face-lift on the inside), but it looks about the same as I remember it.  Plus, its placement at the bottom of the results page relegates it to search engine hell.

In summary:  boring.  My complaints are first and foremost with those elements of the blagoblag who over-hyped this.  Secondly, I am complaining to Google for not being better.  I am feeling demanding today.

Daniel’s post on it is worth reading.

Since I started blogging almost a year and a half ago, I have been following many blogs. I managed to find some blogs dealing with computational linguistics and natural language processing, but they were few and far between. Since then, I’ve discovered quite a few NLP people that have entered the blagoblag. Here is a non-exhaustive list of the many that I follow.

Many of these bloggers post sporadically and even then only post about CL/NLP occasionally. I’ve tried to organize the list into those who post exclusively on CL/NLP (at least as far as I have followed them) and those who post sporadically on CL/NLP. I would fall into the latter, since I frequently blog about my dogs, regular computer science-y and programming stuff, and other rants. P.S. I group Information Retrieval in with CL/NLP here, but only the blogs I actually read. I’m sure there’s a bazillion I don’t.

If I’ve missed one+, please let me know. I’m always on the lookout. Ditto if you think I’ve miscategorized someone.  I’ve excluded a few that haven’t posted in a while.

I got most of the books I wanted the most for Christmas this year. It was a great haul that will keep me busy for a while. Among them were:

The books on string and tree algorithms and collective intelligence should be self-explanatory. The book on data visualization I wanted because it was an overlooked skill in my education. I appreciate great data visualizations and taking some steps to improve my understanding and increase my skills in that area is worth doing. Finally the book on evolutionary computing is for personal enrichment. I’ve been playing around with genetic algorithms since 1994, even before I got out of high school. It’s always been playing, though, and I wanted a bit of a more rigorous introduction to them.

With any luck, I’ll be posting some thoughts on these books in the coming months.