Posts Tagged ‘natural language processing’

At the Atlanta Semantic Web Meetup tonight, Vishy Dasari gave us a quick description and demo of a new search engine called Semantifi.  They purportedly are a search engine for the deep web, meaning the web that is not indexed by traditional search engines because the content is dynamic.  They are just in the very early stages, but have opened the site for people to play with and add data to via “Apps.”  These apps are sort of like agents that respond to queries, returning results to some marshal process that decides which App will get the right to answer.  Results are ranked by some method I wasn’t able to ascertain, but it reminded me of how Amy Iris works.  These apps form the backbone of the Semantifi system, it seems, and they are crowdsourcing their creation.  You can create a very simple app to return answers on your own data set in a few short minutes.
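
To make the dispatch model concrete, here is a minimal sketch of how I understood it: each "app" scores its own answer to a query, and a marshal process awards the query to the most confident app. All of the class and method names below are my own invention for illustration, not anything from Semantifi's actual API.

```ruby
# Toy sketch of the App-dispatch model: each app answers a query with a
# confidence score, and a marshal picks the winner. Names are hypothetical.
class App
  attr_reader :name

  def initialize(name, data)
    @name = name
    @data = data # term => answer lookup, standing in for a real data set
  end

  # Return [answer, confidence] or nil if this app has nothing to offer.
  def respond(query)
    term, answer = @data.find { |t, _| query.downcase.include?(t) }
    term ? [answer, term.length.to_f / query.length] : nil
  end
end

class QueryMarshal
  def initialize(apps)
    @apps = apps
  end

  # Ask every app, drop non-answers, and let the most confident app win.
  def answer(query)
    responses = @apps.map { |a| [a.name, a.respond(query)] }
                     .reject { |_, r| r.nil? }
    winner = responses.max_by { |_, (_, conf)| conf }
    winner && { app: winner[0], answer: winner[1][0] }
  end
end

apps = [
  App.new("mortgage", "mortgage rate" => "5.1%"),
  App.new("geo",      "capital of mongolia" => "Ulaanbaatar")
]
marshal = QueryMarshal.new(apps)
puts marshal.answer("what is the capital of mongolia?")[:answer]
# => Ulaanbaatar
```

However Semantifi actually ranks responses, some arbitration step like this has to exist once multiple crowdsourced apps can claim the same query.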

Perhaps more interesting is that they use a natural language interface in addition to the standard query sort of interface we’re all used to.  Given the small amount of data currently available, I couldn’t really determine just how well this interface performs.  It is based on a cognitive theory by John Hawks (sp?) that apparently states we think in terms of patterns.  That’s very general and I haven’t been able to chase down that reference — and I forgot to ask Vishy for more info at the meetup.  If someone can clear that up for me, I’d be grateful.  The only seemingly relevant John Hawks I could find is a paleoanthropologist, so not sure.  Anyhow, these patterns are what Vishy says the system uses to interpret natural language input.  That may be a grandiose way of saying n-gram matching.
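
For what it's worth, here is what "pattern matching" might boil down to in the simplest case: score a set of stored query patterns by word-bigram overlap with the user's input and pick the best one. This is purely my guess at a baseline, not anything Semantifi has confirmed about their system.

```ruby
# A guess at what pattern-based interpretation could reduce to:
# rank stored query patterns by word-bigram overlap with the input.
def ngrams(text, n = 2)
  words = text.downcase.scan(/[a-z0-9<>]+/)
  return words if words.size < n
  words.each_cons(n).map { |g| g.join(" ") }
end

def best_pattern(query, patterns)
  q = ngrams(query)
  patterns.max_by { |p| (ngrams(p) & q).size }
end

patterns = [
  "population of <country>",
  "mortgage payment on <amount>",
  "compare gdp of <country> and <country>"
]
puts best_pattern("what is the population of france", patterns)
# => population of <country>
```

A real system would need slot filling and much smarter scoring, but even this crude overlap picks the right pattern for simple factual queries.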

While Wolfram|Alpha is a computational knowledge engine™, Semantifi does not make that claim. Apps may compute certain things like mortgage values, but it’s not a general purpose calculator.  However, Semantifi is looking at bringing in unstructured data from blogs and the like, that W|A ignores.  It remains to be seen what that will look like, though.  Also, users can contribute to Semantifi while W|A is a black box.  In any case, they are making interesting claims and I look forward to seeing how they play out with more data.

Note: All of my observations are based on notes and memories of tonight’s presentation, so if I made any mistakes please post corrections in the comments or email me.

There are quite a few well-known libraries for doing various NLP tasks in Java and Python, such as the Stanford Parser (Java) and the Natural Language Toolkit (Python).  For Ruby, there are a few resources out there, but they are usually derivative or not as mature.  By derivative, I mean they are ports from other languages or extensions using code from another language.  And I’m responsible for two of them! :)

  • Treat – Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit I’ve encountered so far for Ruby
    • Text extractors for various document formats
    • Chunkers, segmenters, tokenizers
    • LDA
    • much more – the list is big
  • Ruby Linguistics – this is one of the more ambitious projects, but is not as mature as NLTK
    • interface for WordNet
    • Link grammar parser
    • some inflection stuff
  • Stanford Core NLP – if you’ve gotten a headache trying to use the Java bridge, this is your answer
  • Stanford Parser interface – uses a Java bridge to access the Stanford Parser library
  • Mark Watson has a part of speech tagger [zip], a text categorizer [zip], and some text extraction utilities [zip], but I haven’t tried to use them yet
  • LDA Ruby Gem – Ruby port of David Blei’s lda-c library by yours truly
    • Uses Blei’s C code for the actual LDA, but I include some wrappers to make using it a bit easier
  • UEA Stemmer – Ruby port (again by yours truly) of a conservative stemmer based on Jenkins and Smith’s UEA Stemmer
  • Stemmer gem – Porter stemmer
  • Lingua Stemmer – another stemming library, Porter stemmer
  • Ruby WordNet – basically what’s included in Ruby Linguistics
  • Raspell – Ruby interface to Aspell spell checker
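
Since three of the entries above are stemmers, here is a toy illustration of what "conservative" means in that context: strip only a few unambiguous suffixes and leave everything else alone, in contrast to Porter's aggressive rewriting. To be clear, this is a teaching sketch of the general idea, not the actual UEA or Porter algorithm.

```ruby
# Toy conservative stemmer: only a handful of safe suffix rules,
# applied first-match-wins. Not the real UEA Stemmer.
SAFE_SUFFIXES = [
  [/sses\z/,      "ss"],   # "classes" -> "class"
  [/ies\z/,       "y"],    # "ponies"  -> "pony"
  [/ings?\z/,     ""],     # "talking" -> "talk"
  [/([^s])s\z/,   '\1']    # "cats" -> "cat", but "glass" is left alone
]

def conservative_stem(word)
  SAFE_SUFFIXES.each do |pattern, replacement|
    return word.sub(pattern, replacement) if word =~ pattern
  end
  word
end

%w[classes ponies talking cats glass].each do |w|
  puts "#{w} -> #{conservative_stem(w)}"
end
```

The trade-off is the usual one: a conservative stemmer under-conflates ("running" and "runner" stay distinct) while Porter over-conflates ("university" and "universe" can collide), and which failure mode you prefer depends on the application.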

There are also a number of fledgling or orphaned projects out there purporting to be ports or interfaces for various other libraries like the Stanford POS Tagger and Named Entity Recognizer.  Ruby (straight Ruby, not just JRuby) can interface with just about any Java library using the Ruby Java Bridge (RJB).  RJB can be a pain, and I could only initialize it once per run (a second attempt never succeeds), so there are some limitations.  But using it, I was able to interface with the Stanford POS tagger fairly easily.

So while there aren’t terribly many libraries for NLP tasks in Ruby, the ability to interface with Java directly widens the scope quite a bit.  You can also incorporate a C library using extensions.

Naturally, if I missed anything, no matter how small, please let me know.

Update: Here is a great list of AI-related Ruby libraries from Dustin Smith.

Perhaps you’ve heard of the latest brainchild of the Wunderkind Stephen Wolfram: Wolfram|Alpha.  Matthew Hurst nicknamed it Alphram today and I agree that’s a much better name.  Wolfram|Alpha (W|A henceforth) is not a search engine, it’s a knowledge engine.  It will compete with Google on a slice of traffic that Google really isn’t all that hot at for now: comparative question answering.  When you ask Google something like “How does the GDP of South Africa compare to China?” you hope you get back something relevant in the first few results (spoiler alert: you don’t).  When you ask that of W|A, you get exactly what you’re looking for.  Beautiful.  W|A’s so-called natural language interface isn’t perfect, though.  You get a lot of flakiness from it until you start to recognize what works and what doesn’t.

Now let’s be honest.  How often do we search for that kind of thing?  Not very often.  I think that’s partly because Google is notoriously bad at it.  Once we start to get a handle on what W|A is capable of, I think people will start expecting more of their friendly neighborhood search giant.  Google claims to have a few tricks up its sleeves, but everything I’ve seen out of Google lately has been such a disappointment that I am deeply skeptical.  The new trick is called Google Squared and it returns search results in a spreadsheet format, breaking down the various facets of the things you are searching for.  In the demo, it shows stuff like roller coaster drop speeds, heights, etc. when you search for roller coasters.  You can add to the square and do some pretty nifty stuff.  TechCrunch claims this will kill W|A.  I think the two could be complementary.  Based on the demo, I expect W|A will return results of a higher calibre, but will miss out on a lot of queries because the knowledge is just missing.  Google Squared appears to be doing something fuzzier and will return results that might be really bad.  So while W|A just says it doesn’t know, Google Squared will let you pick through the junk to find the gem.  Google Squared is expected to launch later this month in Google Labs.

Many have said that where W|A will really compete is against Wikipedia, and I am inclined to agree.  There are plenty of things I go to Wikipedia for now that I will probably switch over to W|A for, like populations of countries, sizes of Neptune’s moons, and so on.  Wikipedia still wins for more in-depth knowledge on a topic.  W|A also does some pretty cool stuff when you search for the definition of a word (use a query like “word kitten”).  You learn that kitten comes from Classical Latin, and entered English about 700 years ago.  You can find out a similar thing (and go further in depth for the etymology at least) using the American Heritage dictionary, but W|A requires less digging.

And this brings me around to a key point with W|A.  It’s an awesome factoid answering service.  It does it well and it does it in a pretty way.  Stuff you can find in more depth elsewhere you can get quickly and easily, but only superficially via W|A.  There are links to more information, though, so you don’t lose much by relying on W|A — assuming it has knowledge about what you’re looking for.  You’re still going to be more likely to hit a brick wall with W|A.

And of course, since Wolfram developed Mathematica, W|A is backed by it.  Enter an equation and you get some really handy math info back.  Need to quickly know the derivative of a fairly complicated equation?  Presto.  Probably the most satisfying feeling I got today was from a query similar to “what is the area under x^4+3x^2+4 from 1 to 8?”  Let’s see you answer that, Google Squared.
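
For the record, that query has a clean closed-form answer: the antiderivative of x^4 + 3x^2 + 4 is x^5/5 + x^3 + 4x, which evaluated from 1 to 8 gives 7092.4. A few lines of Ruby verify it, with Simpson's rule as a numerical sanity check:

```ruby
# The integrand from the W|A query.
def f(x)
  x**4 + 3 * x**2 + 4
end

# Antiderivative: x^5/5 + x^3 + 4x
def big_f(x)
  x**5 / 5.0 + x**3 + 4 * x
end

# Composite Simpson's rule on [a, b] with n (even) subintervals.
def simpson(a, b, n = 1000)
  h = (b - a).to_f / n
  sum = f(a) + f(b)
  (1...n).each { |i| sum += f(a + i * h) * (i.odd? ? 4 : 2) }
  sum * h / 3
end

puts (big_f(8) - big_f(1)).round(4)  # 7092.4
puts simpson(1, 8).round(4)          # 7092.4, agreeing with the exact value
```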

Wolfram|Alpha sample results


Since I started blogging almost a year and a half ago, I have been following many blogs. I managed to find some blogs dealing with computational linguistics and natural language processing, but they were few and far between. Since then, I’ve discovered quite a few NLP people that have entered the blagoblag. Here is a non-exhaustive list of the many that I follow.

Many of these bloggers post sporadically and even then only post about CL/NLP occasionally. I’ve tried to organize the list into those who post exclusively on CL/NLP (at least as far as I have followed them) and those who post sporadically on CL/NLP. I would fall into the latter, since I frequently blog about my dogs, regular computer science-y and programming stuff, and other rants. P.S. I group Information Retrieval in with CL/NLP here, but only the blogs I actually read. I’m sure there’s a bazillion I don’t.

If I’ve missed one, please let me know. I’m always on the lookout. Ditto if you think I’ve miscategorized someone.  I’ve excluded a few that haven’t posted in a while.

So I was recently asked (and gave a very bad answer to) a question that has been haunting me ever since.  What is the subfield of computer science where I am the strongest?  First of all, in my undergraduate training, I was never really introduced to these ideas of subfields of CS explicitly.  I knew intuitively there was a difference between people working on databases or on operating systems, programming languages or algorithms, but it wasn’t emphasized as a choice I would ever need to make.  This is perhaps because I went to a relatively weak school in CS for my undergrad.  But now that I’m in a rather strong CS school and pursuing a CS-related masters, the question should probably have entered my mind before now.

So when asked, I floundered about for an idea and spluttered out “algorithms” just because it seemed like it was hard to go wrong there.  Well, I’ll leave the details out of this little memoir, but suffice it to say, I was wrong.  A better answer would have been “none.”  Where does natural language processing / computational linguistics fall in the list of subfields?  Is it its own?  Or is it part artificial intelligence, part algorithms, part whatever?  I’ve seen it lumped with AI more closely in the past, but unfortunately AI escaped me as a possible choice when called upon in this high-stress scenario.  Moreover, I haven’t really compartmentalized techniques as belonging to “AI” or “databases.”  Is it useful to do that?  I guess I do sometimes, but when people ask me to make big picture assessments of things I haven’t thought about much, it takes me a while to process it.

I hate interviews.

OpenEphyra is a question answering (QA) system developed here at the Language Technologies Institute by Nico Schlaefer. He began his work at the University of Karlsruhe in Germany, but has since continued it at CMU and is currently a PhD student here. Since it is a home-grown language technologies package, I decided to check it out and play around. This is the first QA system I have used that wasn’t integrated in a search engine, so this isn’t exactly an expert review.

Getting started in Windows (or Linux or whatever) is pretty easy if you already have Apache Ant and Java installed. Ant isn’t strictly necessary, but I recommend getting it if you don’t have it already. It’s just handy. First, download the OpenEphyra package from SourceForge. The download is about 59 MB; once it’s done, unpack it in whatever directory you want. Assuming you have Ant installed, all you have to do is type ant to build it, though you may want to issue ant clean first. I had to make one slight change to the build.xml file to get it to run: line 55 read <jvmarg line="-server&#13;-Xms512m&#13;-Xmx1024m"/>, which had to be changed to <jvmarg line="-server -Xms512m -Xmx1024m"/>. Easy enough. Then to run it, all you have to do is type ant OpenEphyra.

After taking a short bit to load up, you can enter questions on the command line. Based on what I can tell from the output, it begins by normalizing the question (removing morphology, getting rid of punctuation). Then it determines the type of answer it is looking for, like a person’s name or a place, and assigns certain properties to what it expects to find. Next it automatically creates a list of queries that are sent to the search engine(s). The documentation indicates that the AQUAINT, AQUAINT-2 and BLOG06 corpora are supported (at least for preprocessing), but there are also searchers for Google, Wikipedia, Yahoo and several others. Indri is a search engine that supports structured queries, and OpenEphyra auto-generates some structured queries from what I saw playing around. After generating the queries, they are sent to the various searchers and results are obtained and scored. Finally, if you’re lucky, you get an answer to your question.
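
To make those stages concrete, here is a toy pass at the same pipeline shape: normalize, guess the answer type, generate back-off keyword queries, "search" a stub corpus, and score candidates. The structure and names are mine for illustration, not OpenEphyra's actual classes, and the two-document corpus stands in for real searchers.

```ruby
# Toy QA pipeline mirroring the stages visible in OpenEphyra's output.
# The corpus is a stub: document text => the answer it supports.
CORPUS = {
  "eli whitney invented the cotton gin"      => "Eli Whitney",
  "ulaanbaatar is the capital of mongolia"   => "Ulaanbaatar"
}

def normalize(question)
  question.downcase.gsub(/[[:punct:]]/, "")
end

def answer_type(question)
  case question
  when /\Awho/         then :person
  when /capital|where/ then :place
  else :unknown
  end
end

# Back off from the full keyword query to progressively shorter ones.
def generate_queries(question)
  words = question.split - %w[who what is the of a]
  (1..words.size).map { |n| words.first(n) }.reverse
end

def search_and_score(queries)
  scored = Hash.new(0)
  queries.each_with_index do |q, rank|
    CORPUS.each do |doc, answer|
      overlap = q.count { |w| doc.include?(w) }
      scored[answer] += overlap * (queries.size - rank) # longer queries weigh more
    end
  end
  best = scored.max_by { |_, s| s }
  best && best[1] > 0 ? best[0] : nil
end

def ask(question)
  q = normalize(question)
  type = answer_type(q) # unused in this toy; OpenEphyra filters candidates on it
  search_and_score(generate_queries(q))
end

puts ask("Who invented the cotton gin?")
# => Eli Whitney
```

With only two documents this is trivial, but the shape (normalize, type, query generation, retrieval, scoring) is the part that matters.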

Here are the results of screwing around with it for a few minutes:

  • Who created OpenEphyra?
    • no answer (sorry, Nico)
  • Who invented the cotton gin?
    • Eli Whitney
  • Who created man?
    • God
  • What is the capital of Mongolia?
    • Ulaanbaatar
  • Who invented the flux capacitor?
    • Doc Brown (awesome!)
  • Who is the author of the Mendicant Bug?
    • Zuckerberg — damn you, Facebook! :(
  • How much wood can a woodchuck chuck?
    • no answer (correct)
  • What is the atomic number of Curium?
    • 96 (also correct)
  • Who killed Lord Voldemort?
    • Harry (correct, but partial)
  • How many rings for elven kings?
    • 3021 (so, so very wrong)

Fun stuff! It’s not anywhere near perfect, but there are definite uses and the thing is ridiculously easy to install and use. Also, it’s in Java, so you can integrate it with your own system with very little effort. Depending on what sort of question you are looking for answers to, you get various levels of results. Factual questions about geography and people tend to do better than questions about numbers and fiction, as you might expect. Also, why-questions are not supported.

Another bonus is the project is open source, so if you’re into QA, you can help develop it.

In previous posts on cognate identification, I discussed the difference between strict and loose cognates. Loose cognates are words in two languages that have the same or similar written forms. I also described how approaches to cognate identification tend to differ based on whether the data being used is plain text or phonetic transcriptions. The type of data informs the methods. With plain text data, it is difficult to extract phonological information about the language so approaches in the past have largely been about string matching. I will discuss some of the approaches that have been taken below the jump.  In my next posting, when I get around to it, I will begin looking at some of the phonetic methods that have been applied to the task. (more…)

FSMNLP 2008 (Finite State Methods and Natural Language Processing) has issued their first Call for Papers (CFP). The deadline is May 11, 2008 and the conference will take place on September 11-12, 2008. Not the best time to be travelling perhaps, but this year it will be in Ispra, Lago Maggiore, Italy! That’s in the far north of Italy, right next to the Swiss border. From the pictures I’m finding on Google, it’s a gorgeous resort area.

Lago Maggiore - site of FSMNLP 2008

The sorts of things they are interested in include:

  • NLP applications and linguistic aspects of finite state methods
  • Finite state models of language
  • Practices for building lexical transducers for the world’s languages
  • Specification and implementation of sets, relations, and multiplicities in NLP using finite state devices
  • Machine learning of finite state models of natural language
  • Finite state manipulation software

The special theme this year will be on high performance finite state systems in large scale NLP applications.

I am going to try really hard to get something together for it this year. I had a project last year that was potentially worth submitting, but I wasn’t able to get it done in time. Unfortunately, it has languished since then as other, more pressing matters have superseded it. Going to Northern Italy ought to be motivation enough, though, don’t you think?

The entire CFP is below the jump and is also available on their website:


In my previous post on cognate identification, I gave two definitions for cognates: strict and loose (orthographic). Strict cognates are words in two related languages that descended from the same word in the ancestor language. Loose cognates are words in two languages that are spelled or pronounced similarly (depending on whether the data consists of phonetic transcriptions or plain text). These two definitions help form the basis for how I choose to classify approaches to doing cognate identification, but the source of data is the bigger factor, in my opinion. The orthographic approach looks at plain text and attempts to do some sort of string matching or statistical correlation based on the written (typeset) characters of the language. The phonetic approach relies on phonetic transcriptions of words in the language. Phonetic transcriptions are usually done in the International Phonetic Alphabet (IPA) but any standard form of representing sounds will work. One such example is the Carnegie Mellon Pronouncing Dictionary. Phonetic approaches may use string matching techniques, but there are also a number of inductive methods based on phonology that have been tried to good effect.
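
As a taste of what orthographic string matching looks like, one common similarity measure is Dice's coefficient over character bigrams. This is a generic example of the family of measures, not a reimplementation of any particular paper's method:

```ruby
# Dice's coefficient on character bigrams: a typical orthographic
# similarity measure for loose-cognate candidates.
def bigrams(word)
  word.downcase.chars.each_cons(2).map(&:join)
end

def dice(a, b)
  ba, bb = bigrams(a), bigrams(b)
  return 0.0 if ba.empty? || bb.empty?
  2.0 * (ba & bb).size / (ba.size + bb.size)
end

pairs = [
  %w[night nacht],     # English / German: strict cognates, weak string match
  %w[night noche],     # English / Spanish: strict cognates, no bigram overlap
  %w[library libreria] # a classic false friend, but a strong string match
]
pairs.each { |a, b| printf("%-8s %-9s %.2f\n", a, b, dice(a, b)) }
```

The night/noche pair illustrates exactly why plain-text string matching struggles: the words are strict cognates, but sound change has left them with zero shared bigrams, while the false friend library/librería scores highest of the three.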

So a good question might be: why does the data being used matter so much to these techniques? Why not classify the two approaches by whether they look for loose or strict cognates? Might there not be another way of classifying the approaches to cognate identification beyond these two? Or is there an entirely different set of classes that would better describe them? To answer the last two questions, I will say that there very well may be better ways of classifying these algorithms. As Anil pointed out in the comments to my last post, the two definitions lend themselves to different applications. From the papers that I read, it seemed that when researchers looked at plain text data, there was a completely different mindset than in papers where researchers used phonetic transcriptions. For the former, the goal was usually finding translational equivalences in bitext, and for the latter the goal is more to aid linguists attempting to reconstruct dead languages or establish relationships between languages.

With plain text, it is very difficult to infer sound correspondences between two languages. In Old English, the orthography developed by scribes corresponded directly to the spoken form. As English changed over the 1000+ years since then, the orthographic forms of words have frozen in some cases and not in others. For example, the word knight was originally spelled cniht and the c and h were both pronounced. The divergence of orthographic and phonetic forms can result in any number of problems and so it influences the ways of thinking about the task. On the other hand, phonetic approaches suffer due to data scarcity. Obtaining phonetic transcriptions is expensive as it requires the effort of linguists or individuals with specific, extensive training in the area. There are ways of obtaining phonetic transcriptions automatically, but these methods are not perfect and so result in noisy data, making this data practically useless for historical linguists.

In my next post, I will go into orthographic approaches in more detail, describing some of the papers I looked at and the methods they used. After that, I will begin discussing phonetic approaches, which are more numerous. I will also begin to look at how machine learning is being used to tackle cognate identification.

View all posts on cognate identification.

So you want to automatically parse sentences without having to go through all the trouble of figuring it out for yourself? You’ve come to the right place. This brief tutorial is aimed at students who are interested in computer science and linguistics who maybe want to dip their feet in the water of computational linguistics without having to understand immediately all of the daunting details. In other words, what I wish I had two years ago before applying to graduate school.