Posts Tagged ‘search engines’

At the Atlanta Semantic Web Meetup tonight, Vishy Dasari gave us a quick description and demo of a new search engine called Semantifi.  It is purportedly a search engine for the deep web, meaning the web that is not indexed by traditional search engines because the content is dynamic.  They are in the very early stages, but have opened the site for people to play with and add data to via “Apps.”  These apps are sort of like agents that respond to queries, returning results to some marshaling process that decides which App gets the right to answer.  Results are ranked by some method I wasn’t able to ascertain, but it reminded me of how Amy Iris works.  These apps seem to form the backbone of the Semantifi system, and their creation is being crowdsourced.  You can create a very simple app to return answers on your own data set in a few minutes.
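Vishy didn’t go into implementation details, so the following is purely hypothetical: a toy sketch of what an “apps bid, a marshal picks a winner” dispatch loop might look like.  All class and function names here are invented for illustration, not anything Semantifi has published.

```python
# Hypothetical sketch of the "apps as agents" dispatch idea described above.
# Semantifi's actual internals are not public; every name here is invented.

class App:
    def __init__(self, name, keywords):
        self.name = name
        self.keywords = set(keywords)

    def score(self, query):
        # Naive confidence: fraction of query terms this app recognizes.
        terms = query.lower().split()
        return sum(t in self.keywords for t in terms) / len(terms)

    def answer(self, query):
        return f"{self.name} answers: {query!r}"

def dispatch(query, apps):
    # The marshaling process: every app bids, the highest-confidence app wins.
    best = max(apps, key=lambda a: a.score(query))
    return best.answer(query) if best.score(query) > 0 else None

apps = [App("MortgageApp", ["mortgage", "rate", "payment"]),
        App("WeatherApp", ["weather", "forecast", "rain"])]
print(dispatch("what is my mortgage payment", apps))
```

A real marshal would presumably weigh historical answer quality and user feedback rather than keyword overlap, but the bid-and-select shape is the same.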

Perhaps more interesting is that they use a natural language interface in addition to the standard query interface we’re all used to.  Given the small amount of data currently available, I couldn’t really determine how well this interface performs.  It is based on a cognitive theory by someone I heard as “John Hawks” that apparently states we think in terms of patterns.  My best guess is that this is actually Jeff Hawkins, whose memory-prediction framework makes a similar claim, but I haven’t been able to chase down the reference and I forgot to ask Vishy for more info at the meetup.  If someone can clear that up for me, I’d be grateful.  (The only seemingly relevant John Hawks I could find is a paleoanthropologist, so probably not him.)  Anyhow, these patterns are what Vishy says the system uses to interpret natural language input.  That may be a grandiose way of saying n-gram matching.

While Wolfram|Alpha is a computational knowledge engine™, Semantifi does not make that claim.  Apps may compute certain things like mortgage values, but it’s not a general-purpose calculator.  However, Semantifi is looking at bringing in unstructured data from blogs and the like, which W|A ignores.  It remains to be seen what that will look like, though.  Also, users can contribute to Semantifi, while W|A is a black box.  In any case, they are making interesting claims and I look forward to seeing how they play out with more data.

Note: All of my observations are based on notes and memories of tonight’s presentation, so if I made any mistakes please post corrections in the comments or email me.


This week has given me two new toys to play with, and you could probably say both were bought at the dollar store.  The first was Microsoft‘s release of rebranded Live, aka Bing.  Bing’s search results have been poor (for me), but not much poorer than Google‘s.  Just enough poorer that I see no reason to switch, which is very bad for Microsoft.  There are neat little features, like pop-up feed links for blog posts and previews.  I like it, but it’s not much.  Where Bing shines is image search, which already incorporates similar-image search (Google still has theirs in Labs).  Google Similar Images knocked my socks off at first, but then it just seemed like it should be renamed Google Identical Images.  Not much diversity.  Bing got this part right.  The images are similar, not identical.  The collection is diverse and the navigation is great.  Kudos, Live Labs, for that one.  Is it perfect?  Nope, but it’s better than what I was using.

The next toy was Google Squared, which inspired this tweet right after I tried it:

Google Squared.  You had me at hello.


Further playing around with it convinced me that this would have been a nice tool to have when I was doing ridiculous term papers in high school.  Term papers about crap I didn’t care about.  Basically random stuff.  G^2 is great for that, but really not very helpful otherwise.  It was pretty awesome finding out the number of victims of 30 different serial killers all at once, though.  As quality improves (assuming it does), this could be pretty useful.  Quality has to get there first, though.  90% of the time using it is trial and error, trying to find something that works.  I was able to add some sorting algorithms to a square, but couldn’t find a single column to add that actually had something in it (that wasn’t absurd).  Wolfram|Alpha is still the winner in the knowledge engine department, methinks.

Some Google Squared Results



Perhaps you’ve heard of the latest brainchild of the Wunderkind Stephen Wolfram: Wolfram|Alpha.  Matthew Hurst nicknamed it Alphram today and I agree that’s a much better name.   Wolfram|Alpha (W|A henceforth) is not a search engine; it’s a knowledge engine.  It will compete with Google on a slice of traffic that Google really isn’t all that hot at for now: comparative question answering.  When you ask Google something like “How does the GDP of South Africa compare to China’s?” you hope you get back something relevant in the first few results (spoiler alert:  you don’t).  When you ask that of W|A, you get exactly what you’re looking for.  Beautiful.  W|A’s so-called natural language interface isn’t perfect, though.  You get a lot of flakiness from it until you start to recognize what works and what doesn’t.

Now let’s be honest.  How often do we search for that kind of thing?  Not very often.  I think that’s partly because Google is notoriously bad at it.  Once we start to get a handle on what W|A is capable of, I think people will start expecting more of their friendly neighborhood search giant.  Google claims to have a few tricks up its sleeve, but everything I’ve seen out of Google lately has been such a disappointment that I am deeply skeptical.  The new trick is called Google Squared, and it returns search results in a spreadsheet format, breaking down the various facets of the things you are searching for.  In the demo, a search for roller coasters shows stuff like drop speeds, heights, etc.  You can add to the square and do some pretty nifty stuff.  TechCrunch claims this will kill W|A.  I think the two could be complementary.  Based on the demo, I expect W|A will return results of a higher calibre, but will miss out on a lot of queries because the knowledge is just missing.  Google Squared appears to be doing something fuzzier and will return results that might be really bad.  So while W|A just says it doesn’t know, Google Squared will let you pick through the junk to find the gem.  Google Squared is expected to launch later this month in Google Labs.

Many have said that where W|A will really compete is against Wikipedia, and I am inclined to agree.  There are plenty of things I go to Wikipedia for now that I will probably switch over to W|A for, like populations of countries, sizes of Neptune’s moons, and so on.  Wikipedia still wins for more in-depth knowledge on a topic.  W|A also does some pretty cool stuff when you search for the definition of a word (use a query like “word kitten“).  You learn that kitten comes from Classical Latin, and entered English about 700 years ago.  You can find out a similar thing (and go further in depth for the etymology, at least) using the American Heritage dictionary, but W|A requires less digging.

And this brings me around to a key point with W|A.  It’s an awesome factoid-answering service.  It does it well and it does it in a pretty way.  Stuff you could find in more depth elsewhere you can get quickly and easily, if only superficially, via W|A.  There are links to more information, though, so you don’t lose much by relying on W|A, assuming it has knowledge about what you’re looking for.  You’re still more likely to hit a brick wall with W|A.

And of course, since Wolfram developed Mathematica, W|A is backed by it.  Enter an equation and you get some really handy math info back.  Need to quickly know the derivative of a fairly complicated equation?  Presto.  Probably the most satisfying feeling I got today was from a query similar to “what is the area under x^4+3x^2+4 from 1 to 8?”  Let’s see you answer that, Google Squared.
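For the curious, that query is just a definite integral, and you can check W|A’s answer with the fundamental theorem of calculus in a few lines of Python (standard library only):

```python
# Area under x^4 + 3x^2 + 4 from 1 to 8, via the antiderivative.
from fractions import Fraction

def F(x):
    # Antiderivative of x^4 + 3x^2 + 4 is x^5/5 + x^3 + 4x.
    x = Fraction(x)
    return x**5 / 5 + x**3 + 4*x

area = F(8) - F(1)
print(area)         # 35462/5
print(float(area))  # 7092.4
```

Using `Fraction` keeps the result exact instead of accumulating floating-point error.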

Wolfram|Alpha sample results


I just finished reading about relevance-based language models for information retrieval (Lavrenko and Croft, 2001).  It’s an old paper, but some new stuff I was checking into relied on something else which relied on it — you know how the story goes.

In information retrieval, many retrieval models have been used over the years.  Word on the street is that Google uses the vector space model, where the words in a document are represented as a vector.  Each word is its own dimension, and the magnitude along that dimension is some weighting based on the number of times the word appears in that document.  A new query is converted into a vector in this space, and how well a document matches the query is measured by the similarity between the two vectors (typically the cosine of the angle between them, rather than a literal distance).  This glosses over a lot of details, but that’s the general idea.
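As a concrete illustration (not Google’s actual implementation, which is unpublished and far more elaborate), here is the bare-bones version of the idea, using raw term counts as the weights and cosine similarity for ranking:

```python
# Minimal vector space model: documents and queries become term-count
# vectors, and relevance is the cosine similarity between them.
# Real systems use tf-idf weighting and heavy optimization.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = ["the cat sat on the mat",
        "dogs and cats living together",
        "stock market news"]
vectors = [Counter(d.split()) for d in docs]

query = Counter("cat mat".split())
ranked = sorted(range(len(docs)),
                key=lambda i: cosine(query, vectors[i]), reverse=True)
print(ranked[0])  # 0 -- "the cat sat on the mat" ranks first
```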

Another technique is to use language modeling.  A language model is built for each document, and documents are ranked either by the likelihood the document’s model assigns to the query or by the divergence between the query’s language model and each document’s.  Again, a multitude of details have been glossed over.  The language modeling approach does a great job, and seems to be more theoretically grounded than the vector space model.  However, the vector space model performs really well, and there are many optimizations that make it easy to compute for huge datasets.
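As a toy illustration, here is the query-likelihood flavor of language model retrieval.  The use of Jelinek-Mercer smoothing against the collection model, and the choice of lambda = 0.5, are my own assumptions for the example, not anything from the paper discussed here:

```python
# Query-likelihood retrieval: score each document by the log-probability
# its smoothed unigram model assigns to the query.
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "dogs and cats living together",
        "stock market news"]
doc_counts = [Counter(d.split()) for d in docs]
coll = Counter(w for d in docs for w in d.split())
coll_len = sum(coll.values())
LAM = 0.5  # Jelinek-Mercer interpolation weight

def p_word(w, counts):
    # Mix the document model with the collection model so that no
    # collection word ever gets probability zero.
    p_doc = counts[w] / sum(counts.values())
    p_coll = coll[w] / coll_len
    return LAM * p_doc + (1 - LAM) * p_coll

def score(query, counts):
    # Log P(query | document model), assuming independent query terms.
    return sum(math.log(p_word(w, counts)) for w in query.split())

best = max(range(len(docs)), key=lambda i: score("cat mat", doc_counts[i]))
print(best)  # 0
```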

One thing retrieval models have tried to do is model the set of documents relevant to a query.  These are the documents you want to return when a person searches for something.  If you knew the exact set of these documents, your job would be done and information retrieval would be solved.  So it’s not an easy task, and it is further complicated by the fact that not everybody agrees on which documents are relevant for a particular query.  In probabilistic retrieval models this was done mainly with clunky heuristics that weren’t theoretically sound.  What Lavrenko and Croft (2001) did was create a formal approach to estimating the relevance model without any training data.  Sounds sweet, right?

What it amounts to is something called pseudo-relevance feedback.  Relevance feedback is the case where results are refined for queries based on labeled training data.  We know some relevant documents for certain queries, so we can use that to improve results for new queries.  Pseudo-relevance feedback requires no labeled data, but instead finds a way to simulate having the relevant documents.  Lavrenko and Croft did this by approximating the probability that a word would appear in the set of relevant documents by calculating the probability that the word co-occurs with the query terms.
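Here is a rough sketch of that estimate, close in spirit to the paper’s i.i.d. sampling method with a uniform document prior.  The toy corpus, the smoothing scheme, and the parameter values are all mine, chosen just to make the idea visible:

```python
# Sketch of a relevance model estimate: P(w|R) is approximated by how
# strongly w co-occurs with the query, i.e. each document's smoothed
# unigram model weighted by that document's query likelihood P(Q|D).
from collections import Counter

docs = ["the cat sat on the mat",
        "a cat and a kitten played",
        "stock market news today"]
doc_counts = [Counter(d.split()) for d in docs]
coll = Counter(w for d in docs for w in d.split())
coll_len = sum(coll.values())
LAM = 0.5  # Jelinek-Mercer smoothing weight (my choice for the example)

def p_word(w, counts):
    return (LAM * counts[w] / sum(counts.values())
            + (1 - LAM) * coll[w] / coll_len)

def query_likelihood(query, counts):
    p = 1.0
    for w in query.split():
        p *= p_word(w, counts)
    return p

def relevance_model(query):
    # P(w|R) proportional to sum over D of P(w|D) * P(Q|D):
    # no labeled relevant documents required.
    weights = [query_likelihood(query, c) for c in doc_counts]
    total = sum(weights)
    return {w: sum(p_word(w, c) * wt
                   for c, wt in zip(doc_counts, weights)) / total
            for w in coll}

pwr = relevance_model("cat")
# Words from cat-heavy documents (like "kitten") get more relevance
# mass than words from the off-topic stock document.
print(pwr["kitten"] > pwr["stock"])  # True
```

The resulting distribution can then be used to expand the query or to rank documents by divergence from it.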

The handy part is you don’t have to do any pesky parameter estimation.  We just have to compute a bunch of probabilities, do some smoothing, and then hold our collective breath.  Check out the paper for details. 


Lavrenko, V. and Croft, W. B. 2001. Relevance-based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’01), New Orleans, Louisiana. ACM, New York, NY, 120-127.

Cuil Fail

Posted: 28 July 2008 in Uncategorized

The blagoblag is abuzz with word of Cuil, a search engine launched by some former Google engineers.  After many hours of downtime, I was able to check it out a short while ago.  The unfortunate result:  it blows.  It’s about as bad as your brain can comprehend.  Supposedly it indexes three times as many sites as Google.  Well, that may be because they haven’t filtered out any of the spam sites.  A search for “mendicant bug” yields multiple spam copies of my blog and some WordPress category pages in the first ten pages of results.  My blog is conspicuously missing.  A search for my name also yields pathetic garbage.  Multiple other searches all led to the same thing:  spam pages get the highest rankings.

If your goal is searching for spam, then try out cuil.  You might get lucky and get infected by some nasty spyware.  Otherwise, don’t waste your time.

When you go to a search engine, you have an information need. There is something you are searching for that you can only articulate imprecisely and you do so in a few words. People are good at determining if something satisfies their information need, but not so great at stating it clearly. Librarians are trained to elicit this information need from you, by force if necessary. (Just kidding, librarian mafia, don’t hurt me!) Their method is a dialogue where they probe the various aspects of what you are searching for, what you are not searching for, what you already know about it, etc.

A search engine can’t engage in this dialogue, yet, but think about how you interact with a search engine. You start off with this information need (at whatever degree of vagueness) in mind and probably compose a short 2-3 word query. How often do you do one-word queries? We’ve been trained by search engines that this rarely succeeds unless it’s a low-frequency word (or a brand name or jargon). Our first query brings up some useful stuff perhaps, but usually we see that we weren’t thinking clearly about our information need, and we begin honing it over the next couple of queries until we find what we need. Some people are better at forming this mental picture and stating clear queries from the beginning [citation needed], but most people need to narrow it down.

The queries we use for Google are often purely keyword queries, though sometimes we use slightly more sophisticated ones with link: or site: (etc.) operators. You can make sure terms are included with the + operator and excluded with the – operator. You can even use the wildcard operator (*), which can be nice (but also touchy). What you can’t do is issue structured queries. You can’t search for things like (nice or sweet) and (man or guy). You can’t search for words that co-occur within certain spans of a document (like 50-word windows). These things can be very helpful to an experienced researcher, and having this ability over a web corpus the size of Google’s would be enormously helpful. Unfortunately, the computational and storage costs of such a thing are much higher.
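To make the idea concrete, here is a toy positional inverted index supporting exactly those two kinds of queries, boolean combinations and co-occurrence windows.  It’s an illustration of the concept, not how any production engine implements it:

```python
# Toy positional inverted index supporting boolean OR/AND combinations
# and co-occurrence within an n-word window.
from collections import defaultdict

docs = ["a nice man walked by",
        "what a sweet guy",
        "the man was not nice at all"]

# positional index: word -> {doc_id: [positions]}
index = defaultdict(lambda: defaultdict(list))
for doc_id, d in enumerate(docs):
    for pos, w in enumerate(d.split()):
        index[w][doc_id].append(pos)

def match_or(*words):
    # Docs containing any of the given words.
    out = set()
    for w in words:
        out |= set(index[w])
    return out

def within_window(w1, w2, window):
    # Docs where w1 and w2 co-occur within `window` words of each other.
    hits = set()
    for doc_id in set(index[w1]) & set(index[w2]):
        if any(abs(p - q) <= window
               for p in index[w1][doc_id] for q in index[w2][doc_id]):
            hits.add(doc_id)
    return hits

# (nice OR sweet) AND (man OR guy)
print(match_or("nice", "sweet") & match_or("man", "guy"))  # {0, 1, 2}
# "nice" within 2 words of "man"
print(within_window("nice", "man", 2))  # {0}
```

The storage cost shows up immediately: supporting window queries means keeping every position of every word, which is exactly the index blowup the paragraph above alludes to.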

So my question for you, reader, is would you even use this?  Would this be used by very many people or just the odd few researchers, paralegals, etc?  Computationally, I think Google could handle this.  The problem would come from the larger index to handle supporting such queries.  Even this would probably not be unreasonable for Google at this point.  So… why not?  My guess is the cost of doing such a thing (moderate to high) versus the customer demand (low to nil).

Am I wrong?