Saturday, February 14, 2009

Nutch Search Engine: Useful information sources for beginners

I have been dabbling with Nutch, an open source web search engine written in Java, for the past six months or so. The idea is to use Nutch to crawl, index and search our intranet. Over time, I have read through a lot of articles, discussions, mailing list threads and other online documentation about Nutch and the best practices for using it. Here, I am listing a few of the useful URLs, along with some tips and tricks that should help in taking a systematic approach to understanding Nutch from scratch.

Getting the Nutch source, be it a stable version (currently, the latest one is Nutch 0.9) or a nightly build, and setting it up is fairly straightforward, thanks to the wealth of documentation that has become available over the years. For beginners who are considering using Nutch, the best place to start learning about it is definitely the two-part "Introduction to Nutch" by Tom White, here and here.

Yes, these articles are rather dated, but they do a great job of introducing the basic concepts behind the Nutch Project.

Another gem is the paper "Building Nutch: Open Source Search", by Mike Cafarella and Doug Cutting. The latter also happens to be the "father" of Lucene, Nutch and Hadoop.

There was an Apress book released a few years back (around the same time as Tom White's articles above) titled "Building Search Applications with Lucene and Nutch". It was authored by Jon Shoberg.

I guess this book would also be a good starting point for basic Nutch and Lucene concepts, but it looks like it is out of stock. I could not get a copy anywhere myself, and had to depend on other sources, such as the Nutch Wiki.

Yes, even on the wiki, some topics are dated and a few others are probably erroneous, but if you are delving deeper, it is still of great help.

Once you start using Nutch, you might observe a few unexpected behaviors at various stages. Parts of its operation might be less clear, or there could be bugs in parts of the source (well, it is very well designed and built, but still...). One place where you could unearth deeper knowledge in such a scenario is the Nutch user and developer forums here.

You can ask your questions there and get responses from the original developers, experts, or experienced users. A word of advice, though: keep in mind that the majority of the forum members are not there full time to answer user queries, and are only helping the user community as much as they can. So, to get better results, first make sure you understand the doubt or issue you are facing, then search the forum thoroughly. Only if you still cannot find a satisfactory explanation should you post the question, phrased to the best of your knowledge. Most importantly, the subject line of the question must be clear, concise and specific if you hope to get useful answers.

I also have a couple of observations about searching the Nutch forums. Nabble search is powered by Lucene, and it does a pretty good job. Performance-wise, though, it appears a bit slow compared to the general web search engines. I have found that searching Yahoo or Google for the Nutch issue at hand, and following the results to the mail archives and Nabble forum discussions, is a quicker and more helpful approach. Also, it is important to understand that the discussions cover different stable versions and nightly builds of Nutch, so it is very useful to mention the version you are using in all your searches and conversations.

Finally, a few of the blogs I have come across that discuss Nutch in detail:

Doug Cutting's blog
Sami Siren's blog
Kai Middleton's blog

Last, but not least, the best way to understand what is under the hood is definitely to go through the source code of Nutch!
