Saturday, February 14, 2009

Nutch Search Engine: Useful information sources for beginners

I have been dabbling with Nutch, an open source web search engine written in Java, for the past 6 months or so. The idea is to use Nutch to crawl, index and search our Intranet. Over time, I have read and browsed through a lot of articles, discussions, mailing list threads and other online documentation about Nutch and best practices in its usage. Here, I am listing a few of the useful URLs, along with some tips and tricks that should help in taking a systematic approach to understanding Nutch from scratch.

Getting the Nutch source, be it a stable version (currently, the latest one is Nutch 0.9) or a nightly build, and setting it up to do its job is fairly straightforward, thanks to the wealth of documentation that has become available over the years. For beginners who are considering using Nutch, the best place to start learning about it is definitely the two-part "Introduction to Nutch" by Tom White here and here.

Yes, these articles are rather dated, but they do a great job of introducing the basic concepts behind the Nutch Project.

Another gem is the paper "Building Nutch: Open Source Search", by Mike Cafarella and Doug Cutting. The latter also happens to be the "father" of Lucene, Nutch and Hadoop.

There was an APress book released a few years back (around the same time as Tom White's articles above) titled "Building Search Applications with Lucene and Nutch". It was authored by Jon Shoberg.

I guess this book would also be a good starting point for basic Nutch and Lucene concepts, but it looks like it is out of stock. I could not get a copy anywhere myself, and had to depend on other sources, such as the Nutch Wiki.

Yes, even on the wiki, some topics are dated and a few others are probably erroneous, but if you are delving deeper, it is still a great help.

Once you start using Nutch, you might observe unexpected behavior at various stages. Parts of its operation might be less than clear, or there could be bugs in parts of the source (well, it is very well designed and built, but still...). One place where you can unearth deeper knowledge in such a scenario is the Nutch user and developer forums here.

You can ask your questions there, and get responses from the original developers, experts, or experienced users. But a word of advice: keep in mind that the majority of the forum members are not there full time to answer user queries, and are only helping the user community as much as they can. So, to get better results, first understand the question or issue you are facing very well, then do a thorough search of the forum, and only if you still cannot find a satisfactory explanation, ask the question, phrasing it to the best of your knowledge. The most important thing to keep in mind is that the subject line must be clear, concise and specific if you hope for useful answers.

I also have a couple of observations about searching the Nutch forums. Nabble search is powered by Lucene, and it does a pretty good job of it. But performance-wise, it appears a bit slow compared to the whole-web search engines. I have found that searching on either Yahoo or Google for the Nutch issue at hand, and following the results to the mail archives and Nabble forum discussions, is a quicker and more helpful approach. Also, it is important to understand that the discussions cover different stable versions and nightly builds of Nutch, so it is very useful to mention the version you are using in all your searches and conversations.
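To illustrate the tip about including the version in your searches, here is a minimal Python sketch. The helper name `build_search_url`, the sample issue text and the choice of Google's search URL are my own assumptions for illustration, not anything that ships with Nutch.

```python
from urllib.parse import urlencode


def build_search_url(issue, version, base="https://www.google.com/search"):
    """Compose a web search URL that pins the query to one Nutch version.

    Hypothetical helper: the name, the defaults and the query layout are
    assumptions made for this example only.
    """
    # Quote the version so the engine treats "Nutch 0.9" as a single phrase.
    query = f'"Nutch {version}" {issue}'
    # urlencode takes care of escaping the quotes and spaces.
    return f"{base}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(build_search_url("fetcher skips urls", "0.9"))
```

The same idea applies when posting to the forum: put the version in the subject line and the body, so that answers written for a different build do not mislead you.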

Finally, a few of the blogs I have come across that discuss Nutch in detail:

Doug Cutting's blog
Sami Siren's blog
Kai Middleton's blog

Last but not least, the best way to understand what is under the hood is definitely to go through the source code of Nutch!

Sunday, January 25, 2009

Introduction to Muktaakara

The open source software development method has been successful in giving us many gems of software, ranging from full-fledged operating systems to extremely helpful little utilities. I have been taking a look at a few open source projects over the last couple of years, mostly from the Apache Software Foundation, and have been using them to varying extents. The fact that I am mainly a .NET developer, while these open source projects generally come in Java or on LAMP solution stacks, has been a bit of a proverbial “thorn in the flesh”, but quite an enjoyable challenge too.

In my open source customization journey, it appears that most of the little subtasks involved are always trying to cross that thin line between being an enterprising challenge and a painful roadblock. I feel that one of the main reasons for this is the lack of proper documentation, which seems synonymous with most open source development projects. So, what you can expect in this blog is a few of the problems I have faced in getting certain things working in the open source software I am using, and an account of how I was able to solve them for myself, if ever. This is in the hope that it might be useful to someone else facing the same kind of issues. I wouldn’t want to call myself an expert, and most of these solutions will be quick fixes that worked for me after going through whatever documentation was available, discussions that have taken place around similar issues, and the like.