hadoop-common-user mailing list archives

From "Anoop Bhatti" <anoop.bha...@gmail.com>
Subject Re: Distributed Lucene - from hadoop contrib
Date Thu, 14 Aug 2008 17:24:22 GMT

I was able to build a distributed Lucene index with the
hadoop.contrib.index code, and then search that index while it is
still in HDFS.  I did not use Distributed Lucene or Katta.

The key is to give Lucene a Directory backed by the
org.apache.hadoop.dfs.DistributedFileSystem class, through the
FileSystemDirectory class from hadoop.contrib.index (see code below).

I tested this on a Lucene index in a clustered environment, with
pieces of the index residing on different machines, and it does query
successfully.  The search time is fast (although the index is only

I'd like to know if I'm heading down the right path, so my questions are:
* Has anyone tried searching a distributed Lucene index using a method
like this before?  It seems too easy.  Are there any "gotchas" that I
should look out for as I scale up to more nodes and a larger index?

* Do you think this is a good approach?  It consists of
1) creating a Lucene index using the hadoop.contrib.index code
(thanks, Ning!) and 2) leaving that index "in-place" on HDFS and
searching over it using the client code below.

* What is the status of the Bailey project?  It seems to be working on
the same type of problem.  Should I wait until that project comes out
with code?

Here's my code:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.contrib.index.lucene.FileSystemDirectory;
import org.apache.hadoop.dfs.DistributedFileSystem;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

public class LuceneQuery {
    public static void main(String[] args) throws Exception {

        FileSystem fs = new DistributedFileSystem();
        Configuration conf = new Configuration();

        // URI of the master that runs the name node (fs.default.name)
        fs.initialize(new URI("hdfs://master:54310"), conf);

        // path to the Lucene index directory in HDFS
        Path path = new Path("/indexlocation/00000");
        // third argument is false: open the existing index, do not create one
        Directory dir = new FileSystemDirectory(fs, path, false, conf);

        IndexSearcher is = new IndexSearcher(dir);
        Analyzer analyzer = new StandardAnalyzer();
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse("searchTerm");
        Hits hits = is.search(query);

        // print out the "id" field of the results
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("id"));
        }

        // release the searcher and the file system handle
        is.close();
        fs.close();
    }
}


Anoop Bhatti
Committed to open source technology.
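
A note on scaling up: the client code above opens a single index
directory (/indexlocation/00000).  If hadoop.contrib.index produces
several shards, one option is to query each shard separately and merge
the results on the client.  The following is only a minimal sketch of
that merge step in plain Java, with no Hadoop or Lucene dependency.
The ShardMerge and ShardHit names, the shard name "00001" and the
scores are hypothetical, made up for illustration.  It also assumes
that scores from different shards can be compared directly, which may
not hold exactly when each shard computes its own term statistics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ShardMerge {

    /** One hit: the shard it came from, its "id" field value, and its score. */
    public static final class ShardHit {
        public final String shard;
        public final String id;
        public final float score;

        public ShardHit(String shard, String id, float score) {
            this.shard = shard;
            this.id = id;
            this.score = score;
        }
    }

    /** Concatenates the per-shard hit lists and keeps the k highest-scoring hits. */
    public static List<ShardHit> mergeTopK(List<List<ShardHit>> perShard, int k) {
        List<ShardHit> all = new ArrayList<>();
        for (List<ShardHit> hits : perShard) {
            all.addAll(hits);
        }
        // Sort by score, highest first; assumes scores are comparable across shards.
        all.sort(Comparator.comparingDouble((ShardHit h) -> h.score).reversed());
        int limit = Math.max(0, Math.min(k, all.size()));
        return new ArrayList<>(all.subList(0, limit));
    }

    public static void main(String[] args) {
        // Hypothetical results from two shards; in practice each list would
        // come from running the same query against one shard directory.
        List<ShardHit> shard0 = Arrays.asList(
                new ShardHit("00000", "doc-a", 0.9f),
                new ShardHit("00000", "doc-b", 0.4f));
        List<ShardHit> shard1 = Arrays.asList(
                new ShardHit("00001", "doc-c", 0.7f));

        for (ShardHit h : mergeTopK(Arrays.asList(shard0, shard1), 2)) {
            System.out.println(h.shard + " " + h.id + " " + h.score);
        }
    }
}
```

Merging on the client keeps each shard search independent, at the cost
of fetching up to k hits from every shard before sorting.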

On Tue, Aug 12, 2008 at 7:19 PM, Deepika Khera <deepikak@collarity.com> wrote:
> Thank you for your response.
> I was imagining the 2 concepts of i) using hadoop.contrib.index to index
> documents ii) providing search in a distributed fashion, to be all in
> one box.
> So basically, hadoop.contrib.index is used to create Lucene indexes in
> a distributed fashion (by creating shards, each shard being a Lucene
> instance). And then I can use Katta or any other Distributed Lucene
> application to serve Lucene indexes distributed over many servers.
> Deepika
> -----Original Message-----
> From: Ning Li [mailto:ning.li.00@gmail.com]
> Sent: Friday, August 08, 2008 7:08 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Distributed Lucene - from hadoop contrib
>> 1) Katta and Distributed Lucene are different projects though, right?
>> Both being based on kind of the same paradigm (Distributed Index)?
> The design of Katta and that of Distributed Lucene are quite different
> last time I checked. I pointed out the Katta project because you can
> find the code for Distributed Lucene there.
>> 2) So, I should be able to use the hadoop.contrib.index with HDFS.
>> Though, it would be much better if it is integrated with "Distributed
>> Lucene" or the "Katta project" as these are designed keeping the
>> structure and behavior of indexes in mind. Right?
> As described in the README file, hadoop.contrib.index uses map/reduce
> to build Lucene instances. It does not contain a component that serves
> queries. If that's not sufficient for you, you can check out the
> designs of Katta and Distributed Index and see which one suits your
> use better.
> Ning
