hadoop-common-user mailing list archives

From "Butler, Mark (Labs)" <mark.butl...@hp.com>
Subject RE: map/reduce and Lucene integration question
Date Fri, 14 Dec 2007 09:59:55 GMT
Hi team,

First off, I would like to say that I am very impressed with Hadoop and very grateful
to everyone who has contributed to it and released this software as open source.

re: Lucene and Hadoop

I am in the process of implementing a Lucene distributed index (DLucene), based on the design
that Doug Cutting outlines here

http://www.mail-archive.com/general@lucene.apache.org/msg00338.html

Current status: I've completed the master / worker implementations and written unit tests.
I haven't yet implemented a throttling and garbage collection policy. The next step is to
write a system test and the client API. Also, and unfortunately this could take a little
time, I need to get permission from a review board here at HP to release the code as open
source. This is in process, but with the lawyers (sigh). DLucene is not big: the code and
unit tests are currently about 4000 lines.

DLucene does not use HDFS, although its design is inspired by HDFS. I decided not to use HDFS
because it is optimized for a certain type of file, and the files in Lucene are a bit different.
However, I've tried to reuse code wherever I can.

There is no explicit integration with MapReduce at the moment. I wasn't aware of the way Nutch
uses MapReduce for indexing; obviously it would be good to support Nutch here.

I've made some small changes to the API Doug outlined. If others are interested, I can post
the revised interfaces. It would also be good to start a discussion about the client API,
and perhaps about how it could be used with MapReduce.

kind regards,

Mark

-----Original Message-----
From: Enis Soztutar [mailto:enis.soz.nutch@gmail.com]
Sent: 13 December 2007 09:37
To: hadoop-user@lucene.apache.org
Subject: Re: map/reduce and Lucene integration question

Hi,

Nutch indexes the documents in the org.apache.nutch.indexer.Indexer class. In the reduce phase,
the documents are output wrapped in ObjectWritable. The OutputFormat opens a local IndexWriter
(FileSystem.startLocalOutput()) and adds all the documents that are collected. It then puts the
index in DFS (FileSystem.completeLocalOutput()).
The resulting index has numReducer partitions.
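
As a rough illustration of the pattern described above (build each partition on local
disk, then publish the finished partition to shared storage), here is a minimal sketch
in plain Python. It is not Nutch or Hadoop code: the function names (partition,
reduce_to_index, build_index), the docs.jsonl file, and the part-NNNNN directory naming
are all assumptions made for the example. A JSON-lines file stands in for a real Lucene
index, and an ordinary directory stands in for DFS.

```python
import json
import shutil
import tempfile
import zlib
from pathlib import Path


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a document key (hypothetical stand-in for a partitioner)."""
    # crc32 is stable across runs, so a key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_to_index(part: int, docs: list, shared_dir: Path) -> Path:
    """Build one index partition on local disk, then publish it to shared storage."""
    final_dir = shared_dir / f"part-{part:05d}"
    # Plays the role of startLocalOutput(): work in a local scratch directory.
    local_dir = Path(tempfile.mkdtemp(prefix=f"index-{part}-"))
    try:
        # A JSON-lines file stands in for the Lucene index files.
        with open(local_dir / "docs.jsonl", "w", encoding="utf-8") as out:
            for name, fields in docs:
                out.write(json.dumps({"name": name, "fields": fields}) + "\n")
        # Plays the role of completeLocalOutput(): copy the finished
        # partition to the shared location in one step.
        shutil.copytree(local_dir, final_dir)
    finally:
        shutil.rmtree(local_dir, ignore_errors=True)
    return final_dir


def build_index(docs: dict, num_reducers: int, shared_dir: Path) -> list:
    """Route each document to a reducer and build one partition per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for name, fields in docs.items():
        # Each value is a whole dict of name/value pairs, so a document
        # travels as a single record rather than one record per field.
        buckets[partition(name, num_reducers)].append((name, fields))
    return [reduce_to_index(i, bucket, shared_dir) for i, bucket in enumerate(buckets)]
```

Calling build_index(docs, 2, some_dir) with a dict that maps document names to field
dicts produces two directories, part-00000 and part-00001, which mirrors the "one
partition per reducer" result. Because the index is only ever written on the local
filesystem, the requirement for a non-DFS filesystem is met no matter which machine
runs the reduce.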

Eugeny N Dzhurinsky wrote:
> Hello!
>
> We would like to use Hadoop to index a lot of documents, and we would
> like to have this index in the Lucene and utilize Lucene's search
> engine power for searching.
>
> At this point I am a bit confused - when we analyze documents in the
> Map part, we will end up with
> - document name/location
> - list of name/value pairs to be indexed by Lucene somehow
>
> As far as I know I can write the same key with different values to the
> OutputCollector, however I'm not sure how I can pass a list of
> name/value pairs to the collector, or do I need to think in some different way?
>
> Another question is how I can write a Lucene index in the reduce part, since
> as far as I know reduce can be invoked on any computer in the cluster
> while a Lucene index requires a non-DFS filesystem to store its indexes and
> helper files?
>
> I heard that Nutch can use Map/Reduce to index documents and store
> them in a Lucene index, however a quick look at its sources didn't give
> me a solid view of how it does this, or whether it does it in the way I described at all.
>
> Probably I'm missing something, so could somebody please point me in
> the right direction?
>
> Thank you in advance!
>
>
