hadoop-common-user mailing list archives

From: Enis Soztutar <enis.soz.nu...@gmail.com>
Subject: Re: map/reduce and Lucene integration question
Date: Thu, 13 Dec 2007 09:36:31 GMT
Hi,

Nutch indexes documents in the org.apache.nutch.indexer.Indexer class. In
the reduce phase, the documents are output wrapped in ObjectWritable. The
OutputFormat opens an IndexWriter on the local filesystem
(FileSystem.startLocalOutput()), adds all the documents that were
collected, and then puts the finished index into DFS
(FileSystem.completeLocalOutput()). The resulting index has one partition
per reducer.
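
For illustration, here is a minimal sketch of that pattern. It is not
Nutch's actual Indexer code: it assumes a Hadoop 0.19-era
org.apache.hadoop.mapred API and Lucene 2.x, and the key/value types and
field names ("url", "content") are made up for this example. Both snippets
share these imports:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.ObjectWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.util.Progressable;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

The reduce side builds a Lucene Document and wraps it in ObjectWritable.
This works even though Document is not Writable, because reduce output is
handed straight to the RecordWriter without being serialized:

  // Hypothetical reducer: key = document location, values = field text.
  public class IndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, ObjectWritable> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, ObjectWritable> output,
                       Reporter reporter) throws IOException {
      Document doc = new Document();
      doc.add(new Field("url", key.toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
      while (values.hasNext()) {
        doc.add(new Field("content", values.next().toString(),
                          Field.Store.NO, Field.Index.TOKENIZED));
      }
      output.collect(key, new ObjectWritable(doc));
    }
  }

The OutputFormat then writes the index on the local disk of the reduce
node and moves it into DFS when the task closes:

  public class LuceneOutputFormat
      implements OutputFormat<Text, ObjectWritable> {

    public RecordWriter<Text, ObjectWritable> getRecordWriter(
        FileSystem fs, JobConf job, String name, Progressable progress)
        throws IOException {

      // Final location of this partition in DFS, e.g. <output>/part-00000.
      final Path perm = new Path(FileOutputFormat.getOutputPath(job), name);
      // Scratch location on the local disk of the reduce node.
      final Path temp = job.getLocalPath("index/_" + name);
      final FileSystem dfs = FileSystem.get(job);

      // startLocalOutput() hands back a local path; Lucene writes there.
      Path local = dfs.startLocalOutput(perm, temp);
      final IndexWriter writer =
          new IndexWriter(local.toString(), new StandardAnalyzer(), true);

      return new RecordWriter<Text, ObjectWritable>() {
        public void write(Text key, ObjectWritable value) throws IOException {
          // The reducer wrapped a Lucene Document; unwrap and index it.
          writer.addDocument((Document) value.get());
        }

        public void close(Reporter reporter) throws IOException {
          writer.optimize();
          writer.close();
          // Copy the finished local index into DFS.
          dfs.completeLocalOutput(perm, temp);
        }
      };
    }

    public void checkOutputSpecs(FileSystem fs, JobConf job)
        throws IOException {
      // No preconditions checked in this sketch.
    }
  }

Since each reduce task builds its own part-NNNNN index, a searcher either
queries the partitions in parallel or merges them afterwards (e.g. with
Lucene's IndexWriter.addIndexes()).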

Eugeny N Dzhurinsky wrote:
> Hello!
>
> We would like to use Hadoop to index a lot of documents, and we would like
> to have this index in Lucene and utilize Lucene's search engine power for
> searching.
>
> At this point I am a bit confused - when we analyze documents in the map
> part, we will end up with
> - the document name/location
> - a list of name/value pairs to be indexed by Lucene somehow
>
> As far as I know I can write the same key with different values to the
> OutputCollector, but I'm not sure how I can pass a list of name/value pairs
> to the collector - or do I need to think about this in a different way?
>
> Another question is how I can write a Lucene index in the reduce part,
> since as far as I know reduce can be invoked on any computer in the
> cluster, while a Lucene index requires a non-DFS filesystem for its index
> and helper files.
>
> I heard that Nutch can use Map/Reduce to index documents and store them in
> a Lucene index, but a quick look at its sources didn't give me a solid view
> of how it does this, or whether it does it in the way I described at all.
>
> Probably I'm missing something, so could somebody please point me in the
> right direction?
>
> Thank you in advance!
>
