hadoop-common-user mailing list archives

From Tenaali Ram <tenaali...@gmail.com>
Subject Question: index package in contrib (lucene index)
Date Thu, 28 May 2009 22:18:53 GMT

I am trying to understand the code of the index package in contrib, which
builds a distributed Lucene index. I have some very basic questions and would
really appreciate it if someone could help me understand this code:

1) If I already have a Lucene index (divided into shards), should I upload
these shards to HDFS and provide their location, or will the code pick them
up from the local file system?

2) How does the code add a document to the Lucene index? I can see there is
an index selection policy. Assuming the round robin policy is chosen, how is
a document added to the selected shard? This is related to the first
question: is the index to which the new document is added in HDFS or on the
local file system? I read in the README that the index is first created on
the local file system and then copied back to HDFS. Can someone please point
me to the code that does this?

3) After the MapReduce job finishes, where are the final indexes? In HDFS?

4) Correct me if I am wrong: the code builds multiple indexes, where each
index is a separate Lucene index holding a disjoint subset of the documents
in the corpus. So, to search for a term, I have to search each index and then
merge the results. If this is correct, how is the IDF of a term, which is a
global statistic, computed and updated in each index? Each index can compute
the IDF with respect to its own subset of documents, but it cannot compute
the global IDF of a term, since it knows nothing about the other indexes,
which might contain the same term in other documents.
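
To make question 4 concrete, here is a small sketch of the mismatch I mean.
It is not taken from the index package: the shard contents and the function
names are made up, and the formula is my assumption of what Lucene's default
similarity uses, 1 + ln(numDocs / (docFreq + 1)).

```python
import math

def idf(doc_freq, num_docs):
    # Assumed Lucene-style formula: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards: each shard is a list of documents,
# and each document is the set of terms it contains.
shards = [
    [{"hadoop", "lucene"}, {"hadoop"}, {"index"}],
    [{"lucene"}, {"search"}, {"search", "index"}, {"shard"}],
]

def shard_stats(shard, term):
    # (docFreq, numDocs) as seen by a single shard in isolation.
    doc_freq = sum(1 for doc in shard if term in doc)
    return doc_freq, len(shard)

def local_idfs(term):
    # IDF each shard would compute from its own documents only.
    return [idf(*shard_stats(shard, term)) for shard in shards]

def global_idf(term):
    # IDF computed from statistics summed over all shards.
    stats = [shard_stats(shard, term) for shard in shards]
    total_df = sum(df for df, _ in stats)
    total_docs = sum(n for _, n in stats)
    return idf(total_df, total_docs)

if __name__ == "__main__":
    print("per-shard IDF of 'hadoop':", local_idfs("hadoop"))
    print("global IDF of 'hadoop':   ", global_idf("hadoop"))
```

Each shard scores "hadoop" differently, and neither matches the value computed
over the whole corpus, so scores from different shards do not look directly
comparable when the results are merged.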

