hadoop-common-user mailing list archives

From Jun Rao <jun...@almaden.ibm.com>
Subject Re: Question: index package in contrib (lucene index)
Date Fri, 29 May 2009 21:49:05 GMT
Reply inlined below.

IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099


Tenaali Ram <tenaaliram@gmail.com> wrote on 05/28/2009 03:18:53 PM:

> Hi,
> I am trying to understand the code of the index package to build a
> Lucene index. I have some very basic questions and would really appreciate it
> if someone could help me understand this code.
> 1) If I already have a Lucene index (divided into shards), should I upload
> these indexes into HDFS and provide their location, or will the code pick up
> the shards from the local file system?

Yes, you need to put the old index into HDFS first.

> 2) How is the code adding a document to the Lucene index? I can see there is
> an index selection policy. Assuming the round-robin policy is chosen, how is
> the code adding a document to the Lucene index? This is related to the first
> question: is the index where the new document is to be added in HDFS or on
> the local file system? I read in the README that the index is first created
> on the local file system, then copied back to HDFS. Can someone please point
> me to the code that is doing this?

See contrib.index.example.
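
For readers unfamiliar with the idea, a round-robin policy simply hands each new document to the next shard in rotation. The sketch below only illustrates that idea; it is not the code from contrib/index, and the class name RoundRobinSketch and its method are made up for this example.

```java
/**
 * Illustrative only: a minimal round-robin shard chooser.
 * RoundRobinSketch is a hypothetical name, not a class from contrib/index.
 */
public class RoundRobinSketch {
    private final int numShards;
    private int next = 0; // shard that will receive the next inserted document

    public RoundRobinSketch(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.numShards = numShards;
    }

    /** Returns the shard for the next document, then advances the rotation. */
    public int chooseShardForInsert() {
        int chosen = next;
        next = (next + 1) % numShards;
        return chosen;
    }

    public static void main(String[] args) {
        RoundRobinSketch policy = new RoundRobinSketch(3);
        for (int doc = 0; doc < 5; doc++) {
            System.out.println("doc " + doc + " -> shard " + policy.chooseShardForInsert());
        }
    }
}
```

With three shards, consecutive inserts cycle through shards 0, 1, 2 and then wrap around to 0 again.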

> 3) After the map reduce job finishes, where are the final indexes ? In
> ?

They will be in HDFS.

> 4) Correct me if I am wrong: the code builds multiple indexes, where each
> index is an instance of a Lucene index holding a disjoint subset of documents
> from the corpus. So, if I have to search for a term, I have to search each
> index and then merge the results. If this is correct, then how is the IDF of a
> term, which is a global statistic, computed and updated in each index? I mean,
> each index can compute the IDF with respect to the subset of documents that it
> has, but it cannot compute the global IDF of a term (since it knows nothing
> about the other indexes, which might have the same term in other documents).

This package only deals with index builds. The shards are disjoint and it's
up to the index server to calculate the ranks. For distributed TF/IDF
support, you may want to look into Katta.
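
As a rough illustration of what "global" IDF means for disjoint shards: because no document lives in two shards, an index server can add up each shard's document count and per-term document frequency and compute the IDF from the totals. The sketch below assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)). The names GlobalIdfSketch and ShardStats are hypothetical, and this is not a description of how Katta or any particular server implements it.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative only: combine per-shard statistics into a global IDF.
 * GlobalIdfSketch and ShardStats are hypothetical names for this example.
 */
public class GlobalIdfSketch {

    /** Statistics one shard reports for a single term. */
    public static final class ShardStats {
        final long numDocs; // documents held by this shard
        final long docFreq; // documents in this shard that contain the term

        public ShardStats(long numDocs, long docFreq) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
        }
    }

    /** Classic Lucene-style IDF (assumed formula): 1 + ln(numDocs / (docFreq + 1)). */
    public static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    /** Shards are disjoint, so both counts can simply be summed. */
    public static double globalIdf(List<ShardStats> shards) {
        long totalDocs = 0;
        long totalDocFreq = 0;
        for (ShardStats s : shards) {
            totalDocs += s.numDocs;
            totalDocFreq += s.docFreq;
        }
        return idf(totalDocFreq, totalDocs);
    }

    public static void main(String[] args) {
        List<ShardStats> shards = Arrays.asList(
                new ShardStats(1000, 9),  // term is present in 9 docs of shard 0
                new ShardStats(1000, 0)); // term is absent from shard 1
        for (ShardStats s : shards) {
            System.out.println("local idf  = " + idf(s.docFreq, s.numDocs));
        }
        System.out.println("global idf = " + globalIdf(shards));
    }
}
```

The two shards in main report different local values for the same term, which is why scores taken from separate shards are not directly comparable unless the server normalizes them with summed statistics like these.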

> Thanks,
> -T