hadoop-common-user mailing list archives

From Jun Rao <jun...@almaden.ibm.com>
Subject Re: Question: index package in contrib (lucene index)
Date Fri, 29 May 2009 21:49:05 GMT
Reply inlined below.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com


Tenaali Ram <tenaaliram@gmail.com> wrote on 05/28/2009 03:18:53 PM:

> Hi,
>
> I am trying to understand the code of index package to build a distributed
> Lucene index. I have some very basic questions and would really appreciate
> if someone can help me understand this code-
>
> 1) If I already have Lucene index (divided into shards), should I upload
> these indexes into HDFS and provide its location or the code will pick these
> shards from local file system ?

Yes, you need to copy the existing index into HDFS first.

>
> 2) How is the code adding a document in the lucene index, I can see there is
> a index selection policy. Assuming round robin policy is chosen, how is the
> code adding a document in the lucene index? This is related to first
> question - is the index where the new document is to be added in HDFS or in
> local file system. I read in the README that the index is first created on
> local file system, then copied back to HDFS. Can someone please point me to
> the code that is doing this.
>

See contrib.index.example.
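
As a rough illustration of what a round-robin selection policy does, here is a minimal sketch. It is not the code from contrib.index.example; the class name RoundRobinSketch and the method chooseShardForInsert are hypothetical and used only for illustration. Each new document is assigned to the next shard in turn, wrapping around after the last one.

```java
/**
 * Hypothetical round-robin shard chooser, for illustration only.
 * It is not the policy class shipped with the contrib index package.
 */
public class RoundRobinSketch {
    private final int numShards; // total number of shards being built
    private int next;            // index of the shard that receives the next document

    RoundRobinSketch(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.numShards = numShards;
        this.next = 0;
    }

    /** Returns the shard index for the next inserted document, then advances. */
    int chooseShardForInsert() {
        int chosen = next;
        next = (next + 1) % numShards; // wrap around after the last shard
        return chosen;
    }

    public static void main(String[] args) {
        RoundRobinSketch policy = new RoundRobinSketch(3);
        for (int doc = 0; doc < 7; doc++) {
            System.out.println("doc " + doc + " -> shard " + policy.chooseShardForInsert());
        }
    }
}
```

With a policy like this every shard ends up with roughly the same number of documents, which is why the resulting shards are disjoint subsets of the corpus.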

> 3) After the map reduce job finishes, where are the final indexes ? In HDFS
> ?

They will be in HDFS.

>
> 4) Correct me if I am wrong- the code builds multiple indexes, where each
> index is an instance of Lucene Index having a disjoint subset of documents
> from the corpus. So, if I have to search a term, I have to search each index
> and then merge the result. If this is correct, then how is the IDF of a term
> which is a global statistic computed and updated in each index ? I mean each
> index can compute the IDF wrt. to the subset of documents that it has, but
> can not compute the global IDF of a term (since it knows nothing about other
> indexes, which might have the same term in other documents).
>

This package only deals with index builds. The shards are disjoint and it's
up to the index server to calculate the ranks. For distributed TF/IDF
support, you may want to look into Katta.
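
To make the point about global IDF concrete, here is a minimal sketch of how a search front end could combine per-shard statistics. It assumes each shard can report its document count and the document frequency of a term. The names GlobalIdf and ShardStats are hypothetical and belong to neither this package nor Katta. The formula is the classic Lucene one, idf = 1 + ln(numDocs / (docFreq + 1)). Because the shards are disjoint, both counts can simply be summed.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper showing per-shard versus global IDF, for illustration only. */
public class GlobalIdf {

    /** Statistics one shard reports for a single term (assumed to be available). */
    static final class ShardStats {
        final long numDocs; // documents in this shard
        final long docFreq; // documents in this shard that contain the term

        ShardStats(long numDocs, long docFreq) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
        }
    }

    /** Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Shards are disjoint, so global counts are the sums of the per-shard counts. */
    static double globalIdf(List<ShardStats> shards) {
        long numDocs = 0;
        long docFreq = 0;
        for (ShardStats s : shards) {
            numDocs += s.numDocs;
            docFreq += s.docFreq;
        }
        return idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        List<ShardStats> shards = Arrays.asList(
                new ShardStats(1000, 9),   // term is rare in this shard
                new ShardStats(1000, 99)); // term is common in this shard
        for (ShardStats s : shards) {
            System.out.println("local idf  = " + idf(s.docFreq, s.numDocs));
        }
        System.out.println("global idf = " + globalIdf(shards));
    }
}
```

The two local values differ from each other and from the global one, so scores from different shards are not directly comparable unless the server ranks with the summed statistics.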

> Thanks,
> -T