Date: Mon, 04 Dec 2006 17:29:28 -0600
From: Dennis Kubes
To: hadoop-dev@lucene.apache.org
Subject: Re: distributed search

All,

We have two versions of a type of index splitter.

The first version would run an indexing job and then, using the completed
index as input, read the number of documents in the index and take a
requested split size. From this it used a custom index InputFormat to
create splits according to document id. We would run a job that mapped
out index urls as keys and documents, with their ids wrapped in a
SerializableWritable object, as the values. Then, inside a second job
using the index as input, we would have a MapRunner that read the other
supporting databases (linkdb, segments) and mapped all objects as
ObjectWritables. On the reduce side we had a custom Output and
OutputFormat that took all of the objects and wrote out the databases
and indexes into each split. There was a problem with this first
approach, though: writing out an index from a previously serialized
document loses any fields that are not stored, which is most of them
(see the first sketch below). So we went with a second approach.

The second approach takes a number of splits and runs an indexing job on
the fly. It calls the indexing and scoring filters and uses the linkdb,
crawldb, and segments as input. As it indexes, it also splits the
databases and indexes across the number of reduce tasks, so that the
final output is multiple splits, each holding a part of the index and
its supporting databases. Each of the databases holds only the
information for the urls that are in its part of the index (see the
second sketch below). These parts can then be pushed to separate search
servers.
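To make the first approach's problem concrete, here is a minimal sketch
of the core idea (not our actual code; the class name and the assumption
that "url" is a stored field are just for illustration):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;

  public class DocIdSplitter {
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open(args[0]); // path to the index
      int splitSize = Integer.parseInt(args[1]);      // requested docs per split
      for (int id = 0; id < reader.maxDoc(); id++) {
        if (reader.isDeleted(id)) continue;
        int split = id / splitSize; // split number for this document id
        // This is exactly where the approach breaks down: document(id)
        // returns only STORED fields, so indexed-but-unstored fields are
        // lost when the Document is serialized and re-indexed in a split.
        Document doc = reader.document(id);
        System.out.println(split + "\t" + doc.get("url"));
      }
      reader.close();
    }
  }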
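And a rough sketch of how the second approach keeps an index part and
its databases together, assuming the mapred Partitioner interface (names
illustrative, not our actual code):

  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Every record keyed by the same url (indexed document, crawldb entry,
  // linkdb entry) hashes to the same reduce task, so each split's
  // databases contain exactly the urls in that split's part of the index.
  public class UrlSplitPartitioner implements Partitioner {
    public void configure(JobConf job) {}
    public int getPartition(WritableComparable key, Writable value,
                            int numPartitions) {
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

Setting job.setNumReduceTasks(numSplits) then gives one self-contained
part (index plus databases) per reduce task, written out by the custom
OutputFormat.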
This type of splitting works well, but you can NOT define a specific
number of documents or urls per split, and sometimes one split will have
a lot more urls than another if you are indexing sites that have a lot
of pages (e.g. wikipedia or cnn archives).

This is currently how our system works. We fetch, invert links, run
through some other processes, and then index and split on the fly. Then
we use python scripts to pull each split directly from the DFS to each
search server and start the search servers.

We are still working on the splitter, because the ideal approach would
be able to specify a number of documents per split as well as to group
by different keys, not just url (a rough sketch of host-based grouping
follows at the end of this message). I would be happy to share the
current code, but it is highly integrated, so I would need to pull it
out of our code base first. It would be best if I could send it to
someone, say Andrzej, to take a look at first.

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>
> [...]
>> Having a new index on each machine and having to create separate
>> indexes is not the most elegant way to accomplish this architecture.
>> The best way that we have found is to have a splitter job that
>> indexes and splits the index and
>
> Have you implemented a Lucene index splitter, i.e. a tool that takes
> an existing Lucene index and splits it into parts by document id? This
> sounds very interesting - could you tell us a bit about this?
>
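For reference, grouping by a different key just means deriving the
partition from something other than the full url. A rough, untested
sketch that groups by host (note this would make the large-site
imbalance worse, since a whole site lands in one split):

  import java.net.MalformedURLException;
  import java.net.URL;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class HostSplitPartitioner implements Partitioner {
    public void configure(JobConf job) {}
    public int getPartition(WritableComparable key, Writable value,
                            int numPartitions) {
      String group;
      try {
        group = new URL(key.toString()).getHost(); // group pages by host
      } catch (MalformedURLException e) {
        group = key.toString();                    // fall back to full url
      }
      return (group.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }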