Date: Mon, 04 Dec 2006 17:29:28 -0600
From: Dennis Kubes
To: hadoop-dev@lucene.apache.org
Subject: Re: distributed search

All,

We have two versions of a type of index splitter.

The first version would run an indexing job and then, using the completed
index as input, read the number of documents in the index and take a
requested split size. From this it used a custom index InputFormat to
create splits according to document id. We would run a job that mapped
out index urls as keys and documents, with their ids wrapped in a
SerializableWritable object, as the values. Then, inside a second job
using the index as input, we would have a MapRunner that read the other
supporting databases (linkdb, segments) and mapped all objects as
ObjectWritables. On the reduce side we had a custom Output and
OutputFormat that took all of the objects and wrote out the databases
and indexes into each split. There was a problem with this first
approach, though: writing out an index from a previously serialized
document loses any fields that are not stored, which is most of them
(see the first sketch below). So we went with a second approach.

The second approach takes a number of splits and runs an indexing job on
the fly. It calls the indexing and scoring filters and uses the linkdb,
crawldb, and segments as input. As it indexes, it also splits the
databases and indexes across the number of reduce tasks, so that the
final output is multiple splits, each holding a part of the index and
its supporting databases. Each of the databases holds only the
information for the urls that are in its part of the index (see the
second sketch below). These parts can then be pushed to separate search
servers.
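To make the first approach's problem concrete, here is a minimal sketch
of the core idea (not our actual code; the class name and the assumption
that "url" is a stored field are just for illustration):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;

  public class DocIdSplitter {
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open(args[0]); // path to the index
      int splitSize = Integer.parseInt(args[1]);      // requested docs per split
      for (int id = 0; id < reader.maxDoc(); id++) {
        if (reader.isDeleted(id)) continue;
        int split = id / splitSize; // split number for this document id
        // This is exactly where the approach breaks down: document(id)
        // returns only STORED fields, so indexed-but-unstored fields are
        // lost when the Document is serialized and re-indexed in a split.
        Document doc = reader.document(id);
        System.out.println(split + "\t" + doc.get("url"));
      }
      reader.close();
    }
  }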
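And a rough sketch of how the second approach keeps an index part and
its databases together, assuming the mapred Partitioner interface (names
illustrative, not our actual code):

  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Every record keyed by the same url (indexed document, crawldb entry,
  // linkdb entry) hashes to the same reduce task, so each split's
  // databases contain exactly the urls in that split's part of the index.
  public class UrlSplitPartitioner implements Partitioner {
    public void configure(JobConf job) {}
    public int getPartition(WritableComparable key, Writable value,
                            int numPartitions) {
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

Setting job.setNumReduceTasks(numSplits) then gives one self-contained
part (index plus databases) per reduce task, written out by the custom
OutputFormat.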
This type of splitting works well, but you can NOT define a specific
number of documents or urls per split, and sometimes one split will have
a lot more urls than another if you are indexing sites that have a lot
of pages (e.g. wikipedia or cnn archives).

This is currently how our system works. We fetch, invert links, run
through some other processes, and then index and split on the fly. Then
we use python scripts to pull each split directly from the DFS to each
search server and start the search servers.

We are still working on the splitter, because the ideal approach would
be able to specify a number of documents per split as well as to group
by different keys, not just url (a rough sketch of host-based grouping
follows at the end of this message). I would be happy to share the
current code, but it is highly integrated, so I would need to pull it
out of our code base first. It would be best if I could send it to
someone, say Andrzej, to take a look at first.

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>
> [...]
>> Having a new index on each machine and having to create separate
>> indexes is not the most elegant way to accomplish this architecture.
>> The best way that we have found is to have a splitter job that
>> indexes and splits the index and
>
> Have you implemented a Lucene index splitter, i.e. a tool that takes
> an existing Lucene index and splits it into parts by document id? This
> sounds very interesting - could you tell us a bit about this?
>
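For reference, grouping by a different key just means deriving the
partition from something other than the full url. A rough, untested
sketch that groups by host (note this would make the large-site
imbalance worse, since a whole site lands in one split):

  import java.net.MalformedURLException;
  import java.net.URL;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class HostSplitPartitioner implements Partitioner {
    public void configure(JobConf job) {}
    public int getPartition(WritableComparable key, Writable value,
                            int numPartitions) {
      String group;
      try {
        group = new URL(key.toString()).getHost(); // group pages by host
      } catch (MalformedURLException e) {
        group = key.toString();                    // fall back to full url
      }
      return (group.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }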