Message-ID: <440BD048.4060806@aol.com>
Date: Mon, 06 Mar 2006 11:31:44 +0530
From: Prasenjit Mukherjee
To: java-user@lucene.apache.org
Subject: Distributed Lucene..

I already have an implementation of a distributed crawler farm, where crawler instances are running on different boxes. I want to come up with a distributed indexing scheme using Lucene that takes advantage of my crawlers' distributed nature. Here is what I am thinking. The crawlers will analyze and tokenize the content of every URL (aka document) and create the following data for each URL document:

>
> ......
>

Then, based on some partitioning function, the crawlers can send a subset of the tokens (aka terms) to the appropriate indexing server. The partitioning function can be as simple as one based on the starting character of each term. Let's say we have 5 indexers; we would distribute the indexing data in the following manner:

Indexer1 - a-e
Indexer2 - f-j
Indexer3 - k-o
Indexer4 - p-t
Indexer5 - u-z

Does this make any sense? I would also like to know if there are other ways to distribute Lucene's indexing/searching.

thanks,
prasen
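Below is a minimal sketch of the character-range partitioning described above, in Java since the indexers would be Lucene-based. The class name TermRangePartitioner, the integer "indexer id", and the handling of non-alphabetic terms are illustrative assumptions, not an existing Lucene API; only the 5-way a-e/f-j/k-o/p-t/u-z split comes from the message.

// Illustrative sketch of the proposed first-character partitioning.
// Assumption: 5 indexer nodes identified by ids 0..4; terms that do not
// start with a Latin letter are routed to the last bucket.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermRangePartitioner {

    private static final int NUM_INDEXERS = 5;

    // Maps a term to an indexer id based on its first character:
    // a-e -> 0, f-j -> 1, k-o -> 2, p-t -> 3, u-z -> 4.
    public static int indexerFor(String term) {
        char c = Character.toLowerCase(term.charAt(0));
        if (c < 'a' || c > 'z') {
            return NUM_INDEXERS - 1;              // digits, punctuation, etc.
        }
        int bucket = (c - 'a') / 5;               // five-letter ranges
        return Math.min(bucket, NUM_INDEXERS - 1); // 'z' folds into the u-z bucket
    }

    // Groups one document's terms by target indexer, so a crawler can ship
    // one batch per indexing server.
    public static Map<Integer, List<String>> partition(List<String> terms) {
        Map<Integer, List<String>> byIndexer = new TreeMap<>();
        for (String term : terms) {
            byIndexer.computeIfAbsent(indexerFor(term), k -> new ArrayList<>())
                     .add(term);
        }
        return byIndexer;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apache", "lucene", "query", "zebra", "index");
        // apache -> 0 (a-e), index -> 1 (f-j), lucene -> 2 (k-o),
        // query -> 3 (p-t), zebra -> 4 (u-z)
        System.out.println(partition(terms));
    }
}

One design note on this kind of split: first letters of terms are not uniformly distributed, so the five partitions will see uneven index sizes and load, and a multi-term query has to fan out to every indexer whose range covers one of its terms before results can be merged.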