From: Andrzej Bialecki
Date: Fri, 20 Jun 2008 20:28:33 +0200
To: java-user@lucene.apache.org
Subject: Re: Copying a part of index and index structure

Anshum wrote:
> Hey Andrzej,
> Could you tell me as to what research suggests this and why is it this way?
> My calculation says the average load on each server would go down as I would
> know what server to query for an index term as opposed to querying all
> servers for terms.
> I'm looking for a solution wherein I could break up the index based any
> criteria and know what index to query for any input (and not query indexes
> that would lead to zero results).

* Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis
  Plachouras, Fabrizio Silvestri, 2007: Challenges on Distributed Web
  Retrieval:
  "The disadvantage of term partitioning is having to build initially the
  entire global index. This does not scale well, and it is not useful in
  actual large scale Web search engines. There are, however, some advantages
  of this approach in the query processing phase. Webber et al. show that
  term partitioning results in lower utilization of resources [49]. More
  specifically, it significantly reduces the number of disk accesses and the
  volume of data exchanged. Document partitioning however is still better in
  terms of throughput, because of an uneven distribution of work load in
  term partitioning."
* Claudine Badue, Ricardo Baeza-Yates, 2001: Distributed Query Processing
  Using Partitioned Inverted Files (note that their conclusion that global
  partitioning is more efficient than local partitioning is based on a
  crucial assumption of being able to distribute the load efficiently. Other
  papers indicate that this is a very complex issue).
* Claudine Badue, Ramurti Barbosa, Paulo Golgher: Distributed Processing of
  Conjunctive Queries. This paper evaluates the bottlenecks in an engine
  with local index partitioning.
* Justin Zobel, Alistair Moffat, 2006: Inverted Files for Text Search
  Engines
* Claudio Lucchese, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri,
  2006: Mining Query Logs to Optimize Index Partitioning in Parallel Web
  Search Engines
* Ronny Lempel, Shlomo Moran, 2002: Optimizing Result Prefetching in Web
  Search Engines with Segmented Indices

... and quite a few other papers that I don't remember now ... please do a
search for "distributed IR" on ACM or Citeseer.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com