From: Mathieu Lecarme
To: java-user@lucene.apache.org
Subject: Re: Does Lucene support partition-by-keyword indexing?
Date: Sun, 2 Mar 2008 12:44:52 +0100

On 2 March 2008 at 03:05, 仇寅 wrote:

> Hi,
>
> I agree with your point that it is easier to partition the index by
> document. But the partition-by-keyword approach has much greater
> scalability than the partition-by-document approach. Each query involves
> communicating with a constant number of nodes, while partition-by-doc
> requires spreading the query along all or many of the nodes. And I am
> actually doing some small research on this. By the way, the documents to
> be indexed are not necessarily web pages. They are mostly files stored on
> each node's file system.
>
> Node failures are also handled by replicas. The index for each term will
> be replicated on multiple nodes, whose nodeIDs are near to each other.
> This mechanism is handled by the underlying DHT system.
>
> So any idea how I can partition the index by keyword in Lucene? Thanks.

When you read a file and tokenize it, you dispatch each token to a
different index, together with a unique Document ID.

Can you tell me more about the context of your application? I don't see
why you need P2P.

Is it for file sharing? If so, the index should be near the documents.
Is it for distributed computing? Then use central data and Hadoop
Map/Reduce.
If you want a cluster of Lucene nodes for heavy querying, use the
rsync + mv trick of Technorati.

If you persist with Term dispatching, use it only for caching. Each node
provides a Term index of its own Documents. When you search for something,
the parsed query gives you every Term (I can give you code for that). You
first ask which nodes contain those Terms, and then you send the Query to
those nodes.
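To make the Term-dispatching idea a bit more concrete, here is a minimal sketch in plain Java with no Lucene dependency. Everything in it is an assumption for illustration: the names TermRouter, advertise, extractTerms and nodesFor are hypothetical, the in-memory map stands in for whatever lookup service (for example the DHT) answers "which nodes contain this Term", and the naive splitter stands in for the Terms a parsed Lucene query would give you.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from Lucene.
class TermRouter {
    // term -> ids of the nodes whose local index contains that term
    private final Map<String, Set<String>> nodesByTerm = new HashMap<>();

    // Called when a node publishes the terms found in its own documents.
    void advertise(String nodeId, List<String> terms) {
        for (String term : terms) {
            nodesByTerm.computeIfAbsent(term, t -> new TreeSet<>()).add(nodeId);
        }
    }

    // Naive stand-in for query parsing: lowercase, split on non-alphanumerics.
    static List<String> extractTerms(String query) {
        List<String> terms = new ArrayList<>();
        for (String t : query.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Nodes holding at least one query term; only these receive the query.
    Set<String> nodesFor(String query) {
        Set<String> targets = new TreeSet<>();
        for (String term : extractTerms(query)) {
            targets.addAll(nodesByTerm.getOrDefault(term, Collections.<String>emptySet()));
        }
        return targets;
    }

    public static void main(String[] args) {
        TermRouter router = new TermRouter();
        router.advertise("node-a", List.of("lucene", "index"));
        router.advertise("node-b", List.of("index", "hadoop"));
        router.advertise("node-c", List.of("rsync"));

        // The query is then sent only to the nodes returned here.
        System.out.println(router.nodesFor("Lucene index")); // [node-a, node-b]
        System.out.println(router.nodesFor("unknown"));      // []
    }
}
```

The sketch only decides where to send the query. Each selected node would still run the full Query against its own local index, so the Term map acts as a routing cache rather than as the index itself.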
M.