From: Mathieu Lecarme <mathieu@garambrogne.net>
To: java-user@lucene.apache.org
Subject: Re: Does Lucene support partition-by-keyword indexing?
Date: Sun, 2 Mar 2008 12:44:52 +0100


On 2 March 2008 at 03:05, 仇寅 wrote:

> Hi,
>
> I agree with your point that it is easier to partition the index by
> document. But the partition-by-keyword approach scales much better than
> the partition-by-document approach: each query involves communicating
> with a constant number of nodes, while partition-by-doc requires
> spreading the query across all or many of the nodes. I am actually
> doing some small research on this. By the way, the documents to be
> indexed are not necessarily web pages. They are mostly files stored on
> each node's file system.
>
> Node failures are also handled by replicas. The index for each term
> will be replicated on multiple nodes whose nodeIDs are near each
> other. This mechanism is handled by the underlying DHT system.
>
> So, any idea how I can partition the index by keyword in Lucene? Thanks.

When you read a file and tokenize it, you dispatch each token to a
different index, along with a unique Document ID.
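
To make the dispatching concrete, here is a minimal sketch in plain Java.
It does not use Lucene at all: the class name TermPartitionedIndex, the
hash-based choice of partition and the naive tokenizer are all assumptions
made for illustration. In a real setup each partition would be a separate
index on a separate node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: one in-memory posting map per partition, not a Lucene index. */
class TermPartitionedIndex {
    // partitions.get(p) maps a term to the sorted set of document IDs containing it.
    private final List<Map<String, TreeSet<Integer>>> partitions = new ArrayList<>();
    private int nextDocId = 0;

    TermPartitionedIndex(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new TreeMap<>());
        }
    }

    /** Assumption: a term is assigned to a partition by its hash code. */
    int partitionFor(String term) {
        return Math.floorMod(term.hashCode(), partitions.size());
    }

    /** Tokenizes the text, dispatches every token to its partition, returns the new document ID. */
    int addDocument(String text) {
        int docId = nextDocId++;
        // Naive tokenizer: lower-case, then split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            partitions.get(partitionFor(token))
                      .computeIfAbsent(token, t -> new TreeSet<>())
                      .add(docId);
        }
        return docId;
    }

    /** Looks up a single term; only the partition owning that term is consulted. */
    TreeSet<Integer> postings(String term) {
        String key = term.toLowerCase(Locale.ROOT);
        TreeSet<Integer> hits = partitions.get(partitionFor(key)).get(key);
        return hits == null ? new TreeSet<>() : hits;
    }
}
```

A single-term lookup touches exactly one partition, which is the property
you are after. A multi-term query still has to fetch one posting list per
term and merge them by Document ID.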

Can you explain more about the context of your application?

I don't know why you need P2P. Is it for file sharing? If so, the index
should be near the documents.
Is it for distributed computing? Then use central data and Hadoop Map/Reduce.
If you want a cluster of Lucene instances for heavy querying, use
Technorati's rsync + mv trick.
If you persist with Term dispatching, use it only for caching. Each
node provides a Term index of its own Documents. When you search for
something, the parsed query gives you every Term (I can give you code
for that); you first ask which nodes contain that Term, and then you
send the Query to those nodes.
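
The routing step could look roughly like the sketch below. Again this is
plain Java rather than Lucene code, and the names TermDirectory, announce
and nodesFor are hypothetical. The query terms are assumed to have been
extracted from the parsed query already, and the directory is assumed to
fit in memory on whichever node answers the lookup.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical term directory: remembers which nodes announced which terms. */
class TermDirectory {
    private final Map<String, Set<String>> nodesByTerm = new HashMap<>();

    /** Called when a node publishes the terms found in its local index. */
    void announce(String nodeId, Collection<String> terms) {
        for (String term : terms) {
            nodesByTerm.computeIfAbsent(term, t -> new TreeSet<>()).add(nodeId);
        }
    }

    /** Returns the nodes holding at least one query term; only these receive the query. */
    Set<String> nodesFor(Collection<String> queryTerms) {
        Set<String> targets = new TreeSet<>();
        for (String term : queryTerms) {
            targets.addAll(nodesByTerm.getOrDefault(term, Collections.emptySet()));
        }
        return targets;
    }
}
```

This takes the union of the nodes, which suits OR queries. For a query
where every term is required, intersecting the per-term node sets would
narrow the fan-out further.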

M.
