lucene-java-user mailing list archives

From "Uwe Goetzke" <uwe.goet...@healy-hudson.com>
Subject AW: Does Lucene support partition-by-keyword indexing?
Date Sun, 02 Mar 2008 07:48:19 GMT
Hi,

I do not yet fully understand what you want to achieve.
Do you want to split the index by keyword in order to reduce the time needed to distribute the indexes?
And do you want to distribute the queries to the nodes using the same split mechanism?


You have several nodes with different kinds of documents.
You want to build one index for all nodes, then split and distribute that index based on a set
of keywords specific to each node. You want to do this so that queries can be split and "each query involves
communicating with constant number of nodes".

Do the documents at the nodes contain only such keywords? I doubt it.
So you need a reference anyway that says where the indexed document can be found, so that it can be
retrieved from its node for display.
You could index at each node, merge the indexes from all nodes and distribute the combined
index.
By what criteria can you split the queries? If you have a combined index, each node can distribute
the queries to other nodes based on the statistical data found in the term distribution.
You need to merge the results anyway.
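
A minimal sketch of that merge step, in plain Java without any Lucene classes. The names ResultMerger, Hit, nodeId and docPath are hypothetical and only illustrate the idea that every hit carries a reference to the node holding the document. It also assumes that scores from different nodes are comparable, which in practice only holds if the nodes score against the same term statistics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ResultMerger {
    // Hypothetical hit: which node holds the document, where it lives there, and its score.
    static final class Hit {
        final String nodeId;
        final String docPath;
        final float score;

        Hit(String nodeId, String docPath, float score) {
            this.nodeId = nodeId;
            this.docPath = docPath;
            this.score = score;
        }
    }

    // Combine the hit lists returned by several nodes and keep the k best by score.
    // Assumption: scores from different nodes are directly comparable.
    static List<Hit> mergeTopK(List<List<Hit>> perNode, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perNode) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.max(0, Math.min(k, all.size()))));
    }
}
```

The nodeId in each merged hit is what the displaying node would use to fetch the document itself.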

I doubt that this kind of overhead is worth the trouble, because you introduce a lot of single
points of failure. And the scalability seems limited, because you would need to recalibrate
the whole network when adding a new node. Why don't you distribute the complete index? (We
do this by zipping it locally and unzipping it later on the receiving node; the size to
transfer is less than one third.) Each node should have some activity indicator. Distribute
the complete query to the node with the smallest activity. That way you get redundancy and do not
need to split queries and merge results. OK, one "evil" query can bring a node "down", but the network
is still working.
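
The "send the whole query to the least active node" idea could be sketched roughly as follows. This is only an illustration under assumptions: LeastBusyRouter, Node, the integer activity value and the alive flag are hypothetical names, and how a node measures and reports its activity is left open.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class LeastBusyRouter {
    // Hypothetical node descriptor; "activity" stands for whatever load indicator a node reports.
    static final class Node {
        final String id;
        final int activity;
        final boolean alive;

        Node(String id, int activity, boolean alive) {
            this.id = id;
            this.activity = activity;
            this.alive = alive;
        }
    }

    // Pick the reachable node with the smallest activity; empty if no node is reachable.
    static Optional<Node> pick(List<Node> nodes) {
        return nodes.stream()
                .filter((Node n) -> n.alive)
                .min(Comparator.comparingInt((Node n) -> n.activity));
    }
}
```

Because every node holds the complete index, a node that is down is simply skipped and the query goes to the next least active one.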

Do you have any results from using Lucene on a single node for your approach? How many queries
and how many documents do you expect?

Regards

Uwe

-----Original Message-----
From: allenchue@gmail.com [mailto:allenchue@gmail.com] On behalf of ??
Sent: Sunday, 2 March 2008 03:05
To: java-user@lucene.apache.org
Subject: Re: Does Lucene support partition-by-keyword indexing?

Hi,

I agree with your point that it is easier to partition the index by document.
But the partition-by-keyword approach scales much better than the
partition-by-document approach. Each query involves communicating with a
constant number of nodes, while partition-by-doc requires spreading the
query across all or many of the nodes. And I am actually doing some small
research on this. By the way, the documents to be indexed are not
necessarily web pages. They are mostly files stored on each node's file
system.

Node failures are also handled by replicas. The index for each term will be
replicated on multiple nodes, whose nodeIDs are near to each other. This
mechanism is handled by the underlying DHT system.
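
The term-to-node mapping with neighbouring replicas could look roughly like this. It is only a sketch and not the API of any particular DHT: the class TermRing, the integer ring positions standing in for nodeIDs, and the use of String.hashCode() as the key hash are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class TermRing {
    // Ring position -> node name. The positions stand in for DHT nodeIDs.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(int nodeId, String name) {
        ring.put(nodeId, name);
    }

    // Illustrative hash only; a real DHT would use its own key hashing.
    static int position(String term) {
        return term.hashCode() & 0x7fffffff;
    }

    // The owner is the first node at or after the term's position (wrapping around);
    // the replicas are the nodes that follow it on the ring.
    List<String> nodesFor(String term, int replicas) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) {
            return result;
        }
        Integer key = ring.ceilingKey(position(term));
        if (key == null) {
            key = ring.firstKey();
        }
        int wanted = Math.min(replicas, ring.size());
        while (result.size() < wanted) {
            result.add(ring.get(key));
            key = ring.higherKey(key);
            if (key == null) {
                key = ring.firstKey();
            }
        }
        return result;
    }
}
```

A TermQuery would then go to the first node returned for its term, and to the next one in the list if that node does not answer.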

So, any idea how I can partition the index by keyword in Lucene? Thanks.

On Sun, Mar 2, 2008 at 5:50 AM, Mathieu Lecarme <mathieu@garambrogne.net>
wrote:

> The easiest way is to split the index by Document. In Lucene, an index
> contains Documents and an inverted index of Terms. If you want to put Terms
> in different places, each Document will be duplicated in each index, with
> only a part of its Terms.
>
> How will you manage node failure in your network?
>
> There have been some attempts to build a big P2P search engine to compete with
> Google, but it will be easier to split by Document.
>
> If you have too many computers and want to see them working together,
> why not use Nutch with Hadoop?
>
> M.
> On 1 March 08 at 19:16, Yin Qiu wrote:
>
> > Hi,
> >
> > I'm planning to implement a search infrastructure on a P2P overlay. To
> > achieve this, I want to first distribute the indices to various nodes
> > connected by this overlay. My approach is to partition the indices by
> > keyword, that is, one node takes care of certain keywords (or terms).
> > When a simple TermQuery is encountered, we just find the node associated
> > with that term (with a distributed hash table) and get the result. And
> > suppose a BooleanQuery is issued: we contact all the nodes involved in
> > this query and finally merge the results.
> >
> > So my question is: does Lucene support partitioning the indices by keywords?
> >
> > Thanks in advance.
> >
> > --
> > Look before you leap
> > -------------------------------------------
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Look before you leap
-------------------------------------------

-----------------------------------------------------------------------
Healy Hudson GmbH - D-55252 Mainz Kastel
Geschäftsführer Christian Konhäuser - Amtsgericht Wiesbaden HRB 12076

This email is confidential. If you are not the intended recipient, you must not disclose or
use this information contained in it. If you have received this email in error please tell
us immediately by return email and delete the document.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

