lucene-java-user mailing list archives

From Paul Smith <psm...@aconex.com>
Subject Re: Best Practices for Distributing Lucene Indexing and Searching
Date Thu, 14 Jul 2005 23:38:25 GMT
My punt was that having workers create sub-indexes (creating the
documents and making a partial index) and shipping the partial index
back to the queen to merge might be more efficient.  It's probably
not; I was just using the day as a chance to see if it looked
promising, and to get my hands dirty with SEDA (and ActiveMQ, a great
JMS provider).  The bit I forgot about until too late was that the
merge is going to have to be 'serialized' by the queen; there's no
real way around that that I could think of.  Worse, because the
.merge(Directory[]) method does an automatic optimize, I had to get
the queen to temporarily store the partial indexes locally until it
had all the results before doing a merge, otherwise the queen would
be spending _forever_ doing merges.
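The "buffer every partial index, then merge exactly once" strategy described above can be sketched as plain bookkeeping. This is only an illustration: the class and method names are mine, and the actual Lucene merge call is stubbed out (in the real system it would be a single expensive merge of all the partial directories, since that call also optimizes).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the queen's "buffer, then merge once" logic. The Lucene call is
// stubbed out; mergePartials() stands in for the one serialized, optimizing
// merge of all partial index directories at the end.
public class MergeBuffer {
    private final int expectedChunks;                        // work units the queen broadcast
    private final List<String> partials = new ArrayList<>(); // local paths of shipped sub-indexes
    private int mergeCount = 0;

    public MergeBuffer(int expectedChunks) {
        this.expectedChunks = expectedChunks;
    }

    // Called when a worker's partial index arrives on the result queue.
    // Returns true once the single final merge has been performed.
    public boolean onPartialIndex(String localPath) {
        partials.add(localPath);
        if (partials.size() < expectedChunks) {
            return false;              // keep buffering; merging per-arrival would be _forever_
        }
        mergePartials();               // one serialized merge at the very end
        return true;
    }

    private void mergePartials() {
        // Stub: real code would merge all directories in `partials` in one call.
        mergeCount++;
    }

    public int merges() { return mergeCount; }

    public static void main(String[] args) {
        MergeBuffer mb = new MergeBuffer(2);
        mb.onPartialIndex("/tmp/part-0");
        boolean merged = mb.onPartialIndex("/tmp/part-1");
        System.out.println("merged after all chunks arrived: " + merged);
    }
}
```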

If you had the workers just creating Documents, I'm not sure the
savings would outweigh the overhead of the SEDA network/JMS
transport.  The document doesn't really get inverted until it's sent
to the Writer, does it?  In that case the queen would be doing 99% of
the work, wouldn't it?  (Defeating the purpose of SEDA.)

What I was aiming at was an easily growable, self-healing indexing
network.  Need more capacity?  Throw in a Mac mini or something, and
let it join as a worker via Zeroconf.  Worker dies?  The work just
gets done somewhere else.  A true Google-like model.

But I don't think I quite got there.

Re: document in multiple indexes.  The queen was the co-ordinator, so
its first task was to work out which entities to index (everything in
our app can be referenced via a URI, e.g. mail://2358 or doc://9375).
The queen works out all the URIs to index, batches them up into a
work unit, and broadcasts that as a work message on the work queue.
The worker takes that package, creates a relatively small index
(5000-10000 documents, whatever the chunk size is), and ships the
directory as a zipped-up binary message back onto the result queue.
Therefore there were never any duplicates: a worker either processed
the work unit, or it didn't.
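The "zip the index directory and ship it as a binary message" step can be sketched with the JDK's zip support. The JMS plumbing is omitted, and the class and method names below are my own illustration, not from Paul's code; the returned byte[] is what would go into a BytesMessage on the result queue.

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.*;

// Sketch of shipping a small partial index as one zipped payload, and of the
// queen unpacking it into a local staging directory before the final merge.
public class IndexShipper {

    // Zip every regular file in dir (a Lucene index directory is flat) into a byte[].
    public static byte[] zipDirectory(Path dir) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf);
             DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) continue;
                zip.putNextEntry(new ZipEntry(f.getFileName().toString()));
                Files.copy(f, zip);
                zip.closeEntry();
            }
        }
        return buf.toByteArray();
    }

    // Queen side: unpack a worker's payload into a local staging directory.
    public static void unzipTo(byte[] payload, Path dest) throws IOException {
        Files.createDirectories(dest);
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(payload))) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                Files.copy(zip, dest.resolve(e.getName()), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("idx");
        Files.writeString(src.resolve("segments"), "fake-segment-data");
        byte[] payload = zipDirectory(src);
        System.out.println("payload bytes: " + payload.length);
    }
}
```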

Paul

On 15/07/2005, at 9:28 AM, Otis Gospodnetic wrote:

> Interesting.  I'm planning on doing something similar for some new
> Simpy features.  Why are your worker bees sending whole indices to the
> Queen bee?  Wouldn't it be easier to send in Documents and have the
> Queen index them in the same index?  Maybe you need those individual,
> smaller indices to be separate....
>
> How do you deal with the possibility of the same Document being
> present in multiple indices?
>
> Otis
>
>
> --- Paul Smith <psmith@aconex.com> wrote:
>
>
>> I had a crack at whipping up something along these lines during a
>> 1-day hackathon we held here at work, using ActiveMQ as the bus
>> between the 'co-ordinator' (queen bee) and the 'worker' bees.  The
>> index work was segmented as jobs on a work queue, and the workers
>> fed the relatively small index chunks back along a result queue,
>> which the co-ordinator then merged in.
>>
>> The tough part, from observing the outcome, is knowing what the
>> chunk size should be, because in the end the co-ordinator needs to
>> merge all the sub-indexes together into one, and for a large index
>> that's not an insignificant time.  You also have to do some
>> bookkeeping to work out if a 'job' has not been completed in time
>> (maybe a failure by the worker) and decide whether the job should
>> be resubmitted (in theory JMS with transactions would help there,
>> but then you have a throughput problem on that too).
>>
>> Would love to see something like this work really well, and perhaps
>> generalize it a bit more.  I do like the simplicity of the SEDA
>> principles.
>>
>> cheers,
>>
>> Paul Smith
>>
>>
>>
>>
>> On 14/07/2005, at 11:50 PM, Peter Gelderbloem wrote:
>>
>>
>>> I am currently looking into building a similar system and came
>>> across this architecture:
>>> http://www.eecs.harvard.edu/~mdw/proj/seda/
>>>
>>> I am just reading up on it now. Does anyone have experience
>>> building a Lucene system based on this architecture? Any advice
>>> would be greatly appreciated.
>>>
>>> Peter Gelderbloem
>>>
>>>    Registered in England 3186704
>>> -----Original Message-----
>>> From: Luke Francl [mailto:luke.francl@stellent.com]
>>> Sent: 13 May 2005 22:04
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Best Practices for Distributing Lucene Indexing and
>>> Searching
>>>
>>> On Tue, 2005-03-01 at 19:23, Chris Hostetter wrote:
>>>
>>>> I don't really consider reading/writing to an NFS-mounted
>>>> FSDirectory to be viable for the very reasons you listed; but I
>>>> haven't really found any evidence of problems if you take the
>>>> approach that a single "writer" node indexes to local disk, which
>>>> is NFS-mounted by all of your other nodes for doing queries.
>>>> Concurrent updates/queries may still not be safe (I'm not sure),
>>>> but you could have the writer node "clone" the entire index into
>>>> a new directory, apply the updates, and then signal the other
>>>> nodes to stop using the old FSDirectory and start using the new
>>>> one.
>>>
>>> Thanks to everyone who contributed advice to my question about how
>>> to distribute a Lucene index across a cluster.
>>>
>>> I'm about to start on the implementation and I wanted to clarify
>>> something about using NFS that Chris wrote about above.
>>>
>>> There are many warnings about indexing on an NFS file system, but
>>> is it safe to have a single node index, while the other nodes use
>>> the file system in read-only mode?
>>>
>>> On a related note, our software is cross-platform and needs to
>>> work on Windows as well. Are there any known problems with having
>>> a read-only index shared over SMB?
>>>
>>> Using a shared file system is preferable to me because it's
>>> easier, but if it's necessary I will write the code to copy the
>>> index to each node.
>>>
>>> Thanks,
>>> Luke Francl
>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

