lucene-java-user mailing list archives

From Paul Smith <psm...@aconex.com>
Subject Re: Best Practices for Distributing Lucene Indexing and Searching
Date Fri, 15 Jul 2005 02:04:29 GMT
answering my own question: nutch.org -> lucene.apache.org/nutch/

Excellent!

Paul

On 15/07/2005, at 11:45 AM, Paul Smith wrote:

> Cooool, I should go have a look at that. That begs another question
> though: where does Nutch stand in terms of the ASF?  Did I read (or
> dream) that Nutch may be coming in under the ASF?  I guess I should
> get myself subscribed to the Nutch mailing lists.
>
> thanks Erik.
>
> Paul
>
> On 15/07/2005, at 11:36 AM, Erik Hatcher wrote:
>
>
>> Paul - it sounds an awful lot like my (perhaps incorrect)
>> understanding of the MapReduce capability of Nutch that is under
>> development.  Perhaps the work that Doug and others have done there
>> is applicable to your situation.
>>
>>     Erik
>>
>>
>> On Jul 14, 2005, at 7:38 PM, Paul Smith wrote:
>>
>>
>>
>>> My punt was that having workers create sub-indexes (creating the
>>> documents and making a partial index) and shipping the partial
>>> index back to the queen to merge may be more efficient.  It's
>>> probably not; I was just using the day as a chance to see if it
>>> looked promising, and to get my hands dirty with SEDA (and ActiveMQ,
>>> a great JMS provider).  The bit I forgot about until too late was
>>> that the merge is going to have to be 'serialized' by the queen;
>>> there's no real way around that that I could think of.  Worse,
>>> because the .merge(Directory[]) method does an automatic optimize,
>>> I had to get the Queen to temporarily store the partial indexes
>>> locally until it had all the results before doing a merge, otherwise
>>> the queen would be spending _forever_ doing merges.
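>>>
>>> (Roughly, that collect-then-merge step could look like the sketch
>>> below, using IndexWriter.addIndexes(Directory[]) -- the merge call
>>> that optimizes automatically -- with hypothetical paths and method
>>> names; a sketch only, not the actual hackathon code.)
>>>
>>>   import java.io.IOException;
>>>   import java.util.List;
>>>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>   import org.apache.lucene.index.IndexWriter;
>>>   import org.apache.lucene.store.Directory;
>>>   import org.apache.lucene.store.FSDirectory;
>>>
>>>   // Queen side: once every worker result has been staged on local
>>>   // disk, open the master index and merge all the partials at once.
>>>   public static void mergePartials(String masterPath, List stagedPaths)
>>>           throws IOException {
>>>       IndexWriter queen =
>>>           new IndexWriter(masterPath, new StandardAnalyzer(), true);
>>>       Directory[] partials = new Directory[stagedPaths.size()];
>>>       for (int i = 0; i < stagedPaths.size(); i++) {
>>>           partials[i] = FSDirectory.getDirectory(
>>>                   (String) stagedPaths.get(i), false);
>>>       }
>>>       // addIndexes(Directory[]) optimizes as part of the merge, so
>>>       // one big call is far cheaper than a merge per worker result.
>>>       queen.addIndexes(partials);
>>>       queen.close();
>>>   }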
>>>
>>> If you had the workers just creating Documents, I'm not sure the
>>> gain would outweigh the overhead of the SEDA network/JMS transport.
>>> The document doesn't really get inverted until it's sent to the
>>> Writer, does it?  In this example the Queen would be doing 99% of
>>> the work, wouldn't it? (defeating the purpose of SEDA)
>>>
>>> What I was aiming at was an easily growable, self-healing indexing
>>> network.  Need more capacity?  Throw in a Mac-Mini or something, and
>>> let it join as a worker via Zeroconf.  Worker dies?  Work just gets
>>> done somewhere else.  A true Google-like model.
>>>
>>> But I don't think I quite got there.
>>>
>>> Re: document in multiple indexes.  The queen was the co-ordinator,
>>> so its first task was to work out what entities to index (everything
>>> in our app can be referenced via a URI, e.g. mail://2358 or
>>> doc://9375).  The queen works out all the URIs to index, batches
>>> them up into a work unit and broadcasts that as a work message on
>>> the work queue.  The worker takes that package, creates a relatively
>>> small index (5000-10000 documents, whatever the chunk size is), and
>>> ships the directory as a zipped-up binary message back onto the
>>> Result queue.  Therefore there were never any duplicates.  A worker
>>> either processed the work unit, or it didn't.
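>>>
>>> (The zip-and-ship step on the worker could look something like the
>>> sketch below -- hypothetical names, and it assumes the plain JMS 1.1
>>> BytesMessage/MessageProducer API that ActiveMQ provides.)
>>>
>>>   import java.io.ByteArrayOutputStream;
>>>   import java.io.File;
>>>   import java.io.FileInputStream;
>>>   import java.io.IOException;
>>>   import java.io.InputStream;
>>>   import java.util.zip.ZipEntry;
>>>   import java.util.zip.ZipOutputStream;
>>>   import javax.jms.BytesMessage;
>>>   import javax.jms.JMSException;
>>>   import javax.jms.MessageProducer;
>>>   import javax.jms.Session;
>>>
>>>   // Worker side: zip the small on-disk index built for one work
>>>   // unit and send it to the Queen as a single binary message.
>>>   public static void shipPartialIndex(File indexDir, Session session,
>>>           MessageProducer resultQueue) throws IOException, JMSException {
>>>       ByteArrayOutputStream buf = new ByteArrayOutputStream();
>>>       ZipOutputStream zip = new ZipOutputStream(buf);
>>>       File[] files = indexDir.listFiles();
>>>       byte[] chunk = new byte[8192];
>>>       for (int i = 0; i < files.length; i++) {
>>>           zip.putNextEntry(new ZipEntry(files[i].getName()));
>>>           InputStream in = new FileInputStream(files[i]);
>>>           int n;
>>>           while ((n = in.read(chunk)) != -1) {
>>>               zip.write(chunk, 0, n);
>>>           }
>>>           in.close();
>>>           zip.closeEntry();
>>>       }
>>>       zip.close();
>>>
>>>       BytesMessage msg = session.createBytesMessage();
>>>       msg.writeBytes(buf.toByteArray());
>>>       resultQueue.send(msg);
>>>   }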
>>>
>>> Paul
>>>
>>> On 15/07/2005, at 9:28 AM, Otis Gospodnetic wrote:
>>>
>>>
>>>
>>>
>>>> Interesting.  I'm planning on doing something similar for some new
>>>> Simpy features.  Why are your worker bees sending whole indices to
>>>> the Queen bee?  Wouldn't it be easier to send in Documents and have
>>>> the Queen index them in the same index?  Maybe you need those
>>>> individual, smaller indices to be separate....
>>>>
>>>> How do you deal with the possibility of the same Document being
>>>> present in multiple indices?
>>>>
>>>> Otis
>>>>
>>>>
>>>> --- Paul Smith <psmith@aconex.com> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> I had a crack at whipping up something along this lines during a 1
>>>>> day hackathon we held here at work, using ActiveMQ as the bus  
>>>>> between
>>>>>
>>>>> the 'co-ordinator' (Queen bee) and the 'worker" bees.  The  
>>>>> index work
>>>>>
>>>>> was segmented as jobs on a work queue, and the workers feed the
>>>>> relatively smal index chunks back along a result queue, which  
>>>>> the co-
>>>>>
>>>>> ordinator then merged in.
>>>>>
>>>>> The tough part from observing the outcome is knowing what the  
>>>>> chunk
>>>>> size should be, because in the end the co-ordinator needs to merge
>>>>> all the sub-indexes together into 1 and for a large index  
>>>>> that's not
>>>>>
>>>>> an insignificant time.  You also have to use bookkeeping to  
>>>>> work out
>>>>>
>>>>> if a 'job' has not been completed in time (maybe failure by the
>>>>> worker) and decide whether the job should be resubmitted (in  
>>>>> theory
>>>>> JMS with transactions would help there, but then you have a
>>>>> throughput problem on that too).
>>>>>
>>>>> Would love to see something like this work really well, and  
>>>>> perhaps
>>>>> generalize it a bit more.  I do like the simplicity of the SEDA
>>>>> principles.
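>>>>>
>>>>> (The bookkeeping in question could be as simple as the sketch
>>>>> below: the co-ordinator remembers when each work unit was sent and
>>>>> periodically re-queues anything that has been quiet for too long.
>>>>> Hypothetical class and names, not what was actually built.)
>>>>>
>>>>>   import java.util.ArrayList;
>>>>>   import java.util.Collections;
>>>>>   import java.util.HashMap;
>>>>>   import java.util.Iterator;
>>>>>   import java.util.List;
>>>>>   import java.util.Map;
>>>>>
>>>>>   // Co-ordinator-side bookkeeping for detecting lost work units.
>>>>>   public class OutstandingWork {
>>>>>       private final Map sentAt = Collections.synchronizedMap(new HashMap());
>>>>>       private final long timeoutMillis;
>>>>>
>>>>>       public OutstandingWork(long timeoutMillis) {
>>>>>           this.timeoutMillis = timeoutMillis;
>>>>>       }
>>>>>
>>>>>       public void markSent(String workUnitId) {
>>>>>           sentAt.put(workUnitId, new Long(System.currentTimeMillis()));
>>>>>       }
>>>>>
>>>>>       public void markDone(String workUnitId) {
>>>>>           sentAt.remove(workUnitId);
>>>>>       }
>>>>>
>>>>>       // Work units that have timed out and should be resubmitted.
>>>>>       public List expired() {
>>>>>           List overdue = new ArrayList();
>>>>>           long now = System.currentTimeMillis();
>>>>>           synchronized (sentAt) {
>>>>>               for (Iterator it = sentAt.entrySet().iterator(); it.hasNext();) {
>>>>>                   Map.Entry e = (Map.Entry) it.next();
>>>>>                   if (now - ((Long) e.getValue()).longValue() > timeoutMillis) {
>>>>>                       overdue.add(e.getKey());
>>>>>                   }
>>>>>               }
>>>>>           }
>>>>>           return overdue;
>>>>>       }
>>>>>   }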
>>>>>
>>>>> cheers,
>>>>>
>>>>> Paul Smith
>>>>>
>>>>>
>>>>> On 14/07/2005, at 11:50 PM, Peter Gelderbloem wrote:
>>>>>
>>>>>> I am currently looking into building a similar system and came
>>>>>> across this architecture:
>>>>>> http://www.eecs.harvard.edu/~mdw/proj/seda/
>>>>>>
>>>>>> I am just reading up on it now. Does anyone have experience
>>>>>> building a Lucene system based on this architecture? Any advice
>>>>>> would be greatly appreciated.
>>>>>>
>>>>>> Peter Gelderbloem
>>>>>>
>>>>>>    Registered in England 3186704
>>>>>> -----Original Message-----
>>>>>> From: Luke Francl [mailto:luke.francl@stellent.com]
>>>>>> Sent: 13 May 2005 22:04
>>>>>> To: java-user@lucene.apache.org
>>>>>> Subject: Re: Best Practices for Distributing Lucene Indexing and
>>>>>> Searching
>>>>>>
>>>>>> On Tue, 2005-03-01 at 19:23, Chris Hostetter wrote:
>>>>>>
>>>>>>> I don't really consider reading/writing to an NFS mounted
>>>>>>> FSDirectory to be viable for the very reasons you listed; but I
>>>>>>> haven't really found any evidence of problems if you take the
>>>>>>> approach that a single "writer" node indexes to local disk,
>>>>>>> which is NFS mounted by all of your other nodes for doing
>>>>>>> queries.  Concurrent updates/queries may still not be safe (I'm
>>>>>>> not sure) but you could have the writer node "clone" the entire
>>>>>>> index into a new directory, apply the updates and then signal
>>>>>>> the other nodes to stop using the old FSDirectory and start
>>>>>>> using the new one.
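>>>>>>
>>>>>> (A minimal sketch of what the "switch to the new index" half of
>>>>>> that could look like on a query node -- a hypothetical class, and
>>>>>> it glosses over queries still in flight against the old searcher:)
>>>>>>
>>>>>>   import java.io.IOException;
>>>>>>   import org.apache.lucene.search.IndexSearcher;
>>>>>>
>>>>>>   // One shared searcher per query node; swapped when the writer
>>>>>>   // node signals that a freshly updated clone is ready.
>>>>>>   public class SwitchableSearcher {
>>>>>>       private IndexSearcher current;
>>>>>>
>>>>>>       public SwitchableSearcher(String initialIndexPath)
>>>>>>               throws IOException {
>>>>>>           current = new IndexSearcher(initialIndexPath);
>>>>>>       }
>>>>>>
>>>>>>       public synchronized IndexSearcher get() {
>>>>>>           return current;
>>>>>>       }
>>>>>>
>>>>>>       public synchronized void switchTo(String newIndexPath)
>>>>>>               throws IOException {
>>>>>>           IndexSearcher old = current;
>>>>>>           current = new IndexSearcher(newIndexPath);
>>>>>>           old.close();  // a real version would wait for queries
>>>>>>                         // still running against the old searcher
>>>>>>       }
>>>>>>   }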
>>>>>>
>>>>>> Thanks to everyone who contributed advice to my question about
>>>>>> how to distribute a Lucene index across a cluster.
>>>>>>
>>>>>> I'm about to start on the implementation and I wanted to clarify
>>>>>> something about using NFS that Chris wrote about above.
>>>>>>
>>>>>> There are many warnings about indexing on an NFS file system, but
>>>>>> is it safe to have a single node index, while the other nodes use
>>>>>> the file system in read-only mode?
>>>>>>
>>>>>> On a related note, our software is cross-platform and needs to
>>>>>> work on Windows as well. Are there any known problems having a
>>>>>> read-only index shared over SMB?
>>>>>>
>>>>>> Using a shared file system is preferable to me because it's
>>>>>> easier, but if it's necessary I will write the code to copy the
>>>>>> index to each node.
>>>>>>
>>>>>> Thanks,
>>>>>> Luke Francl


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

