lucene-dev mailing list archives

From William Mayor <m...@williammayor.co.uk>
Subject Re: Distributed Indexing
Date Tue, 01 Feb 2011 00:26:59 GMT
Hi Guys

I've had a go at creating the ShardDistributionPolicy interface and a
few implementations. I've created a patch
(https://issues.apache.org/jira/browse/SOLR-2341); let me know what
needs doing.

Currently I assume that the documents passed to the policy will be
represented by some kind of identifier, so that the policy only needs
to match the ID with a shard. I think this is better than reading the
document from the POST and working out some kind of unique identifier
from it.
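
To illustrate the idea, here is a minimal sketch of what an ID-based
policy and the bucketing step described in the quoted message below
could look like. This is not the code from the SOLR-2341 patch: the
method name shardFor, the class names ModShardPolicy,
RoundRobinShardPolicy and ShardBuckets, and the choice to hash a string
ID before taking the modulus are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical shape of the interface; the real patch may differ.
interface ShardDistributionPolicy {
    // Returns a 0-based shard index in the range [0, numShards).
    int shardFor(String docId, int numShards);
}

// "Mod of the doc ID": assumes a string ID, so it is hashed first.
class ModShardPolicy implements ShardDistributionPolicy {
    public int shardFor(String docId, int numShards) {
        // floorMod keeps the result non-negative for negative hash codes.
        return Math.floorMod(docId.hashCode(), numShards);
    }
}

// Round robin: ignores the ID and cycles through the shards in order.
class RoundRobinShardPolicy implements ShardDistributionPolicy {
    private final AtomicInteger next = new AtomicInteger();

    public int shardFor(String docId, int numShards) {
        return Math.floorMod(next.getAndIncrement(), numShards);
    }
}

// Groups document IDs into one bucket per shard, using the given policy.
class ShardBuckets {
    static Map<Integer, List<String>> bucket(
            List<String> docIds, int numShards, ShardDistributionPolicy policy) {
        Map<Integer, List<String>> buckets = new LinkedHashMap<>();
        for (String id : docIds) {
            int shard = policy.shardFor(id, numShards);
            buckets.computeIfAbsent(shard, k -> new ArrayList<>()).add(id);
        }
        return buckets;
    }
}
```

Each bucket could then be sent to its shard as an ordinary update
request. Note that the mod policy always sends the same ID to the same
shard, whereas round robin does not, which matters if a document is
later updated or deleted by ID.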

A question we've had about this is who decides which policy to use, and
where they specify it. I'm inclined to think that the user (the person
POSTing data) does not mind which policy is used, but the administrator
might. This leads me to think that the policy should be set in the
Solr config file. My colleagues disagree that the user will not mind
and would rather see the policy specified in the URL. We've noticed
that request handlers can be specified in both, so should we adopt this
idea instead (as a kind of compromise :) )?
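
As a rough sketch of that compromise, the policy named in the request
could override a default taken from the config file. Nothing here is an
existing Solr API: the parameter name "shard.policy", the class
PolicyNameResolver and the plain Map standing in for the request
parameters are all hypothetical.

```java
import java.util.Map;

class PolicyNameResolver {
    // Hypothetical request parameter name, chosen only for this example.
    static final String POLICY_PARAM = "shard.policy";

    // Returns the policy name from the request if one was given,
    // otherwise the default configured by the administrator.
    static String resolve(Map<String, String> requestParams, String configuredDefault) {
        String requested = requestParams.get(POLICY_PARAM);
        if (requested == null || requested.trim().isEmpty()) {
            return configuredDefault;
        }
        return requested.trim();
    }
}
```

With this arrangement the administrator's choice applies unless the
person POSTing the data explicitly asks for something else.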

All the best

William

On Sat, Jan 29, 2011 at 11:56 PM, Upayavira <uv@odoko.co.uk> wrote:
> Lance,
>
> Firstly, we're proposing a ShardDistributionPolicy interface for which
> there is a default (mod of the doc ID) but other implementations are
> possible. Another easy implementation would be a randomised or round
> robin one.
>
> As to threading, the first task would be to put all of the source
> documents into "buckets", one bucket per shard, using the above
> ShardDistributionPolicy to assign documents to buckets/shards. Then all
> of the documents in a "bucket" could be sent to the relevant shard for
> indexing (which would be nothing more than a normal HTTP post (or solrj
> call?)).
>
> As to whether this would be single threaded or multithreaded, I would
> guess we would aim to do it the same as the distributed search code
> (which I have not yet reviewed). However, it could presumably be
> single-threaded, but use asynchronous HTTP.
>
> Regards, Upayavira
>
> On Sat, 29 Jan 2011 15:09 -0800, "Lance Norskog" <goksron@gmail.com>
> wrote:
>> I would suggest that a DistributedRequestUpdateHandler run
>> single-threaded, doing only one document at a time. If I want more
>> than one, I run it twice or N times with my own program.
>>
>> Also, this should have a policy object which decides exactly how
>> documents are distributed. There are different techniques for
>> different use cases.
>>
>> Lance
>>
>> On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood <soheb.lucene@gmail.com>
>> wrote:
>> > Hello Yonik,
>> >
>> > On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
>> >> Making it easy for clients I think is key... one should be able to
>> >> update any node in the solr cluster and have solr take care of the
>> >> hard part about updating all relevant shards.  This will most likely
>> >> involve an update processor.  This approach allows all existing update
>> >> methods (including things like CSV file upload) to still work
>> >> correctly.
>> >>
>> >> Also post.jar is really just for testing... a command-line replacement
>> >> for "curl" for those who may not have it.  It's not really a
>> >> recommended way for updating Solr servers in production.
>> >
>> > OK, I've abandoned the post.jar tool idea in favour of a
>> > DistributedUpdateRequestProcessor class (I've been looking into other
>> > classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
>> > SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
>> > are used/what data they store - hence why I've taken some time to
>> > respond).
>> >
>> > My big question now is that is it necessary to have a Factory class for
>> > DistributedUpdateRequestProcessor? I've seen this lots of times, as in
>> > RunUpdateProcessorFactory (where the factory class was only a few lines
>> > of code) to SignatureUpdateProcessorFactory? At first I was thinking it
>> > would be a good design idea to include it in (in a generic sense), but
>> > then I thought harder and I thought that the
>> > DistributedUpdateRequestHandler would only be running once, taking in all
>> > the requests, so it seems sort of pointless to write one in.
>> >
>> > That is my "burning" question for now. I have got a few more questions,
>> > but I'm sure that when I look further into the code, I'll either have
>> > more or all of my questions are answered.
>> >
>> > Many thanks!
>> >
>> > Soheb Mahmood
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>>
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>
>
>


