From: William Mayor
To: dev@lucene.apache.org
Subject: Re: Distributed Indexing
Date: Tue, 1 Feb 2011 13:54:50 +0000

Hello

Thanks for your prompt reply.

Regarding using a SolrDocument instead of Strings (and I agree that
List doesn't seem to be the best way of going), how do I get a reference
to a SolrDoc? As far as I can see, I have access to a List that
represents all of the files being POSTed. Do I want to open these
streams, get the info and then stream them out? This seems wasteful. I
had instead thought that the DistributedUpdateRequestHandler would take
this List, create some kind of mapping between each stream and a unique
id, and then pass the ids to the policy.

Thanks for your help

Billy

On Tue, Feb 1, 2011 at 11:27 AM, Upayavira wrote:
> On Tue, 01 Feb 2011 00:26 +0000, "William Mayor"
> wrote:
>> Hi Guys
>>
>> I've had a go at creating the ShardDistributionPolicy interface and a
>> few implementations. I've created a patch
>> (https://issues.apache.org/jira/browse/SOLR-2341); let me know what
>> needs doing.
>
>
>> Currently I assume that the documents passed to the policy will be
>> represented by some kind of identifier and that one needs only to
>> match the ID with a shard. This is better (I think) than reading the
>> document from the POST and figuring out some kind of unique
>> identifier?
>
> Your code looks fine to me, except it should take in a SolrDocument
> object, or a list of them, rather than strings. Then, for your Hash
> version, you can take a hash of the "id" field.
>
>> A question we've had about this is who decides what policy to use and
>> where do they specify it? I'm inclined to think that the user (the
>> person POSTing data) does not mind what policy is used but the
>> administrator might. This leads me to think that the policy should be
>> set in the solr config file? My colleagues disagree that the user will
>> not mind and would rather see the policy be specified in the url.
>> We've noticed that request handlers can be specified in both, so should
>> we adopt this idea instead (and as a kind of compromise :) ).
>
> To stick with Solr conventions, you would specify the
> ShardDistributionPolicy in the solrconfig.xml, within the configuration
> of your DistributedUpdateRequestHandler, so in that sense, it is hidden
> from your users and managed by the administrator.
>
> However, if you follow this approach, an administrator could expose
> multiple policies by having multiple DistributedUpdateRequestHandler
> definitions in solrconfig.xml, with different URLs.
>
> To give you an example, but for search rather than indexing:
>
>    default="true">
>
>       dismax
>
>
> This will configure requests to http://localhost:8983/solr/dismax?q=blah
> to be handled by the dismax query parser.
>
> More relevant to you:
>
>    default="true">
>
>       name="shards">http://localhost:8983/solr,http://localhost:7983/solr
>
>
> This would, by default, distribute all queries to
> http://localhost:8983/solr/distrib?q=blah across two Solr instances at
> the URLs described.
>
> For now, I'd say see if you can add a
> distributionPolicyClass="org.apache.solr.blah" to define the class that
> this updateRequestHandler is going to use.
>
> To everyone else who got this far - please chip in if you see better
> ways of doing this.
>
> Upayavira
>
>> All the best
>>
>> William
>>
>> On Sat, Jan 29, 2011 at 11:56 PM, Upayavira wrote:
>> > Lance,
>> >
>> > Firstly, we're proposing a ShardDistributionPolicy interface for which
>> > there is a default (mod of the doc ID) but other implementations are
>> > possible. Another easy implementation would be a randomised or round
>> > robin one.
>> >
>> > As to threading, the first task would be to put all of the source
>> > documents into "buckets", one bucket per shard, using the above
>> > ShardDistributionPolicy to assign documents to buckets/shards. Then all
>> > of the documents in a "bucket" could be sent to the relevant shard for
>> > indexing (which would be nothing more than a normal HTTP post (or solrj
>> > call?)).
>> >
>> > As to whether this would be single threaded or multithreaded, I would
>> > guess we would aim to do it the same as the distributed search code
>> > (which I have not yet reviewed). However, it could presumably be
>> > single-threaded, but use asynchronous HTTP.
>> >
>> > Regards, Upayavira
>> >
>> > On Sat, 29 Jan 2011 15:09 -0800, "Lance Norskog"
>> > wrote:
>> >> I would suggest that a DistributedRequestUpdateHandler run
>> >> single-threaded, doing only one document at a time. If I want more
>> >> than one, I run it twice or N times with my own program.
>> >>
>> >> Also, this should have a policy object which decides exactly how
>> >> documents are distributed. There are different techniques for
>> >> different use cases.
>> >>
>> >> Lance
>> >>
>> >> On Sat, Jan 29, 2011 at 12:34 PM, Soheb Mahmood
>> >> wrote:
>> >> > Hello Yonik,
>> >> >
>> >> > On Thu, 2011-01-27 at 08:01 -0500, Yonik Seeley wrote:
>> >> >> Making it easy for clients I think is key... one should be able to
>> >> >> update any node in the solr cluster and have solr take care of the
>> >> >> hard part about updating all relevant shards.  This will most likely
>> >> >> involve an update processor.  This approach allows all existing update
>> >> >> methods (including things like CSV file upload) to still work
>> >> >> correctly.
>> >> >>
>> >> >> Also post.jar is really just for testing... a command-line replacement
>> >> >> for "curl" for those who may not have it.  It's not really a
>> >> >> recommended way for updating Solr servers in production.
>> >> >
>> >> > OK, I've abandoned the post.jar tool idea in favour of a
>> >> > DistributedUpdateRequestProcessor class (I've been looking into other
>> >> > classes like UpdateRequestProcessor, RunUpdateRequestProcessor,
>> >> > SignatureUpdateProcessorFactory, and SolrQueryRequest to see how they
>> >> > are used/what data they store - hence why I've taken some time to
>> >> > respond).
>> >> >
>> >> > My big question now is: is it necessary to have a Factory class for
>> >> > DistributedUpdateRequestProcessor? I've seen this lots of times, from
>> >> > RunUpdateProcessorFactory (where the factory class was only a few lines
>> >> > of code) to SignatureUpdateProcessorFactory. At first I was thinking it
>> >> > would be a good design idea to include one (in a generic sense), but
>> >> > then I thought harder and realised that the
>> >> > DistributedUpdateRequestHandler would only be running once, taking in
>> >> > all the requests, so it seems sort of pointless to write one.
>> >> >
>> >> > That is my "burning" question for now. I have got a few more questions,
>> >> > but I'm sure that when I look further into the code, I'll either have
>> >> > more or all of my questions will be answered.
>> >> >
>> >> > Many thanks!
>> >> >
>> >> > Soheb Mahmood
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: dev-help@lucene.apache.org
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Lance Norskog
>> >> goksron@gmail.com
>> >>
>> > ---
>> > Enterprise Search Consultant at Sourcesense UK,
>> > Making Sense of Open Source
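
For readers following the thread, here is a minimal sketch of the ideas discussed above: a policy that maps a document to a shard (hash of the "id" field, or round robin), and a step that sorts documents into one bucket per shard before they are sent on. It is not taken from the SOLR-2341 patch. The method name shardFor, the classes HashShardPolicy, RoundRobinShardPolicy and ShardBuckets, and the use of a plain Map as a stand-in for a Solr document are all assumptions made for illustration; only the ShardDistributionPolicy name and the shard URLs come from the thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical shape of the policy: pick a shard index in [0, numShards). */
interface ShardDistributionPolicy {
    int shardFor(Map<String, Object> doc, int numShards);
}

/** Hash of the "id" field, modulo the number of shards. */
class HashShardPolicy implements ShardDistributionPolicy {
    @Override
    public int shardFor(Map<String, Object> doc, int numShards) {
        Object id = doc.get("id");
        if (id == null) {
            throw new IllegalArgumentException("document has no \"id\" field");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(id.hashCode(), numShards);
    }
}

/** Round robin: ignores the document and cycles through the shards. */
class RoundRobinShardPolicy implements ShardDistributionPolicy {
    private int next = 0;

    @Override
    public synchronized int shardFor(Map<String, Object> doc, int numShards) {
        int shard = next % numShards;
        next = (shard + 1) % numShards;
        return shard;
    }
}

/** Groups documents into one bucket per shard URL using the given policy. */
class ShardBuckets {
    static Map<String, List<Map<String, Object>>> bucket(
            List<Map<String, Object>> docs,
            List<String> shardUrls,
            ShardDistributionPolicy policy) {
        // LinkedHashMap keeps the buckets in the same order as shardUrls.
        Map<String, List<Map<String, Object>>> buckets = new LinkedHashMap<>();
        for (String url : shardUrls) {
            buckets.put(url, new ArrayList<>());
        }
        for (Map<String, Object> doc : docs) {
            int shard = policy.shardFor(doc, shardUrls.size());
            buckets.get(shardUrls.get(shard)).add(doc);
        }
        return buckets;
    }
}
```

Each bucket could then be posted to its shard in one request. Keeping the policy behind an interface is what would let an administrator choose the implementation in solrconfig.xml, as suggested in the thread.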