From: Avlesh Singh
To: solr-dev@lucene.apache.org, noble.paul@gmail.com
Date: Sun, 2 Aug 2009 20:39:41 +0530
Subject: Re: Queries regarding a "ParallelDataImportHandler"
In-Reply-To: <5e76b0ad0908020756u2f4cb1ebkc5d347ef86af40c8@mail.gmail.com>

> run the add() calls to Solr in a dedicated thread

Makes absolute sense. This would mean that DIH sits on top of all the
add/update operations, making it easier to implement a multi-threaded DIH.

I will create a JIRA issue right away. However, I would still love to see
responses to the problems I am facing due to limitations in 1.3.

Cheers
Avlesh

2009/8/2 Noble Paul നോബിള്‍ नोब्ळ्

> A multithreaded DIH is on my top-priority list. There are multiple
> approaches:
>
> 1) create multiple dataImporter instances in the same DIH instance,
> run them in parallel, and commit when all of them are done
> 2) run the add() calls to Solr in a dedicated thread
> 3) make DIH automatically multithreaded. This is much harder to implement.
>
> But #1 and #2 can be implemented with ease. It does not have to
> be another implementation called ParallelDataImportHandler. I believe
> it can be done in DIH itself.
>
> You may not need to create a project in Google Code.
> You can open a JIRA issue and start posting patches, and we can put
> it back into Solr.
>
> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh wrote:
> > In my quest to improve indexing time (in a multi-core environment), I
> > tried writing a Solr RequestHandler called ParallelDataImportHandler.
> > I had a few lame questions to begin with, which Noble and Shalin
> > answered here:
> > http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
> >
> > As the name suggests, the handler, when invoked, tries to execute
> > multiple DIH instances on the same core in parallel. Of course, the
> > catch is that only data sources that can be batched benefit from this
> > handler. In my case, I am writing this for import from a MySQL
> > database, so I have a single data-config.xml in which the query has
> > placeholders for "limit" and "offset". Each DIH instance uses the same
> > data-config file and substitutes its own values for the limit and
> > offset (which are supplied by the parent ParallelDataImportHandler).
> >
> > I am achieving this by making my handler SolrCoreAware and creating
> > maxNumberOfDIHInstances (configurable) DIH instances in the inform
> > method. These instances are then initialized and registered with the
> > core. Whenever a request comes in, the ParallelDataImportHandler
> > delegates the task to these instances, schedules the remainder, and
> > aggregates the responses from each instance to return to the user.
> >
> > Thankfully, all of this worked, and preliminary benchmarking with
> > 5 million records indicated a 50% decrease in re-indexing time.
> > Moreover, all my cores (Solr in my case is hosted on a quad-core
> > machine) showed above 70% CPU utilization. All that I could have
> > asked for!
> >
> > With respect to this whole thing, I have a few questions:
> >
> > 1. Is something similar available out of the box?
> > 2. Is the idea flawed? Is the approach fundamentally correct?
> > 3. I am using Solr 1.3. DIH did not have "EventListeners" in the
> > stone age. I need to know when a DIH instance is done with its task
> > (mostly the "commit" operation), and I could not figure out a clean
> > way. As a hack, I keep pinging the DIH instances with command=status
> > at regular intervals (in a separate thread) to find out whether an
> > instance is free to be assigned some task. This works, but obviously
> > with the overhead of wasted CPU cycles. Is there a better approach?
> > 4. I could improve the time taken even further if there were a way
> > for me to tell a DIH instance not to open a new IndexSearcher. In the
> > current scheme of things, as soon as one DIH instance is done
> > committing, a new searcher is opened. This blocks the other DIH
> > instances (which were active), and they cannot continue until the
> > searcher is initialized. Is there a way I can issue a single commit
> > once all these DIH instances are done with their tasks? I tried
> > running each DIH instance with commit=false, without luck.
> > 5. Can this implementation be extended to support the other data
> > sources supported in DIH (HTTP, File, URL etc.)?
> > 6. If the utility is worth it, can I host this on Google Code as an
> > open source contrib?
> >
> > Any help will be deeply acknowledged and appreciated. While
> > suggesting, please don't forget that I am using Solr 1.3. If it all
> > goes well, I don't mind writing one for Solr 1.4.
> >
> > Cheers
> > Avlesh
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
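
A minimal sketch of approach #1 from the thread (run several importers in
parallel and commit once when all of them are done). It also touches on
questions 3 and 4: waiting on the worker tasks replaces the command=status
polling, and the commit happens exactly once at the end. This is not DIH or
Solr code. BatchImporter, Batch, partition and runAll are hypothetical names
used only for illustration; the sketch assumes nothing beyond
java.util.concurrent from the JDK, and the lambdas need Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class ParallelImportSketch {

    /** Hypothetical stand-in for one DIH instance: imports rows [offset, offset + limit). */
    interface BatchImporter {
        long importBatch(long offset, long limit) throws Exception;
    }

    /** One (offset, limit) pair, like the placeholders in the shared query. */
    static final class Batch {
        final long offset;
        final long limit;

        Batch(long offset, long limit) {
            this.offset = offset;
            this.limit = limit;
        }
    }

    /** Splits totalRows into consecutive batches of at most batchSize rows. */
    static List<Batch> partition(long totalRows, long batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Batch> batches = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += batchSize) {
            batches.add(new Batch(offset, Math.min(batchSize, totalRows - offset)));
        }
        return batches;
    }

    /** Runs every batch on a fixed pool, then invokes the commit hook exactly once. */
    static long runAll(List<Batch> batches, int workers,
                       BatchImporter importer, Runnable commit) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (Batch b : batches) {
                tasks.add(() -> importer.importBatch(b.offset, b.limit));
            }
            long total = 0;
            // invokeAll blocks until every task has finished, so no status polling is needed.
            for (Future<Long> f : pool.invokeAll(tasks)) {
                total += f.get(); // a failed batch surfaces here as ExecutionException
            }
            commit.run(); // reached only if all batches succeeded
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger commits = new AtomicInteger();
        long imported = runAll(partition(5_000_000L, 250_000L), 4,
                (offset, limit) -> limit,   // stub: pretend every row in the batch was imported
                commits::incrementAndGet);  // stub: count commits instead of calling Solr
        System.out.println(imported + " rows, " + commits.get() + " commit(s)");
    }
}
```

With the stub importer in main, the 5 million rows are split into 20 batches
of 250,000, processed on 4 threads, and the commit hook is called once. In a
real handler the importer would run the batched query and the commit hook
would issue the single Solr commit; both are left as assumptions here.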