lucene-solr-dev mailing list archives

From Avlesh Singh <avl...@gmail.com>
Subject Re: Queries regarding a "ParallelDataImportHandler"
Date Mon, 03 Aug 2009 11:32:58 GMT
We are generally talking about two things here -

   1. Speed up indexing in general by creating separate thread(s) for
   writing to the index. SOLR-1089 should take care of this.
   2. Ability to split the DIH commands into batches that can be executed
   in parallel threads.

My initial proposal was #2.
I see #1 as an "internal" optimization in DIH which we should do anyway.
With #2, an end user can decide how to batch the process (e.g. in a JDBC
datasource, limit and offset parameters can be used by multiple DIH
instances), how many parallel threads should be created for writing, etc.

I am creating a JIRA issue for #2 and will add a more detailed description
with possible options.
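As a rough illustration of what #2 amounts to, here is a sketch in Python rather than Solr's actual Java; `make_slices`, `run_import`, and `parallel_import` are made-up names standing in for the batching an end user would configure:

```python
from concurrent.futures import ThreadPoolExecutor

def make_slices(total_rows, num_instances):
    """Split total_rows into (limit, offset) pairs, one per DIH instance."""
    batch = -(-total_rows // num_instances)  # ceiling division
    return [(min(batch, total_rows - off), off)
            for off in range(0, total_rows, batch)]

def run_import(limit, offset):
    # Hypothetical stand-in for issuing a full-import to one DIH instance,
    # e.g. /dataimport?command=full-import with limit/offset injected.
    return f"full-import limit={limit} offset={offset}"

def parallel_import(total_rows, num_instances):
    # Each slice goes to its own thread; results come back in slice order.
    slices = make_slices(total_rows, num_instances)
    with ThreadPoolExecutor(max_workers=num_instances) as pool:
        return list(pool.map(lambda s: run_import(*s), slices))
```

The per-slice count query and the final commit-once-all-are-done step discussed below are left out here; the sketch only shows how a total row count might be cut into independently importable slices.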

Cheers
Avlesh

2009/8/3 Noble Paul നോബിള്‍ नोब्ळ् <noble.paul@corp.aol.com>

> then there is SOLR-1089 which does writes to lucene in a new thread.
>
> 2009/8/2 Noble Paul നോബിള്‍  नोब्ळ् <noble.paul@corp.aol.com>:
> > On Sun, Aug 2, 2009 at 9:39 PM, Avlesh Singh<avlesh@gmail.com> wrote:
> >>> There can be a batch command (which) will take in multiple commands in
> one
> >>> http request.
> >>
> >> You seem to be obsessed with this approach, Noble. SOLR-1093 also
> >> echoes the same sentiments :)
> >> I personally find this approach a bit restrictive and difficult to
> >> adapt to. IMHO, it is better handled as a configuration, i.e. the user
> >> tells us how the single task can be "batched" (or "sliced", as you
> >> call it) while configuring the Parallel (or MultiThreaded) DIH inside
> >> solrconfig.
> > Agreed.
> >
> > I suggested this as a low-hanging fruit because the changes are less
> > invasive. I'm open to any other suggestion you can come up with.
> >
> >
> >>
> >> As an example, for non-JDBC data sources where batching might be
> >> difficult to achieve in an abstract way, the user might choose to
> >> configure different data-config.xml's (for different DIH instances)
> >> altogether.
> >>
> >> Cheers
> >> Avlesh
> >>
> >> 2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् <noble.paul@corp.aol.com>
> >>>
> >>> On Sun, Aug 2, 2009 at 8:56 PM, Avlesh Singh<avlesh@gmail.com> wrote:
> >>> > I have one more question w.r.t. the MultiThreaded DIH - what would
> >>> > be the logic behind distributing tasks to threads?
> >>> >
> >>> > I am sorry to have not mentioned this earlier - in my case, I take
> >>> > a "count query" parameter as a configuration element. Based on this
> >>> > count and the maxNumberOfDIHInstances, task assignment scheduling
> >>> > is done by "injecting" limit and offset values into the import
> >>> > query for each DIH instance. And this is one of the reasons why I
> >>> > call it a ParallelDataImportHandler.
> >>> There can be a batch command which will take in multiple commands in
> >>> one HTTP request. So it will be like invoking multiple DIH instances,
> >>> and the user will have to find ways to split up the whole task into
> >>> multiple "slices". DIH in turn would fire up multiple threads, and
> >>> once all the threads have returned it should issue a commit.
> >>>
> >>> This is a very dumb implementation, but it is a very easy path.
> >>> >
> >>> > Cheers
> >>> > Avlesh
> >>> >
> >>> > On Sun, Aug 2, 2009 at 8:39 PM, Avlesh Singh <avlesh@gmail.com>
> wrote:
> >>> >
> >>> >> run the add() calls to Solr in a dedicated thread
> >>> >>
> >>> >> Makes absolute sense. This would actually mean DIH sits on top of
> >>> >> all the add/update operations, making it easier to implement a
> >>> >> multi-threaded DIH.
> >>> >>
> >>> >> I would create a JIRA issue right away.
> >>> >> However, I would still love to see responses to my problems due to
> >>> >> limitations in 1.3.
> >>> >>
> >>> >> Cheers
> >>> >> Avlesh
> >>> >>
> >>> >> 2009/8/2 Noble Paul നോബിള്‍ नोब्ळ् <noble.paul@corp.aol.com>
> >>> >>
> >>> >>> a multithreaded DIH is on my top priority list. There are
> >>> >>> multiple approaches:
> >>> >>>
> >>> >>> 1) create multiple dataImporter instances in the same DIH
> >>> >>> instance, run them in parallel, and commit when all of them are
> >>> >>> done
> >>> >>> 2) run the add() calls to Solr in a dedicated thread
> >>> >>> 3) make DIH automatically multithreaded. This is much harder to
> >>> >>> implement.
> >>> >>>
> >>> >>> But #1 and #2 can be implemented with ease. It does not have to
> >>> >>> be another implementation called ParallelDataImportHandler. I
> >>> >>> believe it can be done in DIH itself.
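Approach #2 above (a dedicated thread for the add() calls) is essentially a producer/consumer hand-off. A minimal sketch, with made-up class names and a plain list standing in for the underlying Lucene index:

```python
import queue
import threading

SENTINEL = object()  # marks the end of the document stream

class DedicatedWriter:
    """Importer threads enqueue documents; one writer thread does all adds."""
    def __init__(self):
        self.q = queue.Queue()
        self.index = []  # stand-in for the Lucene index
        self.thread = threading.Thread(target=self._drain)
        self.thread.start()

    def _drain(self):
        # The only thread that ever touches the index.
        while True:
            doc = self.q.get()
            if doc is SENTINEL:
                break
            self.index.append(doc)

    def add(self, doc):
        self.q.put(doc)  # returns immediately; importer keeps fetching rows

    def commit(self):
        # Flush: wait until every queued add has been applied.
        self.q.put(SENTINEL)
        self.thread.join()
        return len(self.index)
```

The point of the design is that importers never block on index writes; they only pay the cost of a queue put.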
> >>> >>>
> >>> >>> You may not need to create a project in Google Code. You can
> >>> >>> open a JIRA issue and start posting patches, and we can put it
> >>> >>> back into Solr.
> >>> >>>
> >>> >>> On Sun, Aug 2, 2009 at 7:33 PM, Avlesh Singh<avlesh@gmail.com>
> wrote:
> >>> >>> > In my quest to improve indexing time (in a multi-core
> >>> >>> > environment), I tried writing a Solr RequestHandler called
> >>> >>> > ParallelDataImportHandler. I had a few lame questions to begin
> >>> >>> > with, which Noble and Shalin answered here -
> >>> >>> >
> >>> >>> > http://www.lucidimagination.com/search/document/22b7371c063fdb06/using_dih_for_parallel_indexing
> >>> >>> > As the name suggests, the handler, when invoked, tries to
> >>> >>> > execute multiple DIH instances on the same core in parallel. Of
> >>> >>> > course, the catch here is that only those data sources that can
> >>> >>> > be batched can benefit from this handler. In my case, I am
> >>> >>> > writing this for imports from a MySQL database. So, I have a
> >>> >>> > single data-config.xml, in which the query has to add
> >>> >>> > placeholders for "limit" and "offset". Each DIH instance uses
> >>> >>> > the same data-config file and replaces its own values for the
> >>> >>> > limit and offset (which are in fact supplied by the parent
> >>> >>> > ParallelDataImportHandler).
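Such a data-config.xml could look roughly like the sketch below. The table and field names are hypothetical, and the `${dataimporter.request.*}` placeholders are one way DIH can inject request parameters into the query, assuming the Solr version in use supports them:

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/db" user="user" password="pass"/>
  <document>
    <entity name="item"
            query="SELECT id, name FROM item
                   LIMIT ${dataimporter.request.limit}
                   OFFSET ${dataimporter.request.offset}">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>
</dataConfig>
```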
> >>> >>> >
> >>> >>> > I am achieving this by making my handler SolrCoreAware and
> >>> >>> > creating maxNumberOfDIHInstances (configurable) in the inform
> >>> >>> > method. These instances are then initialized and registered
> >>> >>> > with the core. Whenever a request comes in, the
> >>> >>> > ParallelDataImportHandler delegates the task to these
> >>> >>> > instances, schedules the remainder, and aggregates responses
> >>> >>> > from each of these instances to return to the user.
> >>> >>> >
> >>> >>> > Thankfully, all of this worked, and preliminary benchmarking
> >>> >>> > with 5 million records indicated a 50% decrease in re-indexing
> >>> >>> > time. Moreover, all my cores (Solr in my case is hosted on a
> >>> >>> > quad-core machine) indicated above 70% CPU utilization. All
> >>> >>> > that I could have asked for!
> >>> >>> >
> >>> >>> > With respect to this whole thing, I have a few questions -
> >>> >>> >
> >>> >>> >   1. Is something similar available out of the box?
> >>> >>> >   2. Is the idea flawed? Is the approach fundamentally correct?
> >>> >>> >   3. I am using Solr 1.3. DIH did not have "EventListeners" in
> >>> >>> >   the stone age. I need to know if a DIH instance is done with
> >>> >>> >   its task (mostly the "commit" operation). I could not figure
> >>> >>> >   out a clean way. As a hack, I keep pinging the DIH instances
> >>> >>> >   with command=status at regular intervals (in a separate
> >>> >>> >   thread) to figure out if one is free to be assigned some
> >>> >>> >   task. This works, but obviously with the overhead of
> >>> >>> >   unnecessary wasted CPU cycles. Is there a better approach?
> >>> >>> >   4. I can better the time taken even further if there were a
> >>> >>> >   way for me to tell a DIH instance not to open a new
> >>> >>> >   IndexSearcher. In the current scheme of things, as soon as
> >>> >>> >   one DIH instance is done committing, a new searcher is
> >>> >>> >   opened. This blocks the other DIH instances (which were
> >>> >>> >   active), and they cannot continue until the searcher is
> >>> >>> >   initialized. Is there a way I can implement a single commit
> >>> >>> >   once all these DIH instances are done with their tasks? I
> >>> >>> >   tried each DIH instance with commit=false, without luck.
> >>> >>> >   5. Can this implementation be extended to support other data
> >>> >>> >   sources supported in DIH (HTTP, File, URL, etc.)?
> >>> >>> >   6. If the utility is worth it, can I host this on Google Code
> >>> >>> >   as an open-source contrib?
> >>> >>> >
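The status-polling hack described in question 3 amounts to something like the following. This is a Python sketch with a stubbed status callable; a real version would hit each instance's /dataimport?command=status over HTTP:

```python
import time

def poll_until_idle(get_status, interval=1.0, timeout=60.0):
    """Ping a DIH instance's status until it reports idle, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "idle":   # instance is free for the next slice
            return True
        time.sleep(interval)         # the wasted cycles between polls
    return False
```

This makes the trade-off visible: the polling thread burns CPU and network on every interval even when nothing has changed, which is exactly what an event listener (available in later DIH versions) avoids.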
> >>> >>> > Any help will be deeply acknowledged and appreciated. While
> >>> >>> > suggesting, please don't forget that I am using Solr 1.3. If it
> >>> >>> > all goes well, I don't mind writing one for Solr 1.4.
> >>> >>> >
> >>> >>> > Cheers
> >>> >>> > Avlesh
> >>> >>> >
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> --
> >>> >>> -----------------------------------------------------
> >>> >>> Noble Paul | Principal Engineer| AOL | http://aol.com
> >>> >>>
> >>> >>
> >>> >>
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> -----------------------------------------------------
> >>> Noble Paul | Principal Engineer| AOL | http://aol.com
> >>
> >>
> >
> >
> >
> > --
> > -----------------------------------------------------
> > Noble Paul | Principal Engineer| AOL | http://aol.com
> >
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>
