lucene-solr-user mailing list archives

From Peri Subrahmanya <peri.subrahma...@htcinc.com>
Subject Re: Parallel Indexing
Date Mon, 22 Dec 2014 15:58:07 GMT
Thanks, guys, for the quick responses. I need to take the suggestions, incorporate them, figure
out how we are doing the fetching, etc., and reply back on this post. The suggestions
have been very helpful in taking this forward for us here.

Thanks
-Peri.S

> On Dec 22, 2014, at 10:32 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> 
> Just to pile on....
> 
> _very_ frequently in my experience the problem
> is not Solr at all, but acquiring the data in the
> first place, i.e. often executing the DB query.
> 
> A very simple test (in the SolrJ world) is to just comment
> out the server.add(doclist) call. If the run takes about as
> long without it, Solr is not the bottleneck.
> 
> Assuming you're using SolrJ, you _are_ indexing in
> batches, right? And you are _not_ committing from
> the program, right? And.... As Hossman often says,
> details matter.
> 
> Also, take a look at your Solr server CPU utilization. You
> can get a crude idea of how much work it's doing;
> unless you have it running at 100%, your bottleneck is
> on the acquisition side.
> 
> For a benchmark (admittedly not directly comparable),
> I can index 11M Wikipedia docs on my laptop in < 1
> hour without tuning anything. They're in XML format
> so data acquisition is very fast...
> 
> Best,
> Erick
> 
> On Mon, Dec 22, 2014 at 7:21 AM, Mikhail Khludnev
> <mkhludnev@griddynamics.com> wrote:
>> What is your indexer built on? Do you use SolrJ, just REST, or
>> DataImportHandler? What is your DB schema, briefly?
>> Frankly speaking, there are a few approaches to handling indexing concurrently;
>> the details depend on the answers to the questions above.
>> 
>> On Mon, Dec 22, 2014 at 5:54 PM, Peri Subrahmanya <
>> peri.subrahmanya@htcinc.com> wrote:
>>> 
>>> Hi,
>>> 
>>> We have millions of records in our db that we do a complete re-index of
>>> every fortnight or so. It takes around 11 hours, and I was wondering
>>> if there was a way to fetch the records in batches in parallel and issue the
>>> Solr HTTP command with the Solr docs in parallel. Please let me know.
>>> 
>>> Thanks
>>> -Peri.S
>>> http://www.kuali.org/ole
>>> 
>>> 
>>> 
>>> *** DISCLAIMER *** This is a PRIVATE message. If you are not the intended
>>> recipient, please delete without copying and kindly advise us by e-mail of
>>> the mistake in delivery.
>>> NOTE: Regardless of content, this e-mail shall not operate to bind HTC
>>> Global Services to any order or other contract unless pursuant to explicit
>>> written agreement or government initiative expressly permitting the use of
>>> e-mail for such purpose.
>>> 
>> 
>> 
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Principal Engineer,
>> Grid Dynamics
>> 
>> <http://www.griddynamics.com>
>> <mkhludnev@griddynamics.com>
> 
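
The advice in this thread (index in batches, do not commit from the program, and time the run with the Solr call disabled to see whether data acquisition is the real bottleneck) can be sketched roughly as below. This is only an illustration in Python, not the poster's indexer and not SolrJ: fetch_batch, send_batch and index_range are hypothetical names, the "DB query" and the "post to Solr" are stand-ins that do no I/O, and the batch size and worker count are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_batch(offset, size):
    """Hypothetical stand-in for a paged DB query (e.g. a key range or LIMIT/OFFSET)."""
    return [{"id": str(i)} for i in range(offset, offset + size)]


def send_batch(docs):
    """Hypothetical stand-in for posting one batch of docs to Solr.

    No commit is issued here; committing is left to the server configuration.
    """
    return len(docs)


def index_range(total, batch_size, workers, send=True):
    """Fetch and optionally send batches in parallel.

    Returns (number of docs handled, elapsed seconds).
    """
    def work(offset):
        docs = fetch_batch(offset, min(batch_size, total - offset))
        # send=False times acquisition alone, which is the same idea as
        # commenting out server.add(doclist) in a SolrJ program.
        return send_batch(docs) if send else len(docs)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        handled = sum(pool.map(work, range(0, total, batch_size)))
    return handled, time.perf_counter() - start


if __name__ == "__main__":
    fetched, fetch_secs = index_range(10_000, 1_000, workers=4, send=False)
    indexed, total_secs = index_range(10_000, 1_000, workers=4, send=True)
    print(f"fetch only: {fetched} docs in {fetch_secs:.3f}s")
    print(f"fetch+send: {indexed} docs in {total_secs:.3f}s")
```

If the fetch-only time is close to the fetch+send time in a real setup, the time is going into acquiring the data rather than into Solr, and adding parallelism on the Solr side is unlikely to help much.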



