lucene-solr-user mailing list archives

From Shawn Heisey <s...@elyograg.org>
Subject Re: DIH
Date Sat, 15 Feb 2014 09:07:20 GMT
On 2/14/2014 10:45 PM, William Bell wrote:
> On virtual cores the DIH handler is really slow. On a 12 core box it only
> uses 1 core while indexing.
> 
> Does anyone know how to do Java threading from a SQL query into Solr?
> Examples?
> 
> I can use SolrJ to do it, or I might be able to modify DIH to enable
> threading.
> 
> At some point in 3.x threading was enabled in DIH, but it was removed since
> people were having issues with it (we never did).

If you know how to fix DIH so it can do multiple indexing threads
safely, please open an issue and upload a patch.

I'm still using DIH for full rebuilds, but I'd actually like to replace
it with a rebuild routine written in SolrJ.  I currently achieve decent
speed by running DIH on all my shards at the same time.

I do use SolrJ for once-a-minute index maintenance, but the code that
I've written to pull data out of SQL and write it to Solr is not able to
index millions of documents in a single thread as fast as DIH does.  I
have been building a multithreaded design in my head, but I haven't had
a chance to write real code and see whether it's actually a good design.
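
For illustration only, here is a minimal sketch of one possible multithreaded layout. It is not the design described above and has not been run against Solr. One reader thread builds batches and a fixed pool of sender threads pushes them. `BatchSink` is a hypothetical stand-in for the `server.add(docs)` call, plain strings stand in for SolrInputDocument objects, and every class and method name here is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

class ParallelIndexSketch {

    /** Hypothetical stand-in for the SolrJ call, e.g. server.add(docs). */
    interface BatchSink {
        void send(List<String> batch) throws Exception;
    }

    // Sentinel batch (compared by identity) that tells a sender thread to stop.
    private static final List<String> POISON = new ArrayList<>();

    /**
     * Builds totalDocs fake documents in the calling thread and hands them,
     * batchSize at a time, to senderThreads workers. Returns the number of
     * documents successfully handed to the sink.
     */
    static long run(int totalDocs, int batchSize, int senderThreads, BatchSink sink)
            throws Exception {
        // Bounded queue: the reader blocks when the senders fall behind.
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(senderThreads * 2);
        AtomicLong sent = new AtomicLong();
        AtomicReference<Exception> failure = new AtomicReference<>();
        ExecutorService pool = Executors.newFixedThreadPool(senderThreads);

        for (int i = 0; i < senderThreads; i++) {
            pool.submit(() -> {
                while (true) {
                    List<String> batch = queue.take();
                    if (batch == POISON) return null;
                    // After the first failure, keep draining so the reader never deadlocks.
                    if (failure.get() != null) continue;
                    try {
                        sink.send(batch);
                        sent.addAndGet(batch.size());
                    } catch (Exception e) {
                        failure.compareAndSet(null, e);
                    }
                }
            });
        }

        List<String> batch = new ArrayList<>(batchSize);
        for (int id = 0; id < totalDocs; id++) {
            batch.add("doc-" + id); // stands in for building a document from a SQL row
            if (batch.size() == batchSize) {
                queue.put(batch);
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) queue.put(batch);
        for (int i = 0; i < senderThreads; i++) queue.put(POISON); // one per sender

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        if (failure.get() != null) throw failure.get();
        return sent.get();
    }

    public static void main(String[] args) throws Exception {
        // A no-op sink, comparable to skipping the server.add(docs) line.
        long n = run(100_000, 1_000, 4, docs -> { });
        System.out.println("documents handed to the sink: " + n);
    }
}
```

The bounded queue is the main design choice: it keeps the single reader from running far ahead of Solr and holding millions of documents in memory. SolrJ 4.x also includes ConcurrentUpdateSolrServer, which queues updates and sends them from background threads, so it may be worth evaluating before hand-rolling something like this.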

For me, the bottleneck is definitely Solr, not the database.  I recently
wrote a test program that uses my current SolrJ indexing method.  If I
skip the "server.add(docs)" line, it can read all 91 million docs from
the database and build SolrInputDocument objects for them in 2.5 hours
or less, all with a single thread.  When I do a real rebuild with DIH,
it takes a little more than 4.5 hours -- and that is inherently
multithreaded, because it's doing all the shards simultaneously.  I have
no idea how long it would take with a single-threaded SolrJ program.
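
To put those timings in per-second terms, here is a small sketch of the arithmetic. It uses only the figures quoted above (91 million documents, 2.5 hours, 4.5 hours) and treats them as exact, although the real runs were "2.5 hours or less" and "a little more than 4.5 hours". The class and method names are made up for the example. The result is roughly 10,100 docs/sec for read-and-build alone versus roughly 5,600 docs/sec aggregate across all shards for the full DIH rebuild.

```java
class ThroughputEstimate {

    /** Average documents per second for a run of the given length in hours. */
    static double docsPerSecond(long docs, double hours) {
        return docs / (hours * 3600.0);
    }

    public static void main(String[] args) {
        long docs = 91_000_000L;
        // Read from the database and build documents, without sending to Solr.
        System.out.printf("read + build only: %.0f docs/sec%n", docsPerSecond(docs, 2.5));
        // Full DIH rebuild, all shards running at the same time.
        System.out.printf("full DIH rebuild:  %.0f docs/sec%n", docsPerSecond(docs, 4.5));
    }
}
```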

Thanks,
Shawn

