lucene-solr-user mailing list archives

From Mark Bennett <mbenn...@ideaeng.com>
Subject Re: Indexing performance with solrj vs. direct lucene API
Date Thu, 29 Nov 2012 23:00:41 GMT
Hi Robert,

SolrJ sends data over a socket, so that might explain some of the lag.
Are your SolrJ app and the Solr server running on the same physical
machine?

I thought Mark M's idea sounded good.

One other idea:

When initializing SolrJ's connection for normal searching you probably use
HttpSolrServer.

But when doing massive updates, you might consider using
ConcurrentUpdateSolrServer instead.
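
To illustrate why that can help: as I understand it, ConcurrentUpdateSolrServer
buffers added documents in a queue and lets background threads stream them to
Solr, so the caller does not wait for a full round trip on every add(). The
sketch below shows only that queue-and-worker pattern using plain JDK classes;
it is not SolrJ code. The class QueuedUpdateSketch, its methods, and the String
stand-in for a document are all hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of a buffered, multi-threaded update client.
// A String stands in for a document; send() stands in for the HTTP write.
class QueuedUpdateSketch {
    // Sentinel that tells a worker to stop; compared by identity, not equals().
    private static final String POISON = new String("POISON");

    private final BlockingQueue<String> queue;
    private final List<Thread> workers = new ArrayList<>();
    private final AtomicInteger sent = new AtomicInteger();

    QueuedUpdateSketch(int queueSize, int threadCount) {
        queue = new ArrayBlockingQueue<>(queueSize);
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(this::drain, "update-worker-" + i);
            t.start();
            workers.add(t);
        }
    }

    // Queues a document; blocks only while the queue is full.
    void add(String doc) {
        try {
            queue.put(doc);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queueing", e);
        }
    }

    // Worker loop: take documents until the stop sentinel arrives.
    private void drain() {
        try {
            while (true) {
                String doc = queue.take();
                if (doc == POISON) {
                    return;
                }
                send(doc);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Placeholder for writing the document to the server.
    private void send(String doc) {
        sent.incrementAndGet();
    }

    // Sends one stop sentinel per worker, waits for all workers to exit,
    // and returns how many documents were handed to send().
    int finish() {
        for (int i = 0; i < workers.size(); i++) {
            add(POISON);
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return sent.get();
    }

    public static void main(String[] args) {
        QueuedUpdateSketch client = new QueuedUpdateSketch(100, 4);
        for (int i = 0; i < 1000; i++) {
            client.add("doc-" + i);
        }
        System.out.println("documents sent: " + client.finish());
    }
}
```

Because the queue is bounded, add() only blocks once the workers fall behind,
and the thread count caps how many sends run at once. With the real client the
queue size and thread count are, I believe, constructor arguments, but check
the SolrJ javadoc for your version before relying on that.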

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Wed, Nov 28, 2012 at 10:02 AM, Robert Stewart <bstewart.ny@gmail.com> wrote:

> I have a project where I am porting an existing application from direct
> Lucene API usage to using SOLR and the SOLRJ client API.
>
> The problem I have is that indexing is 2-5x slower using SOLRJ+SOLR
> than using direct Lucene API.
>
> I am creating batches of documents between 200 and 500 documents per
> call to add() using SOLRJ.
>
> I tried adjusting SOLR parameters for indexing but it did not make any
> difference.
>
> Documents are identical (same fields) in both cases.
>
> Nearly identical settings for tokenizing/analyzing/indexing/storing
> for each field with Lucene and SOLR.
>
> What could be the possible bottleneck in this case?  Can there be
> significant overhead unpacking a batch of documents in a request?  Is
> there some SOLR overhead in the update handler?
>
> I have tried both SOLR 3.6 and 4.0 with very similar results.
>
> When using SOLR 4.0 I have transaction logging (for NRT search) turned off.
>
> I am also NOT using a unique ID field.
>
> Performance for indexing 200 documents is around 250ms on SOLR, about
> 60ms on Lucene.
>
> I see that the response time measured around the call to the SOLRJ API
> add() method and the response time logged in the SOLR log are nearly the
> same, so there is very little network overhead in this case.
>
> Is this a typical amount of overhead for SOLRJ+SOLR vs. the local Lucene API?
>
> The reason it matters in this case is that the application needs to
> rebuild the index once per day, which currently takes about 45 minutes.
> Using SOLRJ+SOLR it will take several hours, which is a show stopper in
> this case.
>
> Thanks.
>
