Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of yydzero@gmail.com designates
 209.85.212.182 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <9297F6B1-867C-4F9D-8D45-CC53868046A3@robustlinks.com>
References: <9297F6B1-867C-4F9D-8D45-CC53868046A3@robustlinks.com>
Date: Sun, 11 Mar 2012 22:18:53 +0800
Message-ID: 
 <CA+ZQTxrT9WCXV8gt4aYCYRTm98mdoVS3LAnZ+DyoB9xuTjEiMg@mail.gmail.com>
Subject: Re: Faster Solr Indexing
From: Yandong Yao <yydzero@gmail.com>
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=f46d043c81bc2de9c004baf84e9d

--f46d043c81bc2de9c004baf84e9d
Content-Type: text/plain; charset=ISO-8859-1

I have a similar issue when using DIH:
org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
consumes most of the time when indexing 10K rows (each row is about 70K):
    -  DIH nextRow takes about 10 seconds in total
    -  If the index uses the whitespace tokenizer and lower case filter, then
addDoc() takes about 80 seconds
    -  If the index uses the whitespace tokenizer, lower case filter and WDF
(WordDelimiterFilter), then addDoc() takes about 112 seconds
    -  If the index uses the whitespace tokenizer, lower case filter, WDF and
Porter stemmer, then addDoc() takes about 145 seconds
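
To put those numbers in perspective, here is a rough sketch of the
arithmetic: the addDoc() cost per row for each analyzer chain, the extra
time each added filter costs, and a linear projection to a larger index.
The class and method names are only illustrative, and the 1,000,000-row
target is an assumption standing in for "more than a million rows"; the
projection also assumes the per-row cost stays constant as the index
grows, which may not hold once segment merges kick in.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AddDocCost {
    // Number of rows indexed in each measured run (from the timings above).
    static final int MEASURED_ROWS = 10_000;

    // Assumed target size; the real table has "more than a million" rows.
    static final long TARGET_ROWS = 1_000_000L;

    /** Average addDoc() time per row, in milliseconds. */
    static double msPerDoc(double totalSeconds, int rows) {
        return totalSeconds * 1000.0 / rows;
    }

    /** Linear projection of total addDoc() time, in hours, for targetRows. */
    static double projectedHours(double totalSeconds, int rows, long targetRows) {
        return totalSeconds / rows * targetRows / 3600.0;
    }

    public static void main(String[] args) {
        // Measured addDoc() totals in seconds, in the order the filters were added.
        Map<String, Double> measured = new LinkedHashMap<>();
        measured.put("whitespace + lowercase", 80.0);
        measured.put("whitespace + lowercase + WDF", 112.0);
        measured.put("whitespace + lowercase + WDF + Porter", 145.0);

        double previous = 0.0;
        for (Map.Entry<String, Double> e : measured.entrySet()) {
            double seconds = e.getValue();
            // Extra seconds relative to the previous, shorter chain
            // (for the first entry this is simply the baseline cost).
            double delta = seconds - previous;
            System.out.printf("%-40s %6.1f ms/row  (+%.0f s)  ~%.2f h for %d rows%n",
                    e.getKey(),
                    msPerDoc(seconds, MEASURED_ROWS),
                    delta,
                    projectedHours(seconds, MEASURED_ROWS, TARGET_ROWS),
                    TARGET_ROWS);
            previous = seconds;
        }
    }
}
```

Read this way, the basic whitespace/lowercase chain already accounts for
more than half of the addDoc() time, and WDF and the Porter stemmer each
add roughly another 30 seconds per 10K rows.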

We have more than a million rows in total, and I am wondering whether I am
doing something wrong, or whether there is any way to improve the
performance of addDoc().

Thanks very much in advance!


The configuration is as follows:
1) JVM:  -Xms256M -Xmx1048M -XX:MaxPermSize=512m
2) Solr version 3.5
3) solrconfig.xml (copied almost unchanged from Solr's example/solr directory):

  <indexDefaults>

    <useCompoundFile>false</useCompoundFile>

    <mergeFactor>10</mergeFactor>
    <!-- Sets the amount of RAM that may be used by Lucene indexing
         for buffering added documents and deletions before they are
         flushed to the Directory.  -->
    <ramBufferSizeMB>64</ramBufferSizeMB>
    <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then
         Lucene will flush based on whichever limit is hit first.
      -->
    <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->

    <maxFieldLength>2147483647</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>

    <lockType>native</lockType>
  </indexDefaults>

2012/3/11 Peyman Faratin <peyman@robustlinks.com>

> Hi
>
> I am trying to index 12MM docs faster than is currently happening in Solr
> (using solrj). We have identified solr's add method as the bottleneck (and
> not commit - which is tuned ok through mergeFactor and maxRamBufferSize and
> jvm ram).
>
> Adding 1000 docs is taking approximately 25 seconds. We are making sure we
> add and commit in batches. And we've tried both CommonsHttpSolrServer and
> EmbeddedSolrServer (assuming removing http overhead would speed things up
> with embedding) but the differences is marginal.
>
> The docs being indexed are on average 20 fields long, mostly indexed but
> none stored. The major size contributors are two fields:
>
>        - content, and
>        - shingledContent (populated using copyField of content).
>
> The length of the content field is (likely) gaussian distributed (few
> large docs 50-80K tokens, but majority around 2k tokens). We use
> shingledContent to support phrase queries and content for unigram queries
> (following the advice of Solr Enterprise search server advice - p. 305,
> section "The Solution: Shingling").
>
> Clearly the size of the docs is a contributor to the slow adds (confirmed
> by removing these 2 fields resulting in halving the indexing time). We've
> tried compressed=true also but that is not working.
>
> Any guidance on how to support our application logic (without having to
> change the schema too much) and speed the indexing speed (from current 212
> days for 12MM docs) would be much appreciated.
>
> thank you
>
> Peyman
>
>

--f46d043c81bc2de9c004baf84e9d--