Date: Sun, 11 Mar 2012 22:18:53 +0800
Subject: Re: Faster Solr Indexing
From: Yandong Yao
To: solr-user@lucene.apache.org

I have a similar issue using DIH: org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) consumes most of the time when indexing 10K rows (each row is about 70K):

- DIH nextRow takes about 10 seconds in total
- If the index uses the whitespace tokenizer and lower case filter, then addDoc() takes about 80 seconds
- If the index uses the whitespace tokenizer, lower case filter and WDF, then addDoc() takes about 112 seconds
- If the index uses the whitespace tokenizer, lower case filter, WDF and porter stemmer, then addDoc() takes about 145 seconds

We have more than a million rows in total, and I am wondering whether I am doing something wrong, or whether there is any way to improve the performance of addDoc()?

Thanks very much in advance!

Following is the configuration:
1) JVM: -Xms256M -Xmx1048M -XX:MaxPermSize=512m
2) Solr version 3.5
3) solrconfig.xml (almost copied from solr's example/solr directory.)

false 10 64 2147483647 1000 10000 native

2012/3/11 Peyman Faratin

> Hi
>
> I am trying to index 12MM docs faster than is currently happening in Solr
> (using solrj). We have identified solr's add method as the bottleneck (and
> not commit - which is tuned ok through mergeFactor and maxRamBufferSize and
> jvm ram).
>
> Adding 1000 docs is taking approximately 25 seconds. We are making sure we
> add and commit in batches. And we've tried both CommonsHttpSolrServer and
> EmbeddedSolrServer (assuming removing http overhead would speed things up
> with embedding) but the difference is marginal.
>
> The docs being indexed are on average 20 fields long, mostly indexed but
> none stored. The major size contributors are two fields:
>
>    - content, and
>    - shingledContent (populated using copyField of content).
>
> The length of the content field is (likely) gaussian distributed (few
> large docs 50-80K tokens, but majority around 2k tokens). We use
> shingledContent to support phrase queries and content for unigram queries
> (following the advice of Solr Enterprise Search Server, p. 305,
> section "The Solution: Shingling").
>
> Clearly the size of the docs is a contributor to the slow adds (confirmed
> by removing these 2 fields resulting in halving the indexing time). We've
> tried compressed=true also but that is not working.
>
> Any guidance on how to support our application logic (without having to
> change the schema too much) and improve the indexing speed (from current 212
> days for 12MM docs) would be much appreciated.
>
> thank you
>
> Peyman
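
The addDoc() timings reported in the reply can be turned into a per-row cost and a rough projection for a larger index. The following is only a sketch under stated assumptions: it assumes the cost grows linearly with the number of rows, and the target of exactly one million rows is hypothetical (the reply only says "more than a million"). The function names are made up for illustration and are not part of Solr.

```python
# Seconds spent in addDoc() for a sample of 10K rows, as reported in the reply.
SAMPLE_ROWS = 10_000
ADD_DOC_SECONDS = {
    "whitespace + lowercase": 80,
    "whitespace + lowercase + WDF": 112,
    "whitespace + lowercase + WDF + porter stemmer": 145,
}

# Hypothetical target size; the reply only says "more than a million rows".
TARGET_ROWS = 1_000_000


def ms_per_row(seconds, rows=SAMPLE_ROWS):
    """Average addDoc() cost per row in milliseconds."""
    return seconds * 1000 / rows


def projected_seconds(seconds, target_rows=TARGET_ROWS, rows=SAMPLE_ROWS):
    """Projected addDoc() time for target_rows, assuming linear scaling."""
    return seconds * target_rows / rows


if __name__ == "__main__":
    for chain, secs in ADD_DOC_SECONDS.items():
        hours = projected_seconds(secs) / 3600
        print(f"{chain}: {ms_per_row(secs):.1f} ms/row, "
              f"{hours:.1f} h per million rows")
```

Under these assumptions the analysis chain alone moves the average cost from 8 ms to 14.5 ms per row, which is why comparing analyzer configurations on a small sample is a cheap first step before changing JVM or merge settings.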