Date: Sun, 29 Mar 2009 11:12:19 +0530
From: Noble Paul നോബിള്‍  नोब्ळ्
To: solr-user@lucene.apache.org
Subject: Re: How to optimize Index Process?
Message-ID: <5e76b0ad0903282242j4b5b2469q8a3a2c717d8732b3@mail.gmail.com>
In-Reply-To: <458448.98864.qm@web50309.mail.re2.yahoo.com>

On Sat, Mar 28, 2009 at 7:38 AM, Otis Gospodnetic wrote:
>
> Hi,
>
> Answers inlined.
>
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
>> We have a distributed Solr system (2-3 boxes with each running 2
>> instances of Solr and each Solr instance can write to multiple cores).
>
> Is this really optimal? How many CPU cores do your boxes have vs. the number of Solr cores?
>
>> Our use case is high index volume - we can get up to 100 million
>> records (1 record = 500 bytes) per day, but very low query traffic
>> (only administrators may need to search for data - once an hour or
>> so). So, we need very fast index time. Here are the things I'm trying
>> to find out in order to optimize our index process,
>
> It's starting to sound like you might be able to batch your data and use http://wiki.apache.org/solr/UpdateCSV -- it's the fastest indexing method, I believe.

Does CSV work with multivalued fields? If not, using SolrJ with BinaryRequestWriter is quite fast.

>
>> 1) What's the optimum index size? I've noticed as the index size grows
>> the indexing time starts increasing.
>> In our test less than 10G index
>> size we could index over 2K/sec, but as it grows over 20G the index
>> rate drops to 1400/sec and keeps dropping as index size grows. I'm
>> trying to see whether we can partition (create new SolrCore) after
>> 10G.
>
> That's likely due to Lucene's segment merging. You can make mergeFactor bigger to make segment merging less frequent, but don't make it too high or you'll run into open file descriptor limits (which you could raise, of course).
>
>>    - related question, is there a way to find the SolrCore size (any
>> web service for that?) - based on that information I can create a new
>> core and freeze the one which has reached 10G.
>
> You can see the number of docs in an index via Admin Statistics page (the response is actually XML, look at the source)
>
>> 2) In our test, we noticed that after few hours (after 8 hours of
>> indexing) there is a period (3-4 hours period) where the indexing is
>> very-very slow (like 500 records/sec) and after that period indexing
>> returns back to normal rate (1500/sec). Does Solr run any optimize
>> command on its own? How can we find that out? I'm not issuing any
>> optimize command - should I be doing that after certain time?
>
> No, it doesn't run optimize on its own. It could be running auto-commit, but you should comment that out anyway. Try doing a thread dump to see what's going on and watching the system with top, vmstat.
> No, you shouldn't optimize until you are completely done.
>
>> 3) Every time I add new documents (10K at once) to the index I see
>> searcher closing and then re-opening/re-warming (in Catalina.out)
>> after commit is done. I'm not sure if this is an expensive operation.
>> Since, our search volume is very low can I configure Solr to not do
>> this? Would it make indexing any faster?
>
> Are you running the commit command after every 10K docs? No need to do that if you don't need your searcher to see the changes immediately.
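
To illustrate the point about not committing after every 10K docs, here is a minimal sketch in Python. It only builds Solr XML update messages (`<add>` and `<commit/>`) and emits a single commit after the last batch; posting them to the update handler is left out. The helper names `add_message` and `update_messages` and the `id` field are made up for this example, and the batch size of 10,000 simply mirrors the number mentioned in the question.

```python
from xml.sax.saxutils import escape, quoteattr


def add_message(docs):
    """Build one Solr XML <add> message from a list of {field: value} dicts."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            # Escape field names and values so the message stays well-formed XML.
            parts.append(f"<field name={quoteattr(name)}>{escape(str(value))}</field>")
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def update_messages(docs, batch_size=10_000):
    """Yield one <add> message per batch, then a single <commit/> at the end."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield add_message(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield add_message(batch)
    # One commit for the whole run instead of one per batch.
    yield "<commit/>"


if __name__ == "__main__":
    # Hypothetical documents with only an "id" field.
    for message in update_messages(({"id": i} for i in range(25)), batch_size=10):
        print(message[:60])
```

With this shape the searcher would be reopened and re-warmed once per run rather than once per 10K documents, at the cost of new documents not being searchable until the final commit.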
>
>> Mar 26, 2009 11:59:45 PM org.apache.solr.search.SolrIndexSearcher close
>> INFO: Closing Searcher@33d9337c main
>> Mar 26, 2009 11:59:52 PM org.apache.solr.update.DirectUpdateHandler2 commit
>> INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
>> Mar 26, 2009 11:59:52 PM org.apache.solr.search.SolrIndexSearcher
>> INFO: Opening Searcher@46ba6905 main
>> Mar 26, 2009 11:59:52 PM org.apache.solr.search.SolrIndexSearcher warm
>> INFO: autowarming Searcher@46ba6905 main from Searcher@5c5ffecd main
>>
>> 4) Anything else (any other configuration in Solr - I'm currently
>> using all default settings in the solrconfig.xml and default handlers)
>> that could help optimize my indexing process?
>
> Increase ramBufferSizeMB as much as you can afford.
> Comment out maxBufferedDocs, it's deprecated.
> Increase mergeFactor slightly.
> Consider the CSV approach.
> Index with multiple threads (match the number of CPU cores).
> If you are using Solrj, use the Streaming version of SolrServer.
> Give the JVM more memory (you'll need it if you increase ramBufferSizeMB)
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>

-- 
--Noble Paul
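
For reference, the figures quoted in the thread can be put side by side with a little arithmetic. The sketch below is in Python and assumes, purely for illustration, that the 100 million records arrive evenly over 24 hours; the thread does not say how the load is distributed. The function names are made up for this example, and the rates are the ones reported in the original question.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the original question.
RECORDS_PER_DAY = 100_000_000
BYTES_PER_RECORD = 500


def required_rate(records_per_day=RECORDS_PER_DAY):
    """Sustained records/sec needed, assuming arrivals are spread evenly over a day."""
    return records_per_day / SECONDS_PER_DAY


def raw_bytes_per_day(records_per_day=RECORDS_PER_DAY, bytes_per_record=BYTES_PER_RECORD):
    """Raw input volume per day in bytes (not the resulting index size)."""
    return records_per_day * bytes_per_record


def daily_capacity(records_per_sec):
    """Records indexed in one day at a constant rate."""
    return records_per_sec * SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"required rate: {required_rate():.0f} records/sec")
    print(f"raw input: {raw_bytes_per_day() / 1e9:.0f} GB/day")
    # Rates reported in the question for different index sizes and periods.
    for label, rate in [("index under 10G", 2000), ("index over 20G", 1400),
                        ("normal", 1500), ("slow period", 500)]:
        print(f"{label}: {daily_capacity(rate):,} records/day")
```

Under that even-arrival assumption the reported 1400/sec still covers the daily volume, while the 500/sec slow period does not, so the 3-4 hour slowdowns matter more than the gradual decline.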