Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <458448.98864.qm@web50309.mail.re2.yahoo.com>
References: <a206fb7e0903271258o355d2211ka2ff94eabbe141fe@mail.gmail.com>
	<458448.98864.qm@web50309.mail.re2.yahoo.com>
Date: Sun, 29 Mar 2009 11:12:19 +0530
Message-ID: <5e76b0ad0903282242j4b5b2469q8a3a2c717d8732b3@mail.gmail.com>
Subject: Re: How to optimize Index Process?
From: 
 =?UTF-8?B?Tm9ibGUgUGF1bCDgtKjgtYvgtKzgtL/gtLPgtY3igI0gIOCkqOCli+CkrOCljeCks+CljQ==?=
 <noble.paul@gmail.com>
To: solr-user@lucene.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

On Sat, Mar 28, 2009 at 7:38 AM, Otis Gospodnetic
<otis_gospodnetic@yahoo.com> wrote:
>
> Hi,
>
> Answers inlined.
>
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
>> We have a distributed Solr system (2-3 boxes with each running 2
>> instances of Solr and each Solr instance can write to multiple cores).
>
> Is this really optimal? How many CPU cores do your boxes have vs. the
> number of Solr cores?
>
>> Our use case is high index volume - we can get up to 100 million
>> records (1 record = 500 bytes) per day, but very low query traffic
>> (only administrators may need to search for data - once an hour or
>> so). So, we need very fast index time. Here are the things I'm trying
>> to find out in order to optimize our index process:
>
> It's starting to sound like you might be able to batch your data and use
> http://wiki.apache.org/solr/UpdateCSV -- it's the fastest indexing method,
> I believe.

Does CSV work with multivalued fields?
If not, SolrJ with BinaryRequestWriter is quite fast.
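
For reference, the UpdateCSV page describes per-field split and separator
parameters (f.<field>.split and f.<field>.separator), which look like the
way to load a multivalued field from one CSV column. A minimal sketch of
preparing such a batch, assuming hypothetical field names (id, tags), a "|"
separator inside the multivalued column, and values that never contain the
separator themselves:

```python
import csv
import io
from urllib.parse import urlencode


def build_csv_batch(records, fieldnames, multivalued, sep="|"):
    """Serialize records to CSV, joining multivalued fields with `sep`."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(fieldnames)  # header row names the Solr fields
    for rec in records:
        row = []
        for name in fieldnames:
            value = rec.get(name, "")
            if name in multivalued:
                # Assumption: individual values never contain `sep`.
                value = sep.join(value)
            row.append(value)
        writer.writerow(row)
    return buf.getvalue()


def build_update_params(multivalued, sep="|"):
    """Build the query string asking the CSV loader to split those fields."""
    params = [("commit", "false")]  # leave committing to a later, explicit step
    for name in sorted(multivalued):
        params.append((f"f.{name}.split", "true"))
        params.append((f"f.{name}.separator", sep))
    return urlencode(params)


if __name__ == "__main__":
    docs = [{"id": "1", "tags": ["a", "b"]}, {"id": "2", "tags": ["c"]}]
    print(build_csv_batch(docs, ["id", "tags"], {"tags"}))
    print(build_update_params({"tags"}))
```

The CSV body and the query string would then be posted to the CSV update
handler of whichever core is being written to; that URL depends on the
deployment and is left out here.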
>
>> 1) What's the optimum index size? I've noticed as the index size grows
>> the indexing time starts increasing. In our test less than 10G index
>> size we could index over 2K/sec, but as it grows over 20G the index
>> rate drops to 1400/sec and keeps dropping as index size grows. I'm
>> trying to see whether we can partition (create new SolrCore) after
>> 10G.
>
> That's likely due to Lucene's segment merging. You can make mergeFactor
> bigger to make segment merging less frequent, but don't make it too high or
> you'll run into open file descriptor limits (which you could raise, of
> course).
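
To make the mergeFactor trade-off concrete, here is a toy model of
level-based merging: every flush writes one small segment, and whenever
mergeFactor segments pile up at one level they are merged into a single
segment at the next level. This is a simplification, not Lucene's actual
merge policy (it ignores segment sizes, deletes and the cost of each
merge), and the function name is made up for illustration:

```python
def simulate_merges(flushes, merge_factor):
    """Return (merge_count, peak_segment_count) for a simplified merge model."""
    levels = []  # levels[i] = number of live segments at level i
    merges = 0
    peak = 0
    for _ in range(flushes):
        if not levels:
            levels.append(0)
        levels[0] += 1  # each flush writes one new small segment
        level = 0
        # Cascade: merge_factor segments at one level become one at the next.
        while levels[level] == merge_factor:
            levels[level] = 0
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            merges += 1
            level += 1
        peak = max(peak, sum(levels))
    return merges, peak


if __name__ == "__main__":
    for factor in (10, 30):
        merges, peak = simulate_merges(1000, factor)
        print(f"mergeFactor={factor}: merges={merges}, peak segments={peak}")
```

In this model a larger mergeFactor means fewer merges but more live
segments at the peak, and each segment keeps several files open, which is
where the file descriptor limit comes in.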
>
>>    - related question, is there a way to find the SolrCore size (any
>> web service for that?) - based on that information I can create a new
>> core and freeze the one which has reached 10G.
>
> You can see the number of docs in an index via the Admin Statistics page
> (the response is actually XML; look at the source).
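
If the size on disk is what matters for the 10G cut-over, one option that
does not depend on any Solr API is to add up the files in the core's index
directory. A rough sketch, assuming the indexing client can read that
directory; the function names are hypothetical, and the directory path and
threshold are whatever your setup uses:

```python
import os


def index_size_bytes(index_dir):
    """Sum the sizes of all files under index_dir (the core's index directory)."""
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                total += os.path.getsize(path)
            except OSError:
                # A merge may delete a segment file between listing and stat.
                pass
    return total


def needs_new_core(index_dir, limit_bytes):
    """True once the index on disk has reached the chosen size limit."""
    return index_size_bytes(index_dir) >= limit_bytes
```

Note that the number fluctuates while merges are running, since old and new
segment files briefly coexist, so the check is only approximate.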
>
>> 2) In our test, we noticed that after a few hours (after 8 hours of
>> indexing) there is a period (3-4 hours) where the indexing is
>> very slow (around 500 records/sec), and after that period indexing
>> returns to the normal rate (1500/sec). Does Solr run any optimize
>> command on its own? How can we find that out? I'm not issuing any
>> optimize command - should I be doing that after a certain time?
>
> No, it doesn't run optimize on its own. It could be running auto-commit,
> but you should comment that out anyway. Try doing a thread dump to see
> what's going on and watching the system with top, vmstat.
> No, you shouldn't optimize until you are completely done.
>
>> 3) Every time I add new documents (10K at once) to the index I see
>> searcher closing and then re-opening/re-warming (in Catalina.out)
>> after commit is done. I'm not sure if this is an expensive operation.
>> Since our search volume is very low, can I configure Solr to not do
>> this? Would it make indexing any faster?
>
> Are you running the commit command after every 10K docs? No need to do
> that if you don't need your searcher to see the changes immediately.
>
>> Mar 26, 2009 11:59:45 PM org.apache.solr.search.SolrIndexSearcher close
>> INFO: Closing Searcher@33d9337c main
>> Mar 26, 2009 11:59:52 PM org.apache.solr.update.DirectUpdateHandler2 commit
>> INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
>> Mar 26, 2009 11:59:52 PM org.apache.solr.search.SolrIndexSearcher
>> INFO: Opening Searcher@46ba6905 main
>> Mar 26, 2009 11:59:52 PM org.apache.solr.search.SolrIndexSearcher warm
>> INFO: autowarming Searcher@46ba6905 main from Searcher@5c5ffecd main
>>
>> 4) Anything else (any other configuration in Solr - I'm currently
>> using all default settings in the solrconfig.xml and default handlers)
>> that could help optimize my indexing process?
>
> Increase ramBufferSizeMB as much as you can afford.
> Comment out maxBufferedDocs, it's deprecated.
> Increase mergeFactor slightly.
> Consider the CSV approach.
> Index with multiple threads (match the number of CPU cores).
> If you are using Solrj, use the Streaming version of SolrServer.
> Give the JVM more memory (you'll need it if you increase ramBufferSizeMB)
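
Two of the points above (index with multiple threads, and don't commit
after every 10K docs) can be sketched together on the client side. This is
only an outline: send_batch and commit are placeholders for whatever
actually talks to Solr (an HTTP post, a SolrJ call, etc.), and the batch
size and worker count are assumptions to tune:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_all(docs, send_batch, commit, batch_size=10_000, workers=None):
    """Send batches from several threads, then commit once at the end.

    send_batch(batch) must return the number of docs it sent;
    commit() is called exactly once, after every batch has been sent.
    """
    workers = workers or os.cpu_count() or 1  # default: one thread per CPU core
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sent = sum(pool.map(send_batch, chunked(docs, batch_size)))
    commit()  # one commit means one searcher reopen and warm-up, not one per batch
    return sent
```

For example, index_all(docs, my_sender, my_commit, workers=4) would push
the batches from four threads and trigger a single commit afterwards, where
my_sender and my_commit are your own functions.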
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>


-- 
--Noble Paul