Reply-To: solr-user@lucene.apache.org
Date: Tue, 26 Apr 2011 20:01:28 -0400
From: Charles Wardell <charles.wardell@bcsolution.com>
To: solr-user@lucene.apache.org
Message-ID: <98F8535AABE147D9A0BA1DFB143EDE31@bcsolution.com>
In-Reply-To: <757199.45080.qm@web130101.mail.mud.yahoo.com>
References: <3CE71CB6-F237-44AA-9FE2-F04C78A6EDCA@bcsolution.com>
 <757199.45080.qm@web130101.mail.mud.yahoo.com>
Subject: Re: Question on Batch process
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="4db75cd8_77465f01_8d93"
Content-Transfer-Encoding: 8bit

--4db75cd8_77465f01_8d93
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Content-Disposition: inline

Thank you Otis.
At the risk of sounding stupid: when you refer to having the params match my number of CPU cores, are you talking about the number of threads I can spawn with the StreamingUpdateSolrServer object?
Up until now I have just been using post.sh or post.jar. Are these capable of that, or do I need to write some code to collect a bunch of files into a buffer and send them off?

Also, do you have a sense of how long it should take to index 100,000 files or, in my case, 100,000,000 documents?
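
To frame the timing question, here is the back-of-the-envelope arithmetic I am working from. The document count comes from my numbers (100,000 files x 1,000 documents each); the throughput rates are purely hypothetical placeholders, not measurements from my machine or anyone else's, and the class name IndexTimeEstimate is just something I made up.

```java
import java.time.Duration;

class IndexTimeEstimate {
    // Wall-clock time to index totalDocs at a constant docsPerSecond rate.
    static Duration estimate(long totalDocs, long docsPerSecond) {
        return Duration.ofSeconds(totalDocs / docsPerSecond);
    }

    public static void main(String[] args) {
        long files = 100_000L;
        long docsPerFile = 1_000L;
        long totalDocs = files * docsPerFile; // 100,000,000 documents

        // ASSUMPTION: hypothetical sustained rates, not benchmark results.
        long[] assumedDocsPerSecond = {1_000L, 5_000L, 20_000L};

        for (long rate : assumedDocsPerSecond) {
            Duration d = estimate(totalDocs, rate);
            System.out.printf("%d docs/sec -> about %.1f hours%n",
                    rate, d.getSeconds() / 3600.0);
        }
    }
}
```

So at a sustained 1,000 docs/sec the load would take roughly 28 hours, and at 20,000 docs/sec under an hour and a half. What I do not know is which end of that range is realistic.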
For reference, the StreamingUpdateSolrServer constructor I am asking about is:

public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int threadCount) throws MalformedURLException
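
To make the question concrete, below is a rough sketch of what I imagine I would otherwise have to write myself. It uses only the JDK and does not call SolrJ at all: BatchPostSketch, BatchSender and indexAll are names I invented, and the sender is a stand-in for the real HTTP post. I am guessing that queueSize and threadCount in the constructor play a similar role (a bounded buffer plus a fixed number of sending threads), but I have not confirmed that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class BatchPostSketch {
    // Sentinel object that tells a worker there is no more work.
    private static final List<String> POISON = new ArrayList<>();

    // HYPOTHETICAL stand-in for whatever actually posts a batch to Solr.
    interface BatchSender {
        void send(List<String> batch);
    }

    // Feeds batches through a bounded queue to threadCount workers and
    // returns the number of documents handed to the sender.
    static long indexAll(List<List<String>> batches, int queueSize,
                         int threadCount, BatchSender sender)
            throws InterruptedException {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(queueSize);
        AtomicLong sent = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);

        for (int i = 0; i < threadCount; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        List<String> batch = queue.take();
                        if (batch == POISON) {
                            return; // this worker is done
                        }
                        sender.send(batch);
                        sent.addAndGet(batch.size());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (List<String> batch : batches) {
            queue.put(batch); // blocks while the queue is full
        }
        for (int i = 0; i < threadCount; i++) {
            queue.put(POISON); // one sentinel per worker
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return sent.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // One worker per CPU the JVM reports (8 on my hyper-threaded quad core).
        int threads = Runtime.getRuntime().availableProcessors();

        // 20 fake "files" of 1,000 placeholder documents each.
        List<List<String>> batches = new ArrayList<>();
        for (int f = 0; f < 20; f++) {
            List<String> batch = new ArrayList<>();
            for (int d = 0; d < 1000; d++) {
                batch.add("<doc>placeholder</doc>");
            }
            batches.add(batch);
        }

        // The sender here does nothing; a real one would post to Solr.
        long n = indexAll(batches, 10, threads, batch -> { });
        System.out.println("handed " + n + " docs to the sender using "
                + threads + " threads");
    }
}
```

There is no error handling or retry here: if a send failed, the worker would die and the producer could block on a full queue. If StreamingUpdateSolrServer already does the equivalent of this for me, I would much rather use it than maintain my own version.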

Thanks again,
Charlie

-- 
Best Regards,

Charles Wardell
Blue Chips Technology, Inc.
www.bcsolution.com

On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote: 
> Charlie,
> 
> How's this:
> * -Xmx2g
> * ramBufferSizeMB 512
> * mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows)
> * ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
> * use StreamingUpdateSolrServer (with params matching your number of CPU cores) 
> or send batches of, say, 1000 docs with the other SolrServer impl using N threads 
> (N = number of your CPU cores)
> 
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> ----- Original Message ----
> > From: Charles Wardell <charles.wardell@bcsolution.com>
> > To: solr-user@lucene.apache.org
> > Sent: Tue, April 26, 2011 2:32:29 PM
> > Subject: Question on Batch process
> > 
> > I am sure that this question has been asked a few times, but I can't seem to 
> > find the sweet spot for indexing.
> > 
> > I have about 100,000 files, each containing 1,000 XML documents, ready to be 
> > posted to Solr. I want the initial load to index as quickly as possible; once 
> > it completes, the daily stream of ADDs will be small in comparison.
> > 
> > The individual documents are small, essentially web postings from the net 
> > with three fields: Title, postPostContent, date. 
> > 
> > 
> > What would be the ideal configuration for ramBufferSizeMB, mergeFactor, 
> > maxBufferedDocs, etc.?
> > 
> > My machine is a hyper-threaded quad core, so it shows up as 8 CPUs in top.
> > I have 16GB of available RAM.
> > 
> > 
> > Thanks in advance.
> > Charlie
> 

--4db75cd8_77465f01_8d93--