lucene-solr-user mailing list archives

From Charles Wardell <charles.ward...@bcsolution.com>
Subject Re: Question on Batch process
Date Wed, 27 Apr 2011 23:51:20 GMT
Thank you for your response. I have not written the StreamingUpdate application yet, but I did
change the other settings that you mentioned. They gave me a huge boost in indexing speed. (I
am still using post.sh but hope to change that soon.)

One thing I noticed is that indexing was incredibly fast last night, but today the commits
are taking a very long time. Is this to be expected?



-- 
Best Regards,

Charles Wardell
Blue Chips Technology, Inc.
www.bcsolution.com

On Wednesday, April 27, 2011 at 6:15 PM, Otis Gospodnetic wrote: 
> Hi Charles,
> 
> Yes, the threads I was referring to are in the context of the client/indexer, so 
> one of the params for StreamingUpdateSolrServer.
> post.sh/jar are just there because they are handy. Don't use them for 
> production.
> 
> It's impossible to tell how long indexing of 100M documents may take. They 
> could be very big or very small. You could perform very light or no analysis or 
> heavy analysis. They could contain 1 or 100 fields. :)
> 
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> ----- Original Message ----
> > From: Charles Wardell <charles.wardell@bcsolution.com>
> > To: solr-user@lucene.apache.org
> > Sent: Tue, April 26, 2011 8:01:28 PM
> > Subject: Re: Question on Batch process
> > 
> > Thank you Otis.
> > Without trying to appear too stupid, when you refer to having the params 
> > matching your # of CPU cores, you are talking about the # of threads I can 
> > spawn with the StreamingUpdateSolrServer object?
> > Up until now, I have been just utilizing post.sh or post.jar. Are these 
> > capable of that or do I need to write some code to collect a bunch of files 
> > into the buffer and send it off?
> > 
> > Also, do you have a sense for how long it should take to index 100,000 files 
> > or, in my case, 100,000,000 documents?
> > StreamingUpdateSolrServer
> > public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int 
> > threadCount) throws MalformedURLException
> > 
> > Thanks again,
> > Charlie
> > 
> > -- 
> > Best Regards,
> > 
> > Charles Wardell
> > Blue Chips Technology, Inc.
> > www.bcsolution.com
> > 
> > On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote: 
> > > Charlie,
> > > 
> > > How's this:
> > > * -Xmx2g
> > > * ramBufferSizeMB 512
> > > * mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows)
> > > * ignore/delete maxBufferedDocs - not used if you use ramBufferSizeMB
> > > * use StreamingUpdateSolrServer (with params matching your number of CPU cores)
> > >   or send batches of say 1000 docs with the other SolrServer impl using N threads
> > >   (N=# of your CPU cores)
> > > 
> > > Otis
> > >  ----
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem search :: http://search-lucene.com/
> > > 
> > > 
> > > 
> > > ----- Original Message ----
> > > > From: Charles Wardell <charles.wardell@bcsolution.com>
> > > > To: solr-user@lucene.apache.org
> > > > Sent: Tue, April 26, 2011 2:32:29 PM
> > > > Subject: Question on Batch process
> > > > 
> > > > I am sure that this question has been asked a few times, but I can't seem to
> > > > find the sweet spot for indexing.
> > > > 
> > > > I have about 100,000 files each containing 1,000 xml documents ready to be
> > > > posted to Solr. My desire is to have it index as quickly as possible and then
> > > > once completed the daily stream of ADDs will be small in comparison.
> > > > 
> > > > The individual documents are small. Essentially web postings from the net.
> > > > Title, postPostContent, date.
> > > > 
> > > > 
> > > > What would be the ideal configuration for ramBufferSizeMB, mergeFactor,
> > > > maxBufferedDocs, etc.?
> > > > 
> > > > My machine is a hyper-threaded quad core, so it shows up as 8 CPUs in top.
> > > > I have 16GB of available RAM.
> > > > 
> > > > 
> > > > Thanks in advance.
> > > > Charlie
> 
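
The suggestion in the thread to "send batches of say 1000 docs ... using N threads (N=# of your CPU cores)" could be sketched roughly as follows. This is only an illustration under stated assumptions, not SolrJ code: BatchIndexer, partition, indexAll and the sender callback are hypothetical names, and sender stands in for whatever client call actually posts a batch to Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper: none of these names come from Solr or SolrJ.
public class BatchIndexer {

    /** Split docs into consecutive batches of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(i, end)));
        }
        return batches;
    }

    /**
     * Hand every batch to sender on a fixed pool of `threads` workers and
     * return the number of batches sent. sender is an assumed stand-in for
     * the real "add these documents" request.
     */
    public static <T> int indexAll(List<T> docs, int batchSize, int threads,
                                   Consumer<List<T>> sender) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                pending.add(pool.submit(() -> sender.accept(batch)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the batch; rethrows a worker failure
            }
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller might pass Runtime.getRuntime().availableProcessors() as the thread count to match the "N = number of CPU cores" advice, and 1000 as the batch size. Whether those values are the best choice depends on document size and analysis cost, as noted earlier in the thread.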
