lucene-solr-user mailing list archives

From Susheel Kumar <susheel.ku...@thedigitalgroup.net>
Subject RE: Solr server requirements for 100+ million documents
Date Sun, 26 Jan 2014 05:06:01 GMT
Hi Kranti,

Attached are the solrconfig.xml and schema.xml files for review. I ran the indexing again with
just a few fields (5-6) in schema.xml, keeping the same db config, but indexing still takes
about the same time (on average 1 million records per hour), which confirms that the bottleneck
is in the data acquisition, which in our case is the Oracle database. I am thinking of not using
DataImportHandler / JDBC to get data from Oracle, but rather dumping the data from Oracle somehow
using SQL loader and then indexing it. Any thoughts?

Thanks
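
A minimal sketch of the "dump first, then index" idea, assuming the Oracle data has already been exported to a CSV file with a header row. The endpoint URL, the core name `collection1`, and the helper names are placeholders for illustration, not taken from the attached configuration. It batches rows and posts each batch as a JSON array to the update handler. Every CSV value arrives as a string, so multivalued fields would need extra handling, and a commit still has to be issued separately.

```python
import csv
import json
import urllib.request
from itertools import islice

# Assumed endpoint: host, port, and core name are placeholders.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def read_batches(lines, batch_size=1000):
    """Yield lists of row dicts from CSV lines, at most batch_size rows each."""
    reader = csv.DictReader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def build_request(docs, url=SOLR_UPDATE_URL):
    """Build (without sending) a JSON update request for one batch."""
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(request):
    """Send one request and drain the response."""
    with urllib.request.urlopen(request) as response:
        response.read()


def index_dump(lines, send=post, batch_size=1000):
    """Send every batch from the dump; return the number of rows sent."""
    total = 0
    for batch in read_batches(lines, batch_size):
        send(build_request(batch))
        total += len(batch)
    return total
```

With a dump on disk this could be driven as `index_dump(open("dump.csv", newline=""))`, where `dump.csv` is a hypothetical file name. Solr also ships a CSV update handler, which may make the JSON conversion unnecessary.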

-----Original Message-----
From: Kranti Parisa [mailto:kranti.parisa@gmail.com] 
Sent: Saturday, January 25, 2014 12:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

Can you post the complete solrconfig.xml and schema.xml files so we can review all of the
settings that would impact your indexing performance?

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar < susheel.kumar@thedigitalgroup.net>
wrote:

> Thanks, Svante. Your indexing speed using the db seems to be really fast. Can 
> you please provide some more detail on how you are indexing db 
> records? Is it through DataImportHandler? And what database? Is that a 
> local db?  We are indexing around 70 fields (60 multivalued) but data 
> is not always populated in all fields. The average document size is 5-10 KB.
>
> -----Original Message-----
> From: saka.csi.se@gmail.com [mailto:saka.csi.se@gmail.com] On Behalf 
> Of svante karlsson
> Sent: Friday, January 24, 2014 5:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr server requirements for 100+ million documents
>
> I just indexed 100 million db docs (records) with 22 fields (4
> multivalued) in 9524 sec using libcurl.
> 11 million took 763 seconds so the speed drops somewhat with 
> increasing dbsize.
>
> We write 1000 docs (just an arbitrary number) in each request from two 
> threads. If you will be using SolrCloud you will want more writer threads.
>
> The hardware is a single cheap HP DL320E GEN8 V2 1P E3-1220V3 with one 
> SSD and 32GB and the Solr runs on Ubuntu 13.10 inside an ESXi virtual machine.
>
> /svante
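
For scale, the figures above work out to roughly 10,500 docs/sec for the 100 million run (100,000,000 / 9524) and roughly 14,400 docs/sec for the 11 million run (11,000,000 / 763). The batching scheme described, 1000 docs per request from two writer threads, could be sketched as below. This is an illustration in Python, not the libcurl code used in the message. `send_batch` is a hypothetical callable that posts one batch to Solr.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(docs, size=1000):
    """Split a list of documents into consecutive chunks of at most `size`."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]


def index_parallel(docs, send_batch, batch_size=1000, writers=2):
    """Send batches through `writers` threads; return the number of docs sent."""

    def send_and_count(batch):
        send_batch(batch)  # hypothetical: one HTTP update request per batch
        return len(batch)

    with ThreadPoolExecutor(max_workers=writers) as pool:
        return sum(pool.map(send_and_count, chunked(docs, batch_size)))
```

Raising `writers` is the knob the message refers to when it says SolrCloud will want more writer threads.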
>
>
>
>
> 2014/1/24 Susheel Kumar <susheel.kumar@thedigitalgroup.net>
>
> > Thanks, Erick, for the info.
> >
> > For indexing, I agree that most of the time is consumed in data acquisition, 
> > which in our case is from the database.  For indexing we are currently 
> > using a manual process, i.e. the Solr dashboard Data Import, but are now 
> > looking to automate.  How do you suggest automating the indexing part? 
> > Do you recommend using SolrJ, or should we try to automate using curl?
> >
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerickson@gmail.com]
> > Sent: Friday, January 24, 2014 2:59 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr server requirements for 100+ million documents
> >
> > Can't be done with the information you provided, and can only be 
> > guessed at even with more comprehensive information.
> >
> > Here's why:
> >
> >
> > http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >
> > Also, at a guess, your indexing speed is so slow due to data 
> > acquisition; I rather doubt you're being limited by raw Solr indexing.
> > If you're using SolrJ, try commenting out the
> > server.add() bit and running again. My guess is that your indexing 
> > speed will be almost unchanged, in which case the data 
> > acquisition process is where you should concentrate efforts. As a 
> > comparison, I can index 11M Wikipedia docs on my laptop in 45 
> > minutes without any attempts at parallelization.
> >
> >
> > Best,
> > Erick
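
The diagnostic suggested above (run the pipeline once with the add call disabled and compare timings) could be sketched as follows. This is a Python stand-in rather than SolrJ. `fetch_rows` and `index_doc` are hypothetical callables for the database read and the Solr add respectively.

```python
import time


def run_pipeline(fetch_rows, index_doc=None):
    """Pull every row; index it only if index_doc is given.

    Returns (row count, elapsed seconds).
    """
    start = time.perf_counter()
    count = 0
    for row in fetch_rows():
        if index_doc is not None:
            index_doc(row)
        count += 1
    return count, time.perf_counter() - start


def acquisition_share(fetch_rows, index_doc):
    """Fraction of the full run's wall time spent in the fetch-only run."""
    _, fetch_only = run_pipeline(fetch_rows)
    _, full = run_pipeline(fetch_rows, index_doc)
    return fetch_only / full if full > 0 else 0.0
```

A share close to 1.0 would point at data acquisition as the bottleneck. A much lower share would point at the indexing side.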
> >
> > On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar < 
> > susheel.kumar@thedigitalgroup.net> wrote:
> > > Hi,
> > >
> > > Currently we are indexing 10 million documents from a database (10 db
> > > data entities) & the index size is around 8 GB on a Windows virtual box.
> > > Indexing in one shot takes 12+ hours, while indexing in parallel in
> > > separate cores & merging them together takes 4+ hours.
> > >
> > > We are looking to scale to 100+ million documents and are looking for
> > > recommendations on server requirements for the parameters below, for a
> > > production environment. There can be 200+ users performing searches at
> > > the same time.
> > >
> > > No. of physical servers (considering SolrCloud)
> > > Memory requirement
> > > Processor requirement (# cores)
> > > Linux as OS as opposed to Windows
> > >
> > > Thanks in advance.
> > > Susheel
> > >
> >
>
