lucene-solr-user mailing list archives

From svante karlsson <s...@csi.se>
Subject Re: Solr server requirements for 100+ million documents
Date Sat, 25 Jan 2014 11:51:28 GMT
That got away a little early...

The inserter is a small C++ program that uses pglib to speak to postgres
and an HTTP client library that uses libcurl under the hood. The
inserter draws very little CPU, and we normally use 2 writer threads that
each post 1000 records at a time. It's very inefficient to post one at a
time, but I've not done any specific testing to know if 1000 is better than
500....
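
A minimal sketch of the batching scheme described above, written in Python rather than C++ to keep it short. It is not the actual inserter: the update URL, the JSON document format, and the function names `batches`, `post_batch` and `index_all` are assumptions made for illustration. Only the batch size of 1000 and the two writer threads come from the description.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from urllib import request

BATCH_SIZE = 1000     # from the description above; not a tuned value
WRITER_THREADS = 2    # from the description above


def batches(records, size=BATCH_SIZE):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def post_batch(update_url, docs):
    """POST one batch as a JSON array of documents (assumed format)."""
    req = request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status


def index_all(update_url, records, post=post_batch):
    """Send all records in batches from a small pool of writer threads.

    Note: pool.map submits every batch up front, so this sketch holds all
    batches in memory; a real loader for 100M rows would stream instead.
    """
    with ThreadPoolExecutor(max_workers=WRITER_THREADS) as pool:
        return list(pool.map(lambda b: post(update_url, b), batches(records)))
```

Passing a different `post` callable makes it easy to time the data source alone, without Solr in the loop, and changing `BATCH_SIZE` is all it takes to compare 500 against 1000.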

What we're doing now is trying to figure out how to get the query
performance up, since it is not where we need it to be, so we're not done
either...


2014/1/25 svante karlsson <saka@csi.se>

> We are using a postgres server on a different host (same hardware as the
> test solr server). The reason we take the data from the postgres server is
> that it is easy to automate testing, since we use the same server to produce
> queries. In production we preload solr from a csv file from a hive
> (hadoop) job and then only write updates ( < 500 / sec ). In our use case we
> use solr as a NoSQL database since we really want to do SHOULD queries against
> all the fields. The fields are typically very small text fields (<30 chars),
> occasionally bigger, but I don't think I have more than 128 chars on
> anything in the whole dataset.
>
> <?xml version="1.0" encoding="UTF-8" ?>
> <schema name="example" version="1.1">
>   <types>
>   <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
>   <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> omitNorms="true"/>
>    <fieldType name="boolean" class="solr.BoolField"
> sortMissingLast="true"/>
>    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
> positionIncrementGap="0"/>
>    <fieldType name="int" class="solr.TrieIntField" precisionStep="0"
> positionIncrementGap="0"/>
>    <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
> positionIncrementGap="0"/>
>    </types>
> <fields>
> <field name="_version_" type="long" indexed="true" stored="true"
> multiValued="false"/>
> <field name="id" type="string" indexed="true" stored="true"
> required="true" multiValued="false" />
> <field name="name" type="int" indexed="true" stored="true"/>
> <field name="fieldA" type="string" indexed="true" stored="true"/>
> <field name="fieldB" type="string" indexed="true" stored="true"/>
> <field name="fieldC" type="int" indexed="true" stored="true"/>
> <field name="fieldD" type="int" indexed="true" stored="true"/>
> <field name="fieldE" type="int" indexed="true" stored="true"/>
> <field name="fieldF" type="string" indexed="true" stored="true"
> multiValued="true"/>
> <field name="fieldG" type="string" indexed="true" stored="true"
> multiValued="true"/>
> <field name="fieldH" type="string" indexed="true" stored="true"
> multiValued="true"/>
> <field name="fieldI" type="string" indexed="true" stored="true"
> multiValued="true"/>
> <field name="fieldJ" type="string" indexed="true" stored="true"
> multiValued="true"/>
> <field name="fieldK" type="string" indexed="true" stored="true"
> multiValued="true"/>
> <field name="fieldL" type="string" indexed="true" stored="true"/>
> <field name="fieldM" type="string" indexed="true" stored="true"
> multiValued="true"/>
> <field name="fieldN" type="string" indexed="true" stored="true"/>
>
> <field name="fieldO" type="string" indexed="false" stored="true"
> required="false" />
> <field name="ts"  type="long" indexed="true" stored="true"/>
> </fields>
> <uniqueKey>id</uniqueKey>
> <solrQueryParser defaultOperator="OR"/>
> </schema>
>
>
>
>
>
> 2014/1/25 Kranti Parisa <kranti.parisa@gmail.com>
>
>> can you post the complete solrconfig.xml file and schema.xml files to
>> review all of your settings that would impact your indexing performance.
>>
>> Thanks,
>> Kranti K. Parisa
>> http://www.linkedin.com/in/krantiparisa
>>
>>
>>
>> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
>> susheel.kumar@thedigitalgroup.net> wrote:
>>
>> > Thanks, Svante. Your indexing speed using db seems to be really fast. Can
>> > you please provide some more detail on how you are indexing db records? Is
>> > it thru DataImportHandler? And what database? Is that a local db? We are
>> > indexing around 70 fields (60 multivalued), but data is not always
>> > populated in all fields. The average size of a document is 5-10 KB.
>> >
>> > -----Original Message-----
>> > From: saka.csi.se@gmail.com [mailto:saka.csi.se@gmail.com] On Behalf Of
>> > svante karlsson
>> > Sent: Friday, January 24, 2014 5:05 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Solr server requirements for 100+ million documents
>> >
>> > I just indexed 100 million db docs (records) with 22 fields (4
>> > multivalued) in 9524 sec using libcurl.
>> > 11 million took 763 seconds so the speed drops somewhat with increasing
>> > dbsize.
>> >
>> > We write 1000 docs (just an arbitrary number) in each request from two
>> > threads. If you will be using solrcloud you will want more writer
>> > threads.
>> >
>> > The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one
>> > SSD and 32GB, and the solr runs on ubuntu 13.10 inside an esxi virtual
>> > machine.
>> >
>> > /svante
>> >
>> >
>> >
>> >
>> > 2014/1/24 Susheel Kumar <susheel.kumar@thedigitalgroup.net>
>> >
>> > > Thanks, Erick for the info.
>> > >
>> > > For indexing, I agree that most of the time is consumed in data
>> > > acquisition, which in our case is from the database. We currently index
>> > > using the manual process, i.e. Solr dashboard Data Import, but are now
>> > > looking to automate. How do you suggest automating the indexing part? Do
>> > > you recommend using SolrJ, or should we try to automate using Curl?
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Erick Erickson [mailto:erickerickson@gmail.com]
>> > > Sent: Friday, January 24, 2014 2:59 PM
>> > > To: solr-user@lucene.apache.org
>> > > Subject: Re: Solr server requirements for 100+ million documents
>> > >
>> > > Can't be done with the information you provided, and can only be
>> > > guessed at even with more comprehensive information.
>> > >
>> > > Here's why:
>> > >
>> > >
>> > >
>> > > http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>> > >
>> > > Also, at a guess, your indexing speed is so slow due to data
>> > > acquisition; I rather doubt you're being limited by raw Solr indexing.
>> > > If you're using SolrJ, try commenting out the
>> > > server.add() bit and running again. My guess is that your indexing
>> > > speed will be almost unchanged, in which case the data
>> > > acquisition process is where you should concentrate efforts. As a
>> > > comparison, I can index 11M Wikipedia docs on my laptop in 45 minutes
>> > > without any attempts at parallelization.
>> > >
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > > On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar <
>> > > susheel.kumar@thedigitalgroup.net> wrote:
>> > > > Hi,
>> > > >
>> > > > Currently we are indexing 10 million documents from a database (10 db
>> > > > data entities) & the index size is around 8 GB on a windows virtual box.
>> > > > Indexing in one shot takes 12+ hours, while indexing in parallel in
>> > > > separate cores & merging them together takes 4+ hours.
>> > > >
>> > > > We are looking to scale to 100+ million documents and are looking for
>> > > > recommendations on server requirements on the parameters below for a
>> > > > Production environment. There can be 200+ users performing searches at
>> > > > the same time.
>> > > >
>> > > > No of physical servers (considering solr cloud)
>> > > > Memory requirement
>> > > > Processor requirement (# cores)
>> > > > Linux as OS as opposed to windows
>> > > > Thanks in advance.
>> > > > Susheel
>> > > >
>> > >
>> >
>>
>
>
