lucene-solr-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: sizing/sanity check for huge(?) dataset
Date Sat, 29 Dec 2007 01:30:33 GMT
Hi Al,

165GB of disk space and 100M docs is big, but not impossibly huge.

The only thing that looks a bit worrisome is the index size : RAM ratio (about 165GB of index against 8GB of RAM per server).
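
To put rough numbers on that ratio, here is a minimal sketch using only the figures quoted in this thread (100M docs, 165GB index, 8 gig servers, 25% variation). The function names are made up for illustration, and decimal units (1 GB = 10**9 bytes) are an assumption, since the thread does not say which it means.

```python
# Back-of-the-envelope sizing from the figures quoted in this thread.
# Assumption: decimal units (1 GB = 10**9 bytes).
GB = 10**9


def bytes_per_doc(index_bytes, num_docs):
    """Average on-disk index bytes per document."""
    return index_bytes / num_docs


def index_to_ram_ratio(index_bytes, ram_bytes):
    """How many times larger the index is than one server's RAM."""
    return index_bytes / ram_bytes


def size_range_gb(index_bytes, variation):
    """Projected index size in GB at minus/plus the stated variation."""
    return ((1 - variation) * index_bytes / GB,
            (1 + variation) * index_bytes / GB)


index_bytes = 165 * GB      # projected index size
num_docs = 100_000_000      # projected document count
ram_bytes = 8 * GB          # RAM in one of the spec'd servers

print(bytes_per_doc(index_bytes, num_docs))        # 1650.0
print(index_to_ram_ratio(index_bytes, ram_bytes))  # 20.625
print(size_range_gb(index_bytes, 0.25))            # (123.75, 206.25)
```

So the index is roughly 20x the RAM of a single box. That ratio treats all 8GB as available for caching index data, which it will not be once the JVM heap is accounted for.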

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Alok Dhir <adhir@symplicity.com>
To: solr-user@lucene.apache.org
Sent: Friday, December 28, 2007 3:40:22 PM
Subject: Re: sizing/sanity check for huge(?) dataset

Missed one bit of data:  This dataset will be searched less than 500  
times per day.  The goal is to get results in a reasonable amount of  
time (<3s), but the queries coming in per minute will likely max out  
around 10.
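
As a rough illustration of what that load means, the sketch below applies Little's law (average requests in flight = arrival rate x time in system) to the numbers above. It assumes every query takes the full 3s budget, which is a worst case; the function name is hypothetical.

```python
# Rough query-load estimate from the figures above (500 searches/day,
# peak of about 10 per minute, 3 s latency target).

def avg_concurrent_queries(queries_per_minute, latency_seconds):
    """Little's law: mean queries in flight = arrival rate * latency."""
    return queries_per_minute * latency_seconds / 60.0


def avg_queries_per_minute(queries_per_day):
    """Daily volume spread evenly over 24 hours."""
    return queries_per_day / (24 * 60)


peak_in_flight = avg_concurrent_queries(10, 3)
print(peak_in_flight)                         # 0.5
print(round(avg_queries_per_minute(500), 3))  # 0.347
```

Even at peak, fewer than one query is in flight on average, so the concern here is the latency of a single query against a large index rather than throughput.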

On Dec 28, 2007, at 3:36 PM, Alok Dhir wrote:

> Hey all -- first, thanks to the solr & lucene teams for fantastic  
> products.  So far we're very pleased with the results we're seeing  
> from them.  We're looking at it as the primary search solution for a
> rather large dataset.  Hoping for comments or a sanity check from
> people who "know".
>
> Looking at deploying solr to search around 100M docs, totalling
> around 165G of space.  Would this be considered "huge"?  It seems so
> given the posts I've read on the list.  In any case...
>
> Schema currently looks as follows (using type definitions from  
> "example" schema.xml):
>
> <field name="instance" type="string" indexed="true" stored="true" required="true" />
> <field name="instance_id" type="string" indexed="true" stored="true" required="true" />
> <field name="id" type="string" indexed="true" stored="true" required="true" />
> <field name="label" type="text" indexed="true" stored="true" required="true" />
> <field name="textbody" type="text" indexed="true" stored="true" />
> <field name="domain" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
> <field name="subdomain" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
> <field name="category" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
> <field name="dt" type="date" indexed="true" stored="true" multiValued="false"/>
> <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
> <field name="class" type="string" multiValued="true" indexed="true" />
> <field name="class_id" type="string" multiValued="true" indexed="true" />
> <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
> <field name="level" type="sint" indexed="true" stored="true" default="0"/>
> <field name="user" type="string" indexed="true" stored="true"/>
>
> Note, this is a completely unoptimized schema -- knocked out quick  
> for a proof of concept.  Many fields here will be used for faceting.
>
> The only field in this schema which will be more than a line or so  
> (call it 80 bytes) is the 'textbody' field which could be up to a  
> few KB - call it an average of 1K or less.  All "id/class" fields  
> will be 32B or less.
>
> The 100M docs @ 165GB is a projection from having indexed 1/500th of
> the intended dataset.  It will not vary by more than 25% and will
> not grow over time (we will be removing entries older than X days as
> part of ongoing maintenance).
>
> The servers we're spec'ing are 8 core, 8 gig machines, with a SAN  
> for storage.  The servers will be load balanced for performance and  
> availability (i.e. if one box is dead, searches don't stop -- they  
> just slow down a bit).  Indexing will occur incrementally, as  
> transactions occur in a related set of applications.  There will  
> rarely be a need for a focused "indexing" process after the initial  
> app rollout.
>
> Thanks for any comments or suggestions.
>
> Al
>
