lucene-solr-user mailing list archives

From Alok Dhir <ad...@symplicity.com>
Subject sizing/sanity check for huge(?) dataset
Date Fri, 28 Dec 2007 20:36:50 GMT
Hey all -- first, thanks to the solr & lucene teams for fantastic  
products.  So far we're very pleased with the results we're seeing  
from them.  We're looking at it as the primary search solution for a  
rather large dataset.  Hoping for comments and a sanity check from  
people who "know".

We're looking at deploying solr to search around 100M docs, totalling  
around 165G of space.  Would this be considered "huge"?  It seems so  
given the posts I've read on the list.  In any case...

Schema currently looks as follows (using type definitions from  
"example" schema.xml):

<field name="instance" type="string" indexed="true" stored="true"  
required="true" />
<field name="instance_id" type="string" indexed="true" stored="true"  
required="true" />
<field name="id" type="string" indexed="true" stored="true"  
required="true" />
<field name="label" type="text" indexed="true" stored="true"  
required="true" />
<field name="textbody" type="text" indexed="true" stored="true" />
<field name="domain" type="string" indexed="true" stored="true"  
multiValued="true" omitNorms="true"/>
<field name="subdomain" type="string" indexed="true" stored="true"  
multiValued="true" omitNorms="true"/>
<field name="category" type="string" indexed="true" stored="true"  
multiValued="true" omitNorms="true"/>
<field name="dt" type="date" indexed="true" stored="true"  
multiValued="false"/>
<field name="timestamp" type="date" indexed="true" stored="true"  
default="NOW" multiValued="false"/>
<field name="class" type="string" multiValued="true" indexed="true" />
<field name="class_id" type="string" multiValued="true"  
indexed="true" />
<field name="tags" type="string" indexed="true" stored="true"  
multiValued="true"/>
<field name="level" type="sint" indexed="true" stored="true"  
default="0"/>
<field name="user" type="string" indexed="true" stored="true"/>

Note, this is a completely unoptimized schema -- knocked out quickly  
for a proof of concept.  Many fields here will be used for faceting.

The only field in this schema which will be more than a line or so  
(call it 80 bytes) is the 'textbody' field which could be up to a few  
KB - call it an average of 1K or less.  All "id/class" fields will be  
32B or less.

The 100M docs @ 165GB is a projection from having indexed 1/500th of  
the intended dataset.  It will not vary by more than 25% and will not  
grow over time (we will be removing entries older than X days as part  
of ongoing maintenance).
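
For anyone who wants to check the arithmetic behind that projection, here is a rough sketch in Python.  It only restates the figures given above (100M docs, 165GB, a 1/500th sample, +/- 25%).  The function name and the reading of "GB" as decimal gigabytes (10^9 bytes) are assumptions made for illustration, not anything measured.

```python
# Back-of-envelope check of the projection described in the message.
# Assumption: "GB" means 10**9 bytes; the helper name is hypothetical.

PROJECTED_DOCS = 100_000_000   # projected document count
PROJECTED_GB = 165             # projected total size in GB
SAMPLE_DIVISOR = 500           # the sample indexed so far is 1/500th
VARIANCE = 0.25                # stated upper bound on variation


def projection_summary(docs=PROJECTED_DOCS, size_gb=PROJECTED_GB,
                       divisor=SAMPLE_DIVISOR, variance=VARIANCE):
    """Derive sample size, per-document size, and the +/- variance range."""
    return {
        # What the 1/500th sample implies was actually indexed.
        "sample_docs": docs // divisor,
        "sample_gb": size_gb / divisor,
        # Average footprint per document across the whole projection.
        "bytes_per_doc": size_gb * 10**9 / docs,
        # Range of total size if the projection is off by up to 25%.
        "low_gb": size_gb * (1 - variance),
        "high_gb": size_gb * (1 + variance),
    }


if __name__ == "__main__":
    for name, value in projection_summary().items():
        print(f"{name}: {value}")
```

Under those assumptions the sample works out to about 200K docs in roughly a third of a GB, the average is about 1.65KB per doc, and the total should land between roughly 124GB and 206GB.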

The servers we're spec'ing are 8-core machines with 8GB of RAM, with a  
SAN for storage.  The servers will be load balanced for performance and  
availability (i.e. if one box is dead, searches don't stop -- they  
just slow down a bit).  Indexing will occur incrementally, as  
transactions occur in a related set of applications.  There will  
rarely be a need for a focused "indexing" process after the initial  
app rollout.

Thanks for any comments or suggestions.

Al

