lucene-solr-user mailing list archives

From: Ryan McKinley <ryan...@gmail.com>
Subject: Re: Solr feasibility with terabyte-scale data
Date: Sat, 19 Jan 2008 19:09:07 GMT
> 
> We are considering Solr 1.2 to index and search a terabyte-scale dataset
> of OCR.  Initially our requirements are simple: basic tokenizing, score
> sorting only, no faceting.  The schema is simple too.  A document
> consists of a numeric id (stored and indexed) and a large text field
> (indexed, not stored) containing the OCR, typically ~1.4 MB.  Some limited
> faceting or additional metadata fields may be added later.
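
(As an aside, the schema you describe really is tiny -- in schema.xml it
boils down to something like the following, where the field and type
names are just placeholders:

  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="ocr_text" type="text" indexed="true" stored="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>

so the schema itself is the easy part.)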

I have not done anything on this scale... but with
https://issues.apache.org/jira/browse/SOLR-303 it will be possible to
split a large index into many smaller indices and return the union of
all results.  This may or may not be necessary depending on what the
data actually looks like (if your text only uses 100 distinct words,
your index may not be that big).
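
With that patch, querying the split index would be a single request that
just lists the shards to merge -- roughly something like this (the
parameter syntax is still being settled on the issue, and "ocr_text" is
just the placeholder field name from above):

  http://localhost:8983/solr/select?q=ocr_text:railroad
      &shards=host1:8983/solr,host2:8983/solr,host3:8983/solr

Each shard is an ordinary Solr index; the node receiving the request
queries all of them and merges the top hits by score.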

How many documents are you talking about?
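
(If it really is a full terabyte of ~1.4 MB documents, that is on the
order of 700,000 docs -- 1,000,000 MB / 1.4 MB per doc.)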

> 
> Should we expect Solr indexing time to slow significantly as we scale 
> up?  What kind of query performance could we expect?  Is it totally 
> naive even to consider Solr at this kind of scale?
> 

You may want to check out the Lucene benchmark stuff:
http://lucene.apache.org/java/docs/benchmarks.html

http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/benchmark/byTask/package-summary.html
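
The byTask stuff is driven by small .alg scripts, so an index-then-search
run is only a handful of lines.  As a rough, untested sketch modeled on
the micro-standard example that ships with contrib/benchmark (property
names and the doc maker may differ a bit between versions, and you would
swap in a doc maker that feeds it your OCR files instead of Reuters):

  analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
  directory=FSDirectory
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
  query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker

  ResetSystemErase
  CreateIndex
  { AddDoc } : 10000
  Optimize
  CloseIndex
  OpenReader
  { Search } : 500
  CloseReader
  RepSumByName

That at least gives you hard numbers for raw Lucene indexing and query
throughput on your own text before layering Solr on top.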


ryan


