lucene-java-user mailing list archives

From Chris Hostetter <>
Subject Re: 30 milllion+ docs on a single server
Date Sun, 13 Aug 2006 02:04:10 GMT

: Frustrated is the word :) I have looked at Solr...what I am worried
: about there is this: Solr says it requires an OS that supports hard
: links. Currently Windows does not to my knowledge. Someone seemed to
: make a comment that Windows could be supported...from what I know I
: don't think so. Not a deal breaker per se but then there is this: I

Solr does not require hardlinks; what the FAQ says is...

  "The Replication features of Solr currently require an OS with the
  ability to create hard links and rsync."

...which means if you want to use the replication system provided with
Solr as-is, you need hardlinks and rsync.  Solr is designed with
replication as a very external portion of the system (it's just executing
shell calls specified in a config file), so it should be possible to plug in
a different replication system and use the existing hooks for generating
snapshots on the master and loading snapshots on the slave ... it just
hasn't been a priority.

: have done a lot with the lucene API. I have created a custom query
: language to lucene query parser. I have changed the standard parser. I
: have made heavy use of Multi-Searchers. I am really tied into the Lucene
: API. I am worried about how easy it will be to integrate that into Solr.

Anything you do at "query time" with the Lucene API can be done in
a SolrRequestHandler (which you write in Java and register in the solr
config file) -- change just a few method calls and you'll get a lot of
great caching features as well.

none of which really addresses the crux of your question....

: Can I index 30 million+ docs that range in size from 2-10kb on a single
: server in a Windows environment (access to a max of about 1.5 gig of
: RAM). The average search will need to be sorted by field not relevancy.

I can't say that I've personally built/used a Lucene index of 30 million
docs, but I have talked to people who have done it.  They certainly had
some performance issues, but those issues were mainly related to the
volume of queries they got, not so much the size of their index.  That
said: you are seriously hindering yourself with the Windows/RAM limits.
The FieldCache for your sort field (assuming it's an int) alone will be
over 100MB, not to mention the fact that your index isn't static, so
creating a new searcher after you've made updates essentially halves the
amount of usable RAM you have to work with, unless you're willing to
close the old searcher before you open the new one.
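To put a rough number on that FieldCache estimate (a back-of-the-envelope sketch, not actual Lucene code): the FieldCache keeps one entry per document, so an int sort field over 30 million docs costs about 4 bytes per doc, and both caches coexist while the old and new searchers are open at once.

```java
public class FieldCacheEstimate {
    /** Bytes needed for a FieldCache of int values, one per document. */
    public static long estimateBytes(long numDocs, long bytesPerEntry) {
        return numDocs * bytesPerEntry;
    }

    public static void main(String[] args) {
        long cache = estimateBytes(30_000_000L, 4L);   // int sort field
        System.out.println("FieldCache: ~" + cache / (1024 * 1024) + " MB");
        // During a searcher swap, the old and new caches both exist:
        System.out.println("While swapping searchers: ~" + (2 * cache) / (1024 * 1024) + " MB");
    }
}
```

That's ~114 MB for the cache alone, and roughly double that during a warm-up/swap -- a large bite out of a 1.5 GB ceiling before the index itself uses any memory.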

I haven't played with Remote/Multi Searchers, but perhaps you should
open yourself up to the possibility of partitioning your index across
several boxes.
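The idea behind partitioning (a simplified sketch in plain Java, not the actual MultiSearcher code): each box searches its own slice of the index and returns its top hits already sorted by the field, and the front end then merges the per-partition lists, which is cheap compared to the searches themselves.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class PartitionMerge {
    /**
     * Merge several per-partition hit lists (each already sorted ascending
     * by the sort field value) into one global top-n list.
     */
    public static List<Integer> mergeTopN(List<List<Integer>> partitions, int n) {
        // Heap entries: {sortValue, partitionIndex, offsetWithinPartition}
        PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
        for (int p = 0; p < partitions.size(); p++) {
            if (!partitions.get(p).isEmpty()) {
                heap.add(new int[] { partitions.get(p).get(0), p, 0 });
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (merged.size() < n && !heap.isEmpty()) {
            int[] head = heap.poll();           // smallest remaining sort value
            merged.add(head[0]);
            int next = head[2] + 1;             // advance within that partition
            List<Integer> part = partitions.get(head[1]);
            if (next < part.size()) {
                heap.add(new int[] { part.get(next), head[1], next });
            }
        }
        return merged;
    }
}
```

Each partition only has to hold its own slice of the FieldCache, so the per-box RAM requirement drops proportionally -- which is the main appeal given the 1.5 GB constraint above.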

