lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <>
Subject Re: 30 milllion+ docs on a single server
Date Tue, 15 Aug 2006 16:51:15 GMT
Thanks for all of the useful info on this topic. You have been very 
enlightening. My RAM requirements where obviously off the mark. Here is 
my current understanding of this issue:

A standard 32-bit processor has access to 4GB of RAM. If your CPU 
supports Physical Address Extension (PAE) the OS can access up to 64GB 
of RAM, although a single application is still limited to 4GB. In 
Windows, Address Windowing Extensions (AWE) solves this limitation but 
you must use OS specific calls to manage your memory. Unix/Linux has 
it's own memory mapping scheme that accomplishes something similar to AWE.

 I have no problem requiring that the server i use be maxed to the 
gills. I just want to support Windows as well as Unix/Linux/Sun. I was 
shortsighted in my RAM comments.

As far as Solr: it looks like this will not necessarily solve my problem 
of handling an index so big on a single server--the entire index is 
replicated across each slave searcher--but if load is a factor 
(according to Hoss, load will probably be the big issue, not the size of 
the index) than this distribution is exactly what I need. I think that 
the average load I can expect will be pretty low though. Updates to the 
large index will also be relatively rare. 30-200 items added every 
morning with the occasional update scattered throughout the day. There 
will only be a few hundred searchers in total and they will not be 
hammering the system but consulting it here and there throughout the day.

When I said that Solr required hardlinks I was referring to the 
replication feature. My fault--replication was the benefit I was seeing 
from solar and I was unclear. I bet solr's caching helps a lot too and I 
will be looking into that. What I did not know was that Windows supports 
hard links when using ntfs. Fsutil.exe creates them. This gives me hope 
that cygwin might support them. I do not know if  I can use the same cp 
-l -r trick, but it at least it looks hopeful. I did not realize that 
solr's replication feature was so external and easily changeable.

as far as rmi: splitting the index across several boxes seems relatively 
easy (the index can be naturally partitioned without much trouble) but I 
would still have to deal with sorting on a single box.  I am not so 
worried about that anymore though. Warming a new searcher and tons of 
RAM should handle my sorts in an acceptable way.

It seems to me that one way or another I can make this happen.

I am  wondering whether  I should try and use a RAM directory. While I 
am sure it might be faster, what happens when the power goes out? System 
reboots? I would have to occasionally flush to disk anyway. Would it be 
better to use a normal directory and turn the maxbuffered docs way up?


- Mark

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message