lucene-java-user mailing list archives

From Itamar Syn-Hershko <>
Subject Re: Index size and performance degradation
Date Sun, 12 Jun 2011 18:25:54 GMT

Speaking of NRT, and completely off-topic, I know: Lucene's NRT 
apparently isn't fast enough if Zoie was needed, and now that Zoie is 
around, are there any plans to make it Lucene's default? Or: why would 
one still use NRT when Zoie seems to work much better?


On 12/06/2011 13:16, Michael McCandless wrote:

> Remember that memory-mapping is not a panacea: at the end of the day,
> if there just isn't enough RAM on the machine to keep your full
> "working set" hot, then the OS will have to hit the disk, regardless
> of whether the access is through MMap or a "traditional" IO request.
> That said, on Fedora Linux anyway, I generally see better performance
> from MMap than from NIOFSDir; e.g., see the 2nd chart here:
> Mike McCandless
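[Editor's note: Mike's point about memory-mapping not being a panacea can be illustrated with plain JDK I/O, without Lucene itself. The sketch below (class name and structure are the editor's own, not Lucene's MMapDirectory/NIOFSDirectory) reads the same file through a traditional stream and through an mmap'ed buffer: both yield the same bytes, and in both cases a page that is not resident in RAM still has to come from disk.]

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class MmapVsRead {

    /** Writes a scratch file, reads it back via both paths, and reports whether the bytes match. */
    public static boolean sameBytes() throws IOException {
        File f = File.createTempFile("demo", ".bin");
        f.deleteOnExit();
        byte[] data = new byte[1 << 20]; // 1 MiB of a repeating pattern
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(data);
        }

        // Traditional IO: each read() is an explicit syscall that copies bytes
        // from the kernel's page cache into our own buffer.
        byte[] viaStream = new byte[data.length];
        try (FileInputStream in = new FileInputStream(f)) {
            int off = 0, n;
            while (off < viaStream.length
                    && (n = in.read(viaStream, off, viaStream.length - off)) > 0) {
                off += n;
            }
        }

        // Memory-mapped IO: the file's pages are mapped into our address space,
        // so reads become plain memory accesses -- but touching a page that is
        // not resident in RAM still page-faults out to disk, which is the
        // "working set" caveat above.
        byte[] viaMmap = new byte[data.length];
        try (FileChannel ch = FileChannel.open(f.toPath(), StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.get(viaMmap);
        }

        return Arrays.equals(viaStream, viaMmap);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sameBytes());
    }
}
```

Either path returns identical data; mmap mainly saves the per-read syscall and buffer copy when the working set is already hot in the page cache.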
> On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko<>  wrote:
>> Thanks.
>> The whole point of my question was to find out if and how to do balancing
>> on the SAME machine. Apparently that's not going to help, and at a certain
>> point we will just have to prompt the user to buy more hardware...
>> Out of curiosity, isn't there anything we can do to avoid that? For
>> instance, using memory-mapped files for the indexes? Anything that would
>> help us overcome OS limitations of that sort...
>> Also, you mention a scheduled job to check for performance degradation; any
>> idea how serious such a drop should be for sharding to be really beneficial?
>> Or is it application-specific too?
>> Itamar.
>> On 12/06/2011 06:43, Shai Erera wrote:
>>> I agree w/ Erick, there is no cutoff point (index size for that matter)
>>> above which you start sharding.
>>> What you can do is create a scheduled job in your system that runs a
>>> select list of queries and monitors their performance. Once it degrades,
>>> it shards the index by either splitting it (you can use IndexSplitter
>>> under contrib) or creating a new shard and directing new documents to it.
>>> I think I read somewhere, not sure if it was in the Solr or ElasticSearch
>>> documentation, about a Balancer object, which moves shards around in order
>>> to balance the load on the cluster. You can implement something similar
>>> which tries to balance the index sizes, creates new shards on-the-fly, and
>>> even merges shards if suddenly a whole source is removed from the system,
>>> etc.
>>> Also, note that the 'largest index size' threshold is really a machine
>>> constraint and not Lucene's. So if you decide that 10 GB is your cutoff,
>>> it is pointless to create 10x10GB shards on the same machine -- searching
>>> them is just like searching a 100GB index w/ 10x10GB segments. Perhaps
>>> it's even worse, because you consume more RAM when the indexes are split
>>> (e.g., terms index, field infos, etc.).
>>> Shai
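[Editor's note: Shai's scheduled-job idea can be sketched in plain Java. Everything below is hypothetical and illustrative: the class and method names are the editor's, and each `LongSupplier` stands in for "run one canary query against a real IndexSearcher and return its latency in nanoseconds". The first run records a baseline median; later runs signal that sharding (e.g., via IndexSplitter) looks warranted once the median latency exceeds the baseline by a chosen factor.]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of a degradation-monitoring job: periodically run a
 * fixed set of "canary" queries, compare their median latency against a
 * recorded baseline, and recommend sharding once the slowdown crosses a
 * threshold. Timing a real query is left to the caller as a LongSupplier.
 */
public class DegradationMonitor {
    private final double slowdownThreshold; // e.g. 3.0 = three times slower than baseline
    private long baselineMedianNanos = -1;  // -1 means "no baseline recorded yet"

    public DegradationMonitor(double slowdownThreshold) {
        this.slowdownThreshold = slowdownThreshold;
    }

    /** Runs every canary query; returns true if sharding looks warranted. */
    public boolean shouldShard(List<LongSupplier> timedQueries) {
        List<Long> latencies = new ArrayList<>();
        for (LongSupplier q : timedQueries) {
            latencies.add(q.getAsLong()); // each supplier returns one query's latency in ns
        }
        Collections.sort(latencies);
        long median = latencies.get(latencies.size() / 2);
        if (baselineMedianNanos < 0) { // first run only establishes the baseline
            baselineMedianNanos = median;
            return false;
        }
        return median > baselineMedianNanos * slowdownThreshold;
    }
}
```

The median (rather than the mean) keeps one outlier query from triggering a shard split; the threshold factor is exactly the "how serious a drop" knob Itamar asks about above, and as Shai notes, it is application-specific.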
>>> On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson <> wrote:
>>>> <<<We can't assume anything about the machine running it,
>>>> so testing won't really tell us much>>>
>>>> Hmmm, then it's pretty hopeless I think. Problem is that
>>>> anything you say about running on a machine with
>>>> 2G available memory on a single processor is completely
>>>> incomparable to running on a machine with 64G of
>>>> memory available for Lucene and 16 processors.
>>>> There's really no such thing as an "optimum" Lucene index
>>>> size, it always relates to the characteristics of the
>>>> underlying hardware.
>>>> I think the best you can do is actually test on various
>>>> configurations, then at least you can say "on configuration
>>>> X this is the tipping point".
>>>> Sorry there isn't a better answer that I know of, but...
>>>> Best
>>>> Erick
>>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko <> wrote:
>>>>> Hi all,
>>>>> I know Lucene indexes to be at their optimum up to a certain size -- said
>>>>> to be around several GBs. I haven't found a good discussion of this, but
>>>>> it's my understanding that at some point it's better to split an index
>>>>> into parts (a la sharding) than to continue searching on one huge index.
>>>>> I assume this has to do with OS and IO configurations. Can anyone point
>>>>> me to more info on this?
>>>>> We have a product that uses Lucene for various searches, and at the
>>>>> moment each type of search uses its own Lucene index. We plan on
>>>>> refactoring the way it works and combining all indexes into one --
>>>>> making the whole system more robust and with a smaller memory footprint,
>>>>> among other things.
>>>>> Assuming the above is true, we are interested in knowing how to do this
>>>>> correctly. Initially all our indexes will be run in one big index, but
>>>>> if at some index size there is a severe performance degradation, we would
>>>>> like to handle that correctly by starting a new FSDirectory index to
>>>>> flush into, or by re-indexing and moving large indexes into their own
>>>>> Lucene index.
>>>>> Are there any guidelines for measuring or estimating this correctly?
>>>>> What should we be aware of while considering all that? We can't assume
>>>>> anything about the machine running it, so testing won't really tell us
>>>>> much...
>>>>> Thanks in advance for any input on this,
>>>>> Itamar.
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail:
>>>>> For additional commands, e-mail:
