lucene-java-user mailing list archives

From Itamar Syn-Hershko <ita...@code972.com>
Subject Re: Index size and performance degradation
Date Sun, 12 Jun 2011 17:13:00 GMT
Shai, what would you call a smart app-level cache? Remembering frequent
searches and keeping their results handy? Or are there more advanced
techniques for that? Any pointers appreciated...
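
The simplest version of this is just an LRU map from query strings to their
top hits. A minimal sketch, assuming a plain LinkedHashMap-based cache; the
class and field names below are hypothetical, not part of any Lucene API:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical app-level cache: keeps the top document ids for the N
    // most recently used query strings, evicting the least recently used.
    public class QueryResultCache extends LinkedHashMap<String, int[]> {
        private final int maxEntries;

        public QueryResultCache(int maxEntries) {
            super(16, 0.75f, true); // accessOrder=true gives LRU behavior
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
            return size() > maxEntries;
        }
    }

Such a cache would be checked before hitting the IndexSearcher, and cleared
or versioned whenever the reader is reopened, or cached hits go stale.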


Thanks for all the advice!


On 12/06/2011 11:42, Shai Erera wrote:

>> isn't there anything that we can do to avoid that?
>>
> That was my point :) -->  you can optimize your search application, use mmap
> files, smart caches etc., until it reaches a point where you need to shard.
> But it's still application dependent, not much of an OS thing. You can count
> on the OS to cache what it needs in RAM, and if your index is small enough
> to exist in RAM, then it will probably be there. We've tried in the past to
> use RAMDirectory for GBs of indexes (we had the RAM to spare), and the OS
> cache just did a better job.
>
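For reference, switching to memory-mapped files is roughly a one-line change.
A minimal sketch against the Lucene 3.x API that was current at the time,
with the index path as a placeholder:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;

    // Memory-mapped directory: the OS page cache keeps the hot parts of
    // the index in RAM, without copying everything onto the Java heap the
    // way RAMDirectory does.
    Directory dir = new MMapDirectory(new File("/path/to/index"));
    IndexReader reader = IndexReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);

Note that FSDirectory.open() may already pick MMapDirectory on some 64-bit
platforms, so this can be the default behavior anyway.
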
> On the other hand, you can have a 100GB index, but very smart app-level
> caches that return results in a few ms ...
>
>> any idea how serious such a drop should be for sharding to be really
>> beneficial? or is it application specific too?
>
> That's application specific too, I'm afraid. For instance, if your system is
> expected to support 10 queries/sec, and your monitoring tool determines that
> it no longer does but has dropped to, say, 7, then that is not something
> you're willing to tolerate, and therefore you shard the index.
>
> But I've been working w/ applications that achieved 80 queries/sec on a
> large index on one machine, and others that were willing to accept 30
> seconds and even higher response time per query (for total recall, usually
> legal stuff). So, again, it's really hard to come up w/ a magic number :).
>
> People are used to Google's sub-second search response time. So if your app
> is aiming to give the same experience, then factor in some reasonable
> statistics like:
> * No query takes longer than 5s
> * The majority of queries, say 80%, finish in < 500ms
> * The above still holds at a rate of X queries/sec (X is dynamic and depends
> on what you aim for)
>
> These are just some numbers I've been using recently to benchmark my app
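
A hedged sketch of checking recorded latencies against such targets, using a
nearest-rank percentile; the thresholds are just the example figures above,
and nothing here is Lucene-specific:

    import java.util.Arrays;

    // Returns true if a batch of recorded query latencies (in ms) meets the
    // example targets: no query over 5s, and 80% finishing under 500ms.
    public static boolean meetsTargets(long[] latenciesMs) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        long max = sorted[sorted.length - 1];
        long p80 = sorted[(int) Math.ceil(0.8 * sorted.length) - 1]; // nearest-rank
        return max < 5000 && p80 < 500;
    }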
>
> Shai
>
> On Sun, Jun 12, 2011 at 11:10 AM, Itamar Syn-Hershko <itamar@code972.com> wrote:
>
>> Thanks.
>>
>>
>> The whole point of my question was to find out if and how to do balancing
>> on the SAME machine. Apparently that's not going to help, and at a certain
>> point we will just have to prompt the user to buy more hardware...
>>
>>
>> Out of curiosity, isn't there anything we can do to avoid that? For
>> instance, using memory-mapped files for the indexes? Anything that would
>> help us overcome OS limitations of that sort...
>>
>>
>> Also, you mention a scheduled job to check for performance degradation; any
>> idea how serious such a drop should be for sharding to be really beneficial?
>> or is it application specific too?
>>
>>
>> Itamar.
>>
>>
>>
>> On 12/06/2011 06:43, Shai Erera wrote:
>>
>>> I agree w/ Erick, there is no cutoff point (index size, for that matter)
>>> above which you start sharding.
>>>
>>> What you can do is create a scheduled job in your system that runs a
>>> select list of queries and monitors their performance. Once performance
>>> degrades, the job shards the index, either by splitting it (you can use
>>> IndexSplitter under contrib) or by creating a new shard and directing new
>>> documents to it.
>>>
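A minimal sketch of such a job, assuming a hypothetical list of
representative "canary" queries, a latency threshold, and a reshardIndex()
callback; only IndexSearcher.search() below is actual Lucene API:

    import java.io.IOException;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.search.Query;

    // Hourly job: run a fixed list of representative queries and trigger
    // resharding once the worst observed latency degrades past a threshold.
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(new Runnable() {
        public void run() {
            try {
                long worstMs = 0;
                for (Query q : canaryQueries) {        // hypothetical query list
                    long start = System.nanoTime();
                    searcher.search(q, 10);            // standard search call
                    worstMs = Math.max(worstMs, (System.nanoTime() - start) / 1000000);
                }
                if (worstMs > latencyThresholdMs) {    // hypothetical threshold
                    reshardIndex();                    // hypothetical: split or add a shard
                }
            } catch (IOException e) {
                // skip this round; try again on the next run
            }
        }
    }, 0, 1, TimeUnit.HOURS);
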
>>> I think I read somewhere, not sure if it was in Solr or ElasticSearch
>>> documentation, about a Balancer object, which moves shards around in order
>>> to balance the load on the cluster. You can implement something similar
>>> which tries to balance the index sizes, creates new shards on-the-fly, and
>>> even merges shards if a whole source is suddenly being removed from the
>>> system, etc.
>>>
>>> Also, note that the 'largest index size' threshold is really a machine
>>> constraint and not Lucene's. So if you decide that 10 GB is your cutoff, it
>>> is pointless to create 10x10GB shards on the same machine -- searching them
>>> is just like searching a 100GB index w/ 10x10GB segments. Perhaps it's even
>>> worse, because you consume more RAM when the indexes are split (e.g., terms
>>> index, field infos etc.).
>>>
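To see why, note that several shards on one box typically end up behind a
single MultiReader anyway, which behaves much like one index with extra
segments. A sketch in the 3.x-style API, with placeholder paths:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    // Searching ten local shards through one MultiReader is essentially
    // searching one index with that many extra segments, except each shard
    // also holds its own terms index, field infos etc. in RAM.
    IndexReader[] shards = new IndexReader[10];
    for (int i = 0; i < 10; i++) {
        shards[i] = IndexReader.open(FSDirectory.open(new File("/indexes/shard" + i)));
    }
    IndexSearcher searcher = new IndexSearcher(new MultiReader(shards));
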
>>> Shai
>>>
>>> On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson <erickerickson@gmail.com>
>>> wrote:
>>>
>>>> <<<We can't assume anything about the machine running it,
>>>> so testing won't really tell us much>>>
>>>>
>>>> Hmmm, then it's pretty hopeless I think. Problem is that
>>>> anything you say about running on a machine with
>>>> 2G available memory on a single processor is completely
>>>> incomparable to running on a machine with 64G of
>>>> memory available for Lucene and 16 processors.
>>>>
>>>> There's really no such thing as an "optimum" Lucene index
>>>> size, it always relates to the characteristics of the
>>>> underlying hardware.
>>>>
>>>> I think the best you can do is actually test on various
>>>> configurations, then at least you can say "on configuration
>>>> X this is the tipping point".
>>>>
>>>> Sorry there isn't a better answer that I know of, but...
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko <itamar@code972.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I know Lucene indexes to be at their optimum up to a certain size - said
>>>>> to be around several GBs. I haven't found a good discussion of this, but
>>>>> it's my understanding that at some point it's better to split an index
>>>>> into parts (a la sharding) than to continue searching a huge index. I
>>>>> assume this has to do with OS and IO configurations. Can anyone point me
>>>>> to more info on this?
>>>>>
>>>>> We have a product that uses Lucene for various searches, and at the
>>>>> moment each type of search uses its own Lucene index. We plan on
>>>>> refactoring the way it works and combining all indexes into one - making
>>>>> the whole system more robust and giving it a smaller memory footprint,
>>>>> among other things.
>>>>>
>>>>> Assuming the above is true, we are interested in knowing how to do this
>>>>> correctly. Initially all our indexes will be run in one big index, but if
>>>>> at some index size there is severe performance degradation, we would like
>>>>> to handle that correctly, either by starting a new FSDirectory index to
>>>>> flush into, or by re-indexing and moving large indexes into their own
>>>>> Lucene index.
>>>>>
>>>>> Are there any guidelines for measuring or estimating this correctly?
>>>>> What should we be aware of while considering all that? We can't assume
>>>>> anything about the machine running it, so testing won't really tell us
>>>>> much...
>>>>>
>>>>> Thanks in advance for any input on this,
>>>>>
>>>>> Itamar.
>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

