lucene-java-user mailing list archives

From Andrew Kane <andrewrk...@gmail.com>
Subject Re: Index size and performance degradation
Date Sun, 12 Jun 2011 08:45:47 GMT
In the literature there is some evidence that, on multi-core machines, sharding
in-memory indexes across cores might perform better than searching one large
index.  Has anyone tried this lately?

    http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4228359
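
For what it's worth, searching several in-memory shards in parallel is doable
with stock Lucene -- a rough sketch, assuming the 3.x APIs (RAMDirectory or any
Directory per shard, MultiReader, and the ExecutorService-taking IndexSearcher
constructor); the class and method names below are made up:

    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;

    public class ShardedSearcherSketch {
      // Wrap one reader per in-memory shard in a MultiReader and let the
      // IndexSearcher fan each query out over a pool sized to the core count.
      public static IndexSearcher open(Directory[] shards) throws IOException {
        IndexReader[] readers = new IndexReader[shards.length];
        for (int i = 0; i < shards.length; i++) {
          readers[i] = IndexReader.open(shards[i]);
        }
        ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
        return new IndexSearcher(new MultiReader(readers), pool);
      }
    }

Whether this actually beats one big reader depends on how well the shards fit
in RAM and on segment counts, so it still needs benchmarking on the target box.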

Single-disk machines (HDD or SSD) would be slower.  Multi-disk or RAID-type
setups might have some benefits.  What's your hardware setup?

Andrew.


On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko <itamar@code972.com> wrote:

> Thanks.
>
>
> The whole point of my question was to find out if and how to do such balancing
> on the SAME machine. Apparently that's not going to help, and at a certain
> point we will just have to prompt the user to buy more hardware...
>
>
> Out of curiosity, isn't there anything we can do to avoid that? For instance,
> using memory-mapped files for the indexes? Anything that would help us
> overcome OS limitations of that sort...
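>
> Something like opening the index through MMapDirectory, if I understand it
> right -- a rough sketch, assuming the Lucene 3.x API (the class name and the
> command-line argument are made up):
>
>     import java.io.File;
>
>     import org.apache.lucene.index.IndexReader;
>     import org.apache.lucene.search.IndexSearcher;
>     import org.apache.lucene.store.MMapDirectory;
>
>     public class MMapOpenSketch {
>       public static void main(String[] args) throws Exception {
>         // Map the index files instead of reading them onto the Java heap;
>         // the OS page cache then decides which parts stay resident.
>         MMapDirectory dir = new MMapDirectory(new File(args[0]));
>         IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
>         System.out.println("maxDoc=" + searcher.getIndexReader().maxDoc());
>         searcher.close();
>       }
>     }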
>
>
> Also, you mention a scheduled job to check for performance degradation; any
> idea how serious such a drop should be before sharding becomes really
> beneficial? Or is that application-specific too?
>
>
> Itamar.
>
>
>
> On 12/06/2011 06:43, Shai Erera wrote:
>
>> I agree w/ Erick: there is no cutoff point (index size, that is) above
>> which you should start sharding.
>>
>> What you can do is create a scheduled job in your system that runs a select
>> list of queries and monitors their performance. Once performance degrades,
>> it shards the index, either by splitting it (you can use IndexSplitter under
>> contrib) or by creating a new shard and directing new documents to it.
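>>
>> As a rough sketch (the names and the one-hour interval are made up; assumes
>> the 3.x search API), such a monitoring job could look like this:
>>
>>     import java.util.concurrent.Executors;
>>     import java.util.concurrent.ScheduledExecutorService;
>>     import java.util.concurrent.TimeUnit;
>>
>>     import org.apache.lucene.search.IndexSearcher;
>>     import org.apache.lucene.search.Query;
>>
>>     public class LatencyCanary {
>>       // Re-run a fixed set of queries periodically and flag the index once
>>       // their average latency crosses a threshold you picked empirically.
>>       public static void schedule(final IndexSearcher searcher,
>>                                   final Query[] canaries,
>>                                   final long thresholdMillis) {
>>         ScheduledExecutorService timer =
>>             Executors.newSingleThreadScheduledExecutor();
>>         timer.scheduleAtFixedRate(new Runnable() {
>>           public void run() {
>>             try {
>>               long start = System.currentTimeMillis();
>>               for (Query q : canaries) {
>>                 searcher.search(q, 10); // only the latency matters here
>>               }
>>               long avg = (System.currentTimeMillis() - start) / canaries.length;
>>               if (avg > thresholdMillis) {
>>                 // degraded: split the index (contrib IndexSplitter) or open
>>                 // a new shard and route new documents to it
>>               }
>>             } catch (Exception e) {
>>               // log and skip this round in a real system
>>             }
>>           }
>>         }, 1, 1, TimeUnit.HOURS);
>>       }
>>     }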
>>
>> I think I read somewhere, not sure if it was in the Solr or ElasticSearch
>> documentation, about a Balancer object which moves shards around in order
>> to balance the load on the cluster. You can implement something similar
>> which tries to balance the index sizes, creates new shards on-the-fly, and
>> even merges shards if a whole source is suddenly removed from the system,
>> etc.
>>
>> Also, note that the 'largest index size' threshold is really a machine
>> constraint and not Lucene's. So if you decide that 10 GB is your cutoff,
>> it is pointless to create 10x10GB shards on the same machine -- searching
>> them is just like searching a 100GB index w/ 10x10GB segments. Perhaps
>> it's even worse, because you consume more RAM when the indexes are split
>> (e.g., terms index, field infos etc.).
>>
>> Shai
>>
>> On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson <erickerickson@gmail.com>
>> wrote:
>>
>>  <<<We can't assume anything about the machine running it,
>>> so testing won't really tell us much>>>
>>>
>>> Hmmm, then it's pretty hopeless I think. Problem is that
>>> anything you say about running on a machine with
>>> 2G available memory on a single processor is completely
>>> incomparable to running on a machine with 64G of
>>> memory available for Lucene and 16 processors.
>>>
>>> There's really no such thing as an "optimum" Lucene index
>>> size; it always depends on the characteristics of the
>>> underlying hardware.
>>>
>>> I think the best you can do is actually test on various
>>> configurations, then at least you can say "on configuration
>>> X this is the tipping point".
>>>
>>> Sorry there isn't a better answer that I know of, but...
>>>
>>> Best
>>> Erick
>>>
>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko <itamar@code972.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I understand Lucene indexes are at their optimum up to a certain size --
>>>> said to be around several GBs. I haven't found a good discussion of this,
>>>> but it's my understanding that at some point it's better to split an index
>>>> into parts (a la sharding) than to continue searching a huge index. I
>>>> assume this has to do with OS and IO configurations. Can anyone point me
>>>> to more info on this?
>>>>
>>>> We have a product that uses Lucene for various searches, and at the
>>>> moment each type of search uses its own Lucene index. We plan to refactor
>>>> the way it works and combine all indexes into one -- making the whole
>>>> system more robust and giving it a smaller memory footprint, among other
>>>> things.
>>>>
>>>> Assuming the above is true, we are interested in knowing how to do this
>>>> correctly. Initially all our indexes will be merged into one big index,
>>>> but if at some index size there is severe performance degradation, we
>>>> would like to handle that correctly -- either by starting a new
>>>> FSDirectory index to flush into, or by re-indexing and moving large
>>>> indexes into their own Lucene index.
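>>>>
>>>> To illustrate what I mean by starting a new index to flush into -- a
>>>> rough sketch, assuming the Lucene 3.x IndexWriterConfig API and the
>>>> LUCENE_32 version constant (the class and method names are made up):
>>>>
>>>>     import java.io.File;
>>>>
>>>>     import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>     import org.apache.lucene.index.IndexWriter;
>>>>     import org.apache.lucene.index.IndexWriterConfig;
>>>>     import org.apache.lucene.store.FSDirectory;
>>>>     import org.apache.lucene.util.Version;
>>>>
>>>>     public class ShardRollover {
>>>>       // Close the writer of the full shard and start routing new
>>>>       // documents to a fresh FSDirectory-backed index in a new directory.
>>>>       public static IndexWriter rollOver(IndexWriter current, File newShardDir)
>>>>           throws Exception {
>>>>         current.close();
>>>>         IndexWriterConfig cfg = new IndexWriterConfig(
>>>>             Version.LUCENE_32, new StandardAnalyzer(Version.LUCENE_32));
>>>>         return new IndexWriter(FSDirectory.open(newShardDir), cfg);
>>>>       }
>>>>     }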
>>>>
>>>> Are there any guidelines for measuring or estimating this correctly?
>>>> What should we be aware of while considering all that? We can't assume
>>>> anything about the machine running it, so testing won't really tell us
>>>> much...
>>>>
>>>> Thanks in advance for any input on this,
>>>>
>>>> Itamar.
>>>>
