lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Denis Bazhenov <dot...@gmail.com>
Subject Re: Index size and performance degradation
Date Thu, 16 Jun 2011 08:42:33 GMT
To summarize what was said before.

In general, using sharding on single machine does make sense only if using single lucene instance
you could not utilize all the hardware on this machine. For it's own lucene does a good job
in this area, so I think it's very rarely situation where you really need to shard lucene
index on single machine.

One particular case when you want to shard that comes to my mind. If you have large JVM heap
(> 
4-8Gb) and machine with very large RAM size  it makes sense to shard for keeping GC ovehead
small (AFAIK JVM have some troubles on large heap sizes). Also I thinks there is some potential
in sharding if you do a lot of heavy lifting sorting. But again, it does make sense only if
there is no a lot of concurrent requests going in the system.

On 16.06.2011, at 16:44, Shai Erera wrote:

>> 
>> If single large index goes beyond GB, It may take more time to merge and
>> optimize.
>> 
> 
> This can be achieved w/ a single index too. LogMergePolicy allows setting a
> maxMergeMB and maxMergeMBForOptimize, which are thresholds that define the
> largest segment size to be merged. TieredMergePolicy, as far as I
> understand, goes one step further and lets you specify the largest segment
> size you'd wish to see as a result of a merge.
> 
> IO might be more as it needs to skip large amount bytes to locate the exact
>> location
>> 
> 
> I don't believe there's much difference between a single index and multiple
> ones. Lucene does not control how the low-level IO subsystem works. It could
> be that the larger index would perform better, or vice versa. I would keep
> the stored fields separate from the content index in two cases:
> 
> (1) You can ensure that the stored fields are stored on a separate physical
> disk than the content index, in which case you're more likely to get
> concurrency for fetching results and doing searches.
> 
> (2) The stored fields are kept in their own cluster. But this is really for
> large deployments, w/ many shards and high query volume. In those scenarios,
> queries are executed on one (logical) cluster, and then results are fed off
> to another logical cluster to produce the summaries. I doubt it fits more
> than a handful of systems though.
> 
> Shai
> 
> On Thu, Jun 16, 2011 at 8:10 AM, Ganesh <emailgane@yahoo.co.in> wrote:
> 
>> Any one could tthow some light on this? Is it a bad idea to keep multiple
>> shards in a single system?
>> 
>> Below are my reasons, Please correct me if iam wrong.
>> 
>> 1. If single large index goes beyond GB, It may take more time to merge and
>> optimize.
>> 2. Consider the total size of index is around 10 GB, then fdt file might be
>> in 3 - 4GB. In order to display result summary we may need to fetch the
>> field values from the fdt file. IO might be more as it needs to skip large
>> amount bytes to locate the exact location. In other words the search summary
>> retrieval might slow.
>> 3. It is really good for less number of concurrent users going to search at
>> a time.
>> 
>> Regards
>> Ganesh
>> 
>> 
>> 
>> ----- Original Message -----
>> From: "Ganesh" <emailgane@yahoo.co.in>
>> To: <java-user@lucene.apache.org>; <te@statsbiblioteket.dk>
>> Sent: Tuesday, June 14, 2011 3:28 PM
>> Subject: Re: Index size and performance degradation
>> 
>> 
>> Is it a bad idea to keep multiple shards in a single system?
>> 
>> Regards
>> Ganesh
>> 
>> ----- Original Message -----
>> From: "Toke Eskildsen" <te@statsbiblioteket.dk>
>> To: <java-user@lucene.apache.org>
>> Sent: Tuesday, June 14, 2011 12:58 PM
>> Subject: Re: Index size and performance degradation
>> 
>> 
>>> On Sun, 2011-06-12 at 10:10 +0200, Itamar Syn-Hershko wrote:
>>>> The whole point of my question was to find out if and how to make
>>>> balancing on the SAME machine. Apparently thats not going to help and at
>>>> a certain point we will just have to prompt the user to buy more
>> hardware...
>>> 
>>> It really depends on your scenario. If you have few concurrent requests
>>> and are looking to minimize latency, sharding might help; assuming you
>>> have fast IO and multiple cores. You basically want to saturate all
>>> available resources for all requests.
>>> 
>>> On the other hand, if throughput is the issue, sharding on a single
>>> machine is counter-productive due to increased duplication and merging.
>>> 
>>>> Out of curiosity, isn't there anything that we can do to avoid that? for
>>>> instance using memory-mapped files for the indexes? anything that would
>>>> help us overcome OS limitations of that sort...
>>> 
>>> One standard advice for speeding up searches is using SSD's. Our
>>> (admittedly old) experiments puts SSD-performance near RAM. With the
>>> prices we have now, SSD's seems like an obvious choice for most setups.
>>> 
>>> We tried a few performance tests at different index sizes and for us,
>>> index size vs. performance looked like the power law: Heavy performance
>>> degradation in the beginning, less later. It makes sense when we look at
>>> caching and it means that if you do not require stellar performance, you
>>> can have very large indexes on few machines (cue Hathi Trust).
>>> 
>>> - Toke Eskildsen
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>> 

---
Denis Bazhenov <dotsid@gmail.com>






---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message