Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Message-ID: <4DF525B6.2000902@code972.com>
Date: Sun, 12 Jun 2011 23:46:46 +0300
From: Itamar Syn-Hershko <itamar@code972.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US;
 rv:1.9.2.17) Gecko/20110414 Lightning/1.0b2 Thunderbird/3.1.10
MIME-Version: 1.0
To: java-user@lucene.apache.org
Subject: Re: Index size and performance degradation
References: <4DF3C40F.9090703@code972.com>
	<BANLkTi=2U3RTtOq_1oiA_FGBwpiPcVVMyg@mail.gmail.com>
	<BANLkTinHragJQUTX3j+mcdW0K0B73A0T=A@mail.gmail.com>
	<4DF47461.1050004@code972.com>
	<BANLkTim_RCpPNkb2WyGwCJ3a0eDyiOBGjA@mail.gmail.com>
	<4DF504B2.4000105@code972.com>
 <BANLkTim_sX8CQMknZJwWX4q6pt6Jx5wrWQ@mail.gmail.com>
In-Reply-To: <BANLkTim_sX8CQMknZJwWX4q6pt6Jx5wrWQ@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Thanks for your detailed answer. We'll have to tackle this and see whats 
more important to us then. I'd definitely love to hear Zoie has overcame 
all that...


Any pointers to Michael Busch's approach? I take this has something to 
do with the core itself or index format, probably using the Flex version?


Itamar.


On 12/06/2011 23:12, Michael McCandless wrote:

> > From what I understand of Zoie (and it's been some time since I last
> looked... so this could be wrong now), the biggest difference vs NRT
> is that Zoie aims for "immediate consistency", ie index changes are
> always made visible to the very next query, vs NRT which is
> "controlled consistency", a blend between immediate and eventual
> consistency where your app decides when the changes must become
> visible.
>
> But in exchange for that, Zoie pays a price: each search has a higher
> cost per collected hit, since it must post-filter for deleted docs.
> And since Zoie necessarily adds complexity, there's more risk; eg
> there were some nasty Zoie bugs that took quite some time to track
> down (under https://issues.apache.org/jira/browse/LUCENE-2729).
>
> Anyway, I don't think that's a good tradeoff, in general, for our
> users, because very few apps truly require immediate consistency from
> Lucene (can anyone give an example where their app depends on
> immediate consistency...?).  I think it's better to spend time during
> reopen so that searches aren't slower.
>
> That said, Lucene has already incorporated one big part of Zoie
> (caching small segments in RAM) via the new NRTCachingDirectory (in
> contrib/misc).  Also, the upcoming NRTManager
> (https://issues.apache.org/jira/browse/LUCENE-2955) adds control over
> visibility of specific indexing changes to queries that need to see
> the changes.
>
> Finally, even better would be to not have to make any tradeoff
> whatsoever ;)  Twitter's approach (created by Michael Busch) seems to
> bring immediate consistency with no search performance hit, so if we
> do anything here likely it'll be similar to what Michael has done
> (though, those changes are not simple either!).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sun, Jun 12, 2011 at 2:25 PM, Itamar Syn-Hershko<itamar@code972.com>  wrote:
>> Mike,
>>
>>
>> Speaking of NRT, and completely off-topic, I know: Lucene's NRT apparently
>> isn't fast enough if Zoie was needed, and now that Zoie is around are there
>> any plans to make it Lucene's default? or: why would one still use NRT when
>> Zoie seem to work much better?
>>
>>
>> Itamar.
>>
>>
>> On 12/06/2011 13:16, Michael McCandless wrote:
>>
>>> Remember that memory-mapping is not a panacea: at the end of the day,
>>> if there just isn't enough RAM on the machine to keep your full
>>> "working set" hot, then the OS will have to hit the disk, regardless
>>> of whether the access is through MMap or a "traditional" IO request.
>>>
>>> That said, on Fedora Linux anyway, I generally see better performance
>>> from MMap than from NIOFSDir; eg see the 2nd chart here:
>>>
>>>
>>> http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko<itamar@code972.com>
>>>   wrote:
>>>> Thanks.
>>>>
>>>>
>>>> The whole point of my question was to find out if and how to make
>>>> balancing
>>>> on the SAME machine. Apparently thats not going to help and at a certain
>>>> point we will just have to prompt the user to buy more hardware...
>>>>
>>>>
>>>> Out of curiosity, isn't there anything that we can do to avoid that? for
>>>> instance using memory-mapped files for the indexes? anything that would
>>>> help
>>>> us overcome OS limitations of that sort...
>>>>
>>>>
>>>> Also, you mention a scheduled job to check for performance degradation;
>>>> any
>>>> idea how serious such a drop should be for sharding to be really
>>>> beneficial?
>>>> or is it application specific too?
>>>>
>>>>
>>>> Itamar.
>>>>
>>>>
>>>> On 12/06/2011 06:43, Shai Erera wrote:
>>>>
>>>>> I agree w/ Erick, there is no cutoff point (index size for that matter)
>>>>> above which you start sharding.
>>>>>
>>>>> What you can do is create a scheduled job in your system that runs a
>>>>> select
>>>>> list of queries and monitors their performance. Once it degrades, it
>>>>> shards
>>>>> the index by either splitting it (you can use IndexSplitter under
>>>>> contrib)
>>>>> or create a new shard, and direct new documents to it.
>>>>>
>>>>> I think I read somewhere, not sure if it was in Solr or ElasticSearch
>>>>> documentation, about a Balancer object, which moves shards around in
>>>>> order
>>>>> to balance the load on the cluster. You can implement something similar
>>>>> which tries to balance the index sizes, creates new shards on-the-fly,
>>>>> even
>>>>> merge shards if suddenly a whole source is being removed from the system
>>>>> etc.
>>>>>
>>>>> Also, note that the 'largest index size' threshold is really a machine
>>>>> constraint and not Lucene's. So if you decide that 10 GB is your cutoff,
>>>>> it
>>>>> is pointless to create 10x10GB shards on the same machine -- searching
>>>>> them
>>>>> is just like searching a 100GB index w/ 10x10GB segments. Perhaps it's
>>>>> even
>>>>> worse because you consume more RAM when the indexes are split (e.g.,
>>>>> terms
>>>>> index, field infos etc.).
>>>>>
>>>>> Shai
>>>>>
>>>>> On Sun, Jun 12, 2011 at 3:10 AM, Erick
>>>>> Erickson<erickerickson@gmail.com>wrote:
>>>>>
>>>>>> <<<We can't assume anything about the machine running it,
>>>>>> so testing won't really tell us much>>>
>>>>>>
>>>>>> Hmmm, then it's pretty hopeless I think. Problem is that
>>>>>> anything you say about running on a machine with
>>>>>> 2G available memory on a single processor is completely
>>>>>> incomparable to running on a machine with 64G of
>>>>>> memory available for Lucene and 16 processors.
>>>>>>
>>>>>> There's really no such thing as an "optimum" Lucene index
>>>>>> size, it always relates to the characteristics of the
>>>>>> underlying hardware.
>>>>>>
>>>>>> I think the best you can do is actually test on various
>>>>>> configurations, then at least you can say "on configuration
>>>>>> X this is the tipping point".
>>>>>>
>>>>>> Sorry there isn't a better answer that I know of, but...
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko<itamar@code972.com>
>>>>>> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I know Lucene indexes to be at their optimum up to a certain size -
>>>>>>> said
>>>>>> to
>>>>>>> be around several GBs. I haven't found a good discussion over this,
>>>>>>> but
>>>>>> its
>>>>>>> my understanding that at some point its better to split an index into
>>>>>> parts
>>>>>>> (a la sharding) than to continue searching on a huge-size index. I
>>>>>>> assume
>>>>>>> this has to do with OS and IO configurations. Can anyone point me to
>>>>>>> more
>>>>>>> info on this?
>>>>>>>
>>>>>>> We have a product that is using Lucene for various searches, and at
>>>>>>> the
>>>>>>> moment each type of search is using its own Lucene index. We plan on
>>>>>>> refactoring the way it works and to combine all indexes into one -
>>>>>>> making
>>>>>>> the whole system more robust and with a smaller memory footprint,
>>>>>>> among
>>>>>>> other things.
>>>>>>>
>>>>>>> Assuming the above is true, we are interested in knowing how to do
>>>>>>> this
>>>>>>> correctly. Initially all our indexes will be run in one big index, but
>>>>>>> if
>>>>>> at
>>>>>>> some index size there is a severe performance degradation we would
>>>>>>> like
>>>>>> to
>>>>>>> handle that correctly by starting a new FSDirectory index to flush
>>>>>>> into,
>>>>>> or
>>>>>>> by re-indexing and moving large indexes into their own Lucene index.
>>>>>>>
>>>>>>> Are there are any guidelines for measuring or estimating this
>>>>>>> correctly?
>>>>>>> what we should be aware of while considering all that? We can't assume
>>>>>>> anything about the machine running it, so testing won't really tell us
>>>>>>> much...
>>>>>>>
>>>>>>> Thanks in advance for any input on this,
>>>>>>>
>>>>>>> Itamar.
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org