lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Index size and performance degradation
Date Sun, 12 Jun 2011 17:29:42 GMT
>
> Shai, what would you call a smart app-level cache? remembering frequent
> searches and storing them handy?


Remembering frequent searches is good. If you do this, you can warm up the
cache whenever a new IndexSearcher is opened (e.g., if you use
SearcherManager from LIA2) and besides keeping the results 'ready', you also
keep them up-to-date.
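
As a rough illustration, here is a minimal sketch of that warm-up pattern, assuming a recent Lucene where SearcherManager (with maybeRefresh/acquire/release) is in core; the LIA2 class has a similar shape. frequentQueries and resultCache are hypothetical application-level names.

import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

public class FrequentQueryWarmer {
  private final Map<Query, TopDocs> resultCache = new ConcurrentHashMap<Query, TopDocs>();
  private final List<Query> frequentQueries; // e.g., mined from your query log
  private final SearcherManager manager;

  public FrequentQueryWarmer(SearcherManager manager, List<Query> frequentQueries) {
    this.manager = manager;
    this.frequentQueries = frequentQueries;
  }

  // Call periodically, or after commits: refresh and re-warm the cache.
  public void refreshAndWarm() throws IOException {
    if (manager.maybeRefresh()) {           // a new IndexSearcher was opened
      IndexSearcher searcher = manager.acquire();
      try {
        for (Query q : frequentQueries) {
          // results stay 'ready' and up-to-date w.r.t. the new reader
          resultCache.put(q, searcher.search(q, 10));
        }
      } finally {
        manager.release(searcher);
      }
    }
  }

  public TopDocs cached(Query q) {
    return resultCache.get(q);
  }
}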

Another thing to consider is a session-level cache (a rough sketch of such a
cache follows this list). This can have several uses:
(1) In non-AJAX apps (probably not too many today, hopefully?), a page reload
can issue several queries to the backend while usually only one portion of the
page gets updated, so caching the queries the user submitted during the
session will help here.
(2) It also helps if users interact w/ your web app by repeating the same
actions, e.g., a user who frequently clicks the "Back" and "Forward" buttons
in the browser. There are client-side solutions for that too, such as using
Dojo to store the 'previous' pages the user visited.
(3) If your app is ACL-constrained, caching the user's ACLs and their matching
docs will be very useful.
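
Roughly, a per-session LRU result cache could look like this (plain Java; the
names and the cache key -- raw query string, query + filters, user ACLs -- are
just illustrative assumptions):

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.search.TopDocs;

public class SessionResultCache extends LinkedHashMap<String, TopDocs> {
  private final int maxEntries;

  public SessionResultCache(int maxEntries) {
    super(16, 0.75f, true); // access-order iteration makes this an LRU
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, TopDocs> eldest) {
    return size() > maxEntries; // evict the least-recently-used entry
  }
}

You would typically hang one of these off the HTTP session and consult it
before hitting the IndexSearcher; invalidating entries when the index is
reopened is the part this sketch leaves out.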

In another app I'm involved with, there are several Filters. Their number
varies between deployments, but for each deployment they are fixed and known
in advance. Each Filter matches a different set of documents, and one or more
Filters are always added to every query. So we cache them (and warm them up
when opening new IndexSearchers) using CachingWrapperFilter, which is great
since it works at the segment level, so warming up is usually very fast.
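
Roughly, and against the Lucene 3.x/4.x Filter API (newer versions replaced
Filters with the query cache), the pattern looks like this; the
deployment-specific filter list is an assumption:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

public class FilterWarmer {
  private final List<CachingWrapperFilter> cachedFilters = new ArrayList<CachingWrapperFilter>();

  public FilterWarmer(List<Filter> deploymentFilters) {
    for (Filter f : deploymentFilters) {
      cachedFilters.add(new CachingWrapperFilter(f)); // caches per segment
    }
  }

  // Call whenever a new IndexSearcher is opened: only segments the filter
  // hasn't seen yet are actually computed, so this is usually fast.
  public void warm(IndexSearcher searcher) throws IOException {
    for (CachingWrapperFilter f : cachedFilters) {
      searcher.search(new MatchAllDocsQuery(), f, 1);
    }
  }

  public List<CachingWrapperFilter> filters() {
    return cachedFilters;
  }
}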

Another cache, which is very high-level, is pre-defined queries with their
matching result sets. That is, for queries like "my-company-name" you always
return a fixed top-N result set, and only if the user pages past it do you run
the actual query.
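
A rough sketch of that idea (all names are hypothetical): canned results are
pre-computed for a handful of well-known queries, and the real query only runs
when the user pages past the pre-computed window.

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class PredefinedQueryCache {
  private static final int PAGE_SIZE = 10;
  private final Map<String, TopDocs> canned = new ConcurrentHashMap<String, TopDocs>();

  // Pre-compute the first few pages for a well-known query (e.g., on warm-up).
  public void preCompute(String key, Query query, IndexSearcher searcher, int n)
      throws IOException {
    canned.put(key, searcher.search(query, n));
  }

  public TopDocs results(String key, Query query, IndexSearcher searcher, int page)
      throws IOException {
    TopDocs cached = canned.get(key);
    int needed = (page + 1) * PAGE_SIZE;
    if (cached != null && cached.scoreDocs.length >= needed) {
      return cached;                        // serve the fixed N results
    }
    return searcher.search(query, needed);  // user paged past the cache
  }
}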

And the list can go on and on :).

Shai

On Sun, Jun 12, 2011 at 8:13 PM, Itamar Syn-Hershko <itamar@code972.com> wrote:

> Shai, what would you call a smart app-level cache? remembering frequent
> searches and storing them handy? or are there more advanced techniques for
> that? any pointers appreciated...
>
>
> Thanks for all the advice!
>
>
> On 12/06/2011 11:42, Shai Erera wrote:
>
>>> isn't there anything that we can do to avoid that?
>>
>> That was my point :) --> you can optimize your search application, use mmap
>> files, smart caches etc., until it reaches a point where you need to shard.
>> But it's still application dependent, not much of an OS thing. You can count
>> on the OS to cache what it needs in RAM, and if your index is small enough
>> to fit in RAM, then it will probably be there. We've tried in the past to
>> use RAMDirectory for GBs of indexes (we had the RAM to spare), and the OS
>> cache just did a better job.
>>
>> On the other hand, you can have a 100GB index, but very smart app-level
>> caches that return results in a few ms ...
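
A minimal sketch of the mmap option mentioned above, assuming a recent Lucene
where MMapDirectory takes a java.nio Path (the index path is a placeholder):

import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class MmapSearcher {
  public static IndexSearcher open() throws Exception {
    // Memory-map the index and let the OS page cache keep the hot parts in
    // RAM; on 64-bit JVMs, FSDirectory.open would pick MMapDirectory anyway.
    Directory dir = new MMapDirectory(Paths.get("/path/to/index"));
    DirectoryReader reader = DirectoryReader.open(dir);
    return new IndexSearcher(reader);
  }
}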
>>
>>> any idea how serious such a drop should be for sharding to be really
>>> beneficial? or is it application specific too?
>>
>> That's application specific too, I'm afraid. For instance, if your system is
>> expected to support 10 queries/sec, and that monitoring job determines that
>> it no longer sustains that rate but has dropped to, say, 7, then that is not
>> something you're willing to tolerate, and therefore you shard the index.
>>
>> But I've been working w/ applications that achieved 80 queries/sec on a
>> large index on one machine, and others that were willing to accept 30
>> seconds and even higher response time per query (for total recall, usually
>> legal stuff). So, again, it's really hard to come up w/ a magic number :).
>>
>> People are used to Google's sub-second search response time. So if your app
>> is aiming to give the same experience, then factor in some reasonable
>> statistics like:
>> * No query takes longer than 5s
>> * The majority of the queries, say 80%, finish in < 500ms
>> * The above still holds at a rate of X queries/sec (X is dynamic and depends
>> on what you aim for)
>>
>> These are just some numbers I've been using recently to benchmark my app.
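
A plain-Java sketch of checking measured per-query latencies against targets
like those (the names and thresholds are just the illustrative numbers above):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LatencyTargets {
  // True if the max latency is <= 5s and the 80th percentile is < 500ms.
  public static boolean meetsTargets(List<Long> latenciesMs) {
    if (latenciesMs.isEmpty()) {
      return false;
    }
    List<Long> sorted = new ArrayList<Long>(latenciesMs);
    Collections.sort(sorted);
    long max = sorted.get(sorted.size() - 1);
    long p80 = sorted.get((int) Math.ceil(0.8 * sorted.size()) - 1);
    return max <= 5000 && p80 < 500;
  }
}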
>>
>> Shai
>>
>> On Sun, Jun 12, 2011 at 11:10 AM, Itamar Syn-Hershko <itamar@code972.com> wrote:
>>
>>> Thanks.
>>>
>>> The whole point of my question was to find out if and how to do balancing
>>> on the SAME machine. Apparently that's not going to help, and at a certain
>>> point we will just have to prompt the user to buy more hardware...
>>>
>>> Out of curiosity, isn't there anything that we can do to avoid that? for
>>> instance using memory-mapped files for the indexes? anything that would
>>> help us overcome OS limitations of that sort...
>>>
>>> Also, you mention a scheduled job to check for performance degradation;
>>> any idea how serious such a drop should be for sharding to be really
>>> beneficial? or is it application specific too?
>>>
>>> Itamar.
>>>
>>>
>>>
>>> On 12/06/2011 06:43, Shai Erera wrote:
>>>
>>>> I agree w/ Erick, there is no cutoff point (index size, for that matter)
>>>> above which you start sharding.
>>>>
>>>> What you can do is create a scheduled job in your system that runs a
>>>> select list of queries and monitors their performance. Once it degrades,
>>>> it shards the index, either by splitting it (you can use IndexSplitter
>>>> under contrib) or by creating a new shard and directing new documents to
>>>> it.
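
A rough sketch of such a monitoring job (the names are hypothetical; the
reaction -- IndexSplitter, or opening a brand-new shard -- is left as a
callback):

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class DegradationMonitor {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void start(final IndexSearcher searcher, final List<Query> probes,
                    final long budgetMs, final Runnable onDegradation) {
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() {
        try {
          long start = System.nanoTime();
          for (Query q : probes) {
            searcher.search(q, 10);        // the select list of queries
          }
          long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
          if (elapsedMs > budgetMs) {
            onDegradation.run();           // e.g., split the index or add a shard
          }
        } catch (Exception e) {
          // log and keep the schedule alive
        }
      }
    }, 1, 60, TimeUnit.MINUTES);
  }
}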
>>>>
>>>> I think I read somewhere, not sure if it was in the Solr or ElasticSearch
>>>> documentation, about a Balancer object which moves shards around in order
>>>> to balance the load on the cluster. You can implement something similar
>>>> that tries to balance the index sizes, creates new shards on-the-fly, and
>>>> even merges shards if suddenly a whole source is removed from the system,
>>>> etc.
>>>>
>>>> Also, note that the 'largest index size' threshold is really a machine
>>>> constraint and not Lucene's. So if you decide that 10GB is your cutoff, it
>>>> is pointless to create 10x10GB shards on the same machine -- searching
>>>> them is just like searching a 100GB index w/ 10x10GB segments. Perhaps
>>>> it's even worse, because you consume more RAM when the indexes are split
>>>> (e.g., terms index, field infos etc.).
>>>>
>>>> Shai
>>>>
>>>> On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson <erickerickson@gmail.com> wrote:
>>>>
>>>>> <<<We can't assume anything about the machine running it,
>>>>> so testing won't really tell us much>>>
>>>>>
>>>>> Hmmm, then it's pretty hopeless I think. Problem is that
>>>>> anything you say about running on a machine with
>>>>> 2G available memory on a single processor is completely
>>>>> incomparable to running on a machine with 64G of
>>>>> memory available for Lucene and 16 processors.
>>>>>
>>>>> There's really no such thing as an "optimum" Lucene index
>>>>> size, it always relates to the characteristics of the
>>>>> underlying hardware.
>>>>>
>>>>> I think the best you can do is actually test on various
>>>>> configurations, then at least you can say "on configuration
>>>>> X this is the tipping point".
>>>>>
>>>>> Sorry there isn't a better answer that I know of, but...
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko <itamar@code972.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I know Lucene indexes to be at their optimum up to a certain size - said
>>>>>> to be around several GBs. I haven't found a good discussion over this,
>>>>>> but it's my understanding that at some point it's better to split an
>>>>>> index into parts (a la sharding) than to continue searching on a
>>>>>> huge-size index. I assume this has to do with OS and IO configurations.
>>>>>> Can anyone point me to more info on this?
>>>>>>
>>>>>> We have a product that is using Lucene for various searches, and at the
>>>>>> moment each type of search is using its own Lucene index. We plan on
>>>>>> refactoring the way it works and to combine all indexes into one -
>>>>>> making the whole system more robust and with a smaller memory footprint,
>>>>>> among other things.
>>>>>>
>>>>>> Assuming the above is true, we are interested in knowing how to do this
>>>>>> correctly. Initially all our indexes will be run in one big index, but
>>>>>> if at some index size there is a severe performance degradation we would
>>>>>> like to handle that correctly by starting a new FSDirectory index to
>>>>>> flush into, or by re-indexing and moving large indexes into their own
>>>>>> Lucene index.
>>>>>>
>>>>>> Are there any guidelines for measuring or estimating this correctly?
>>>>>> What should we be aware of while considering all that? We can't assume
>>>>>> anything about the machine running it, so testing won't really tell us
>>>>>> much...
>>>>>>
>>>>>> Thanks in advance for any input on this,
>>>>>>
>>>>>> Itamar.
