lucene-java-user mailing list archives

From Samarendra Pratap <samarz...@gmail.com>
Subject Re: Sharding Techniques
Date Fri, 13 May 2011 10:11:01 GMT
Hi Tom,
 Thanks for pointing me to something important (phrase queries) which I
wasn't thinking of.

 We are using synonyms which get expanded at run time. I'll have to give it
a thought.

 We are not applying synonyms at indexing time because that would make it
hard to change the list. We are not using a synonym analyzer either, because
of the issues related to synonyms of varying word length (any comments?).
These synonyms are expanded at the time of query formation and they do
contain phrases, in fact a good number (if not a big one) of those.
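
To make the query-time expansion concrete, here is a minimal sketch of the idea
in plain Java. It is not our actual code: the class name, the in-memory synonym
map and the output shape (classic query parser syntax) are assumptions made
for illustration, and it does not escape query parser special characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: expands a term into an OR group at query-formation time.
final class SynonymExpander {
    // Assumed to be loaded from an editable list, so it can change without reindexing.
    private final Map<String, List<String>> synonyms;

    SynonymExpander(Map<String, List<String>> synonyms) {
        this.synonyms = synonyms;
    }

    // Multi-word entries are quoted so that they are searched as phrases.
    static String quoteIfPhrase(String s) {
        return s.contains(" ") ? "\"" + s + "\"" : s;
    }

    // Builds field:(term OR synonym1 OR "multi word synonym").
    String expand(String field, String term) {
        List<String> alternatives = new ArrayList<>();
        alternatives.add(quoteIfPhrase(term));
        for (String syn : synonyms.getOrDefault(term, List.of())) {
            alternatives.add(quoteIfPhrase(syn));
        }
        return field + ":(" + String.join(" OR ", alternatives) + ")";
    }
}
```

With a map of "laptop" to ["notebook", "portable computer"], expand("title",
"laptop") gives title:(laptop OR notebook OR "portable computer"). Every
multi-word synonym becomes a phrase clause, which is where the positions
lookups Tom mentions come in.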

I would also like to share the following results of initial testing:

Comparison - single index vs. 21 indexes
Total Size - 18 GB
Queries run - 500
% improvement - roughly 18%

The guys here, however, were expecting more :-), but that's a good enough
reason to go for a single index.

(details of index and queries are there in the thread)


On Fri, May 13, 2011 at 12:18 AM, Burton-West, Tom <tburtonw@umich.edu> wrote:

> Hi Samar,
>
> Have you looked at top or iostat or other monitoring utilities to see if
> you are cpu bound vs I/O bound?
>
> With 225 term queries, it's possible that you are I/O bound.
>
> I suspect you need to think about seek time and caching. For each unique
> field:term combination lucene has to look up the postings for that term in
> the index.  Additionally, for any phrase, lucene has to look up
> the positions data for each term in the phrase. (In our case phrase searches
> are very expensive as our positions (*prx) index is about 8 times as large
> as our frq index.) So for 225 terms including some number of phrases, that is
> a lot of disk seeks.  To the extent that the terms are close together in the
> index and various buffer caches contain adjacent terms, you might not
> actually have 225 seeks, but I suspect there will still be a lot.
>
> Although Lucene implements a number of caches (and you should take a look
> at your cache hit ratios), Lucene depends on the OS disk cache to cache
> postings data for individual terms. Most unix/linux OS's use free memory for
> disk caching.  How much memory is available on the machine after the JVM
> gets its allocation?
>
> Have you considered running cache warming queries of your most frequent
> terms/phrases so that the data is in the OS disk cache?
>
> Tom
>
>
> >> When queries (without two fields mentioned above) have a lot of
> >> words/phrases search time is high. E.g. I took a query with around 80
> >> unique terms (not words) in 5 fields. These terms occur repeatedly and
> >> become 225 terms in total (non-unique). This particular query took 4.2
> >> seconds. The 15 indexes used for this query were of total size 5 GB.
> >> Is 225 terms (80 unique) a very big number?
>

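The counting Tom describes above (one postings lookup per unique field:term,
plus a positions lookup for every term that appears in a phrase) can be
sketched roughly as below. This is only a back-of-the-envelope model in plain
Java: the Clause type and the estimate method are hypothetical names, and it
ignores caching and how close the terms sit on disk, so treat the result as an
upper bound rather than a measurement.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a parsed query, used only for counting lookups.
final class LookupEstimate {
    // One clause: a field plus one term, or several terms forming a phrase.
    static final class Clause {
        final String field;
        final List<String> terms;

        Clause(String field, List<String> terms) {
            this.field = field;
            this.terms = terms;
        }

        boolean isPhrase() {
            return terms.size() > 1;
        }
    }

    // Returns {postings lookups, positions lookups}, counting each unique field:term once.
    static int[] estimate(List<Clause> clauses) {
        Set<String> postings = new HashSet<>();
        Set<String> positions = new HashSet<>();
        for (Clause c : clauses) {
            for (String term : c.terms) {
                String key = c.field + ":" + term;
                postings.add(key);          // every term needs its postings
                if (c.isPhrase()) {
                    positions.add(key);     // phrase terms also need positions data
                }
            }
        }
        return new int[] { postings.size(), positions.size() };
    }
}
```

Repeated terms collapse into one entry, which is why 225 non-unique terms can
come down to about 80 lookups. With the data split over several indexes, each
index presumably repeats these lookups against its own files.
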

-- 
Regards,
Samar
