lucene-dev mailing list archives

From Mark Miller <>
Subject Re: [jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
Date Tue, 03 Nov 2009 20:58:33 GMT
I use a low merge factor now too - and recommend it to others. But a  
lot of users like to use crazy high merge factors.

I'm not arguing the absolute worst case is common. It almost never is.

- Mark (mobile)

On Nov 3, 2009, at 12:52 PM, Jake Mannix <> wrote:

> There are really not that many hoops you need to jump through to be  
> able to periodically optimize down to 10 segments or so.  I've used  
> lucene at plenty of other places before LinkedIn, and rarely (since  
> 2.3's indexing speed blew through the roof) have I had to worry  
> about setting the merge factor too high, and even when I do, you  
> simply index into another directory while you're optimizing the  
> original (which is now kept read-only while keeping an in-memory  
> delete set).  It's not that hard, and it does wonders for your  
> performance.
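The dual-directory approach described above can be sketched roughly like this. All types here are illustrative placeholders, not Lucene's API: `Index`, `DualIndexSketch`, and the delete-replay logic are assumptions made for the sketch, not the thread author's actual code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DualIndexSketch {
    // Placeholder for whatever index abstraction the application uses.
    interface Index { void add(String doc); void delete(String id); void optimize(); }

    private final Index primary, secondary;
    private final Set<String> pendingDeletes = ConcurrentHashMap.newKeySet();
    private volatile boolean optimizing = false;

    DualIndexSketch(Index primary, Index secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    // New documents go to the secondary index while the primary is busy.
    void add(String doc) { (optimizing ? secondary : primary).add(doc); }

    // Deletes against the (temporarily read-only) primary are buffered in memory.
    void delete(String id) {
        if (optimizing) pendingDeletes.add(id);
        else primary.delete(id);
    }

    void optimize() {
        optimizing = true;
        try {
            primary.optimize();  // may take hours on a large index
        } finally {
            // Replay the deletes that arrived while the primary was read-only.
            for (String id : pendingDeletes) primary.delete(id);
            pendingDeletes.clear();
            optimizing = false;
        }
    }
}
```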
> Sure, plenty of lucene installations can't optimize, and while many  
> of those could do with some much-needed refactoring to allow them  
> the possibility of doing that (otherwise, you get what happened at  
> my old company before I worked there - there was never any optimize,  
> high merge factor, and commit after every document [ouch!], and  
> eventually query latency went through the roof and the system just  
> fell over), I understand that not everyone is going to do that.
> But even in these installations, I'm still saying that you've  
> narrowed the field down to a very tiny number if you add up all the  
> requirements for multiPQ to be painful for them (seriously: when is  
> 40MB going to hurt a system that's designed to handle 100QPS per  
> box?  Or when does 4MB hurt one designed to handle 10QPS?)
>   -jake
> On Tue, Nov 3, 2009 at 12:40 PM, Mark Miller <>  
> wrote:
> Not *ever* being able to optimize is a common case, unless you jump  
> through a lot of hoops. There are many systems that need to be up  
> nearly 24/7 - an optimize on a large index can take many hours -  
> usually an unknown number of them. LinkedIn and its use cases are  
> not the only consumers of Lucene.
> - Mark
> (mobile)
> On Nov 3, 2009, at 10:51 AM, "Jake Mannix (JIRA)" <>  
> wrote:
> Jake Mannix commented on LUCENE-1997:
> -------------------------------------
> bq. Since each approach has distinct advantages, why not offer both  
> ("simple" and "expert") comparator extensions APIs?
> +1 from me on this one, as long as the simpler one is around.  I'll  
> bet we'll find by 3.2 or so that we regret keeping the "expert" one,  
> but I'll take any compromise that gets the simpler API in there.
> bq. Don't forget that this is multiplied by however many queries are  
> currently in flight.
> Sure, so if you're running with 100 queries per second on a single  
> shard (pretty fast!), with 100 segments, and you want to do sorting  
> by value on the top 1000 values (how far down the long tail of  
> extreme cases are we at now?  Do librarians hit their search servers  
> with 100 QPS and have indices poorly built with hundreds of segments  
> and can't take downtime to *ever* optimize?), we're now talking  
> about 40MB.
> *Forty megabytes*.  On a beefy machine which is supposed to be  
> handling 100QPS across an index big enough to need 100 segments.   
> How much heap would such a machine already be allocating?  4GB?  6?   
> More?
> We're talking about less than 1% of the heap is being used by the  
> multiPQ approach in comparison to singlePQ.
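The arithmetic behind that figure checks out directly. The inputs below are the thread's own assumptions; `bytesPerEntry` is a rough guess of one 4-byte slot per queue entry, not a measured value.

```java
// Back-of-envelope estimate of multi-PQ overhead using the numbers
// quoted in this thread: 100 in-flight queries, 100 segments, top 1000.
public class MultiPQMemoryEstimate {
    public static void main(String[] args) {
        long queriesInFlight = 100; // ~100 QPS on a single shard
        long segments = 100;        // heavily fragmented, never-optimized index
        long topN = 1000;           // sorting down to the top 1000 values
        long bytesPerEntry = 4;     // rough guess: one 4-byte slot per PQ entry

        long bytes = queriesInFlight * segments * topN * bytesPerEntry;
        System.out.println(bytes / 1_000_000 + " MB"); // prints "40 MB"
    }
}
```

Dropping to 10 QPS scales the same product down to 4 MB, matching the second figure above.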
> Explore performance of multi-PQ vs single-PQ sorting API
> --------------------------------------------------------
>               Key: LUCENE-1997
>               URL:
>           Project: Lucene - Java
>        Issue Type: Improvement
>        Components: Search
>  Affects Versions: 2.9
>          Reporter: Michael McCandless
>          Assignee: Michael McCandless
>       Attachments: LUCENE-1997.patch, LUCENE-1997.patch,  
> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,  
> LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch,  
> LUCENE-1997.patch
> Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev,
> where a simpler (non-segment-based) comparator API is proposed that
> gathers results into multiple PQs (one per segment) and then merges
> them in the end.
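The idea can be sketched as follows, as a minimal illustration with bare integer scores; the actual patch's collector classes and comparator API differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class MultiPQSketch {
    // One bounded min-heap per segment: keeps the topN largest scores
    // seen so far, evicting the smallest when full.
    static void collect(PriorityQueue<Integer> segmentPQ, int score, int topN) {
        if (segmentPQ.size() < topN) {
            segmentPQ.add(score);
        } else if (score > segmentPQ.peek()) {
            segmentPQ.poll();
            segmentPQ.add(score);
        }
    }

    // After collection, merge all per-segment queues into one final top-N,
    // best score first.
    static List<Integer> merge(List<PriorityQueue<Integer>> perSegment, int topN) {
        PriorityQueue<Integer> merged = new PriorityQueue<>();
        for (PriorityQueue<Integer> pq : perSegment)
            for (int score : pq)
                collect(merged, score, topN);
        List<Integer> result = new ArrayList<>(merged);
        result.sort((a, b) -> b - a);
        return result;
    }
}
```

The per-segment queues are small (topN entries each), which is where the per-query memory multiplier discussed above comes from.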
> I started from John's multi-PQ code and worked it into
> contrib/benchmark so that we could run perf tests.  Then I generified
> the Python script I use for running search benchmarks (in
> contrib/benchmark/).
> The script first creates indexes with 1M docs (based on
> SortableSingleDocSource, and based on wikipedia, if available).  Then
> it runs various combinations:
>  * Index with 20 balanced segments vs index with the "normal" log
>   segment size
>  * Queries with different numbers of hits (only for wikipedia index)
>  * Different top N
>  * Different sorts (by title, for wikipedia, and by random string,
>   random int, and country for the random index)
> For each test, 7 search rounds are run and the best QPS is kept.  The
> script runs singlePQ then multiPQ, and records the resulting best QPS
> for each and produces a table (in Jira format) as output.
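The best-of-N methodology can be sketched like this; it is a generic harness, not the actual contrib/benchmark code, and the names are illustrative.

```java
// Minimal sketch of the "run N rounds, keep the best QPS" measurement
// style described above. Warm-up effects (JIT, OS caches) favor later
// rounds, which is why the best round (not the mean) is reported.
public class BestOfRounds {
    static double bestQps(Runnable searchRound, int rounds, int queriesPerRound) {
        double best = 0;
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            searchRound.run();  // executes queriesPerRound searches
            double secs = (System.nanoTime() - start) / 1e9;
            if (secs > 0) best = Math.max(best, queriesPerRound / secs);
        }
        return best;
    }
}
```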
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
