lucene-dev mailing list archives

From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Re: ConcurrentMergeScheduler and MergePolicy question
Date Sat, 08 Aug 2009 22:39:47 GMT
Mark,

On a system where the index is 10 times the size of RAM, let's
say 10 GB of RAM and a 100 GB index, is it acceptable for
optimize to take 30-60 minutes? Maybe the performance trade-off
(10-20% lower search performance from staying unoptimized) is
worth it? Otherwise the optimize literally takes down the machine.

Perhaps the ideal architecture for a search system that requires
optimizing is to dedicate a server to it: copy the index to the
optimize server, run the optimize there, copy the index off to a
search server, and start again with the next optimize task.

I wonder how/if this would work with Hadoop/HDFS, as copying
100 GB around would presumably tie up the network? Also, I've
found rsyncing large optimized indexes to be time consuming, and
it wreaks havoc on the search server's IO subsystem. Usually this
is unacceptable for the user, as queries will suddenly degrade.
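One partial mitigation for the IO problem is to cap the transfer rate so the copy doesn't saturate the search server's disks. The rate and paths below are illustrative only:

```shell
# Throttle rsync to ~10 MB/s; --bwlimit takes a value in KB/s.
# Host and paths are hypothetical placeholders.
rsync -a --bwlimit=10240 /var/lucene/index/ search-box:/var/lucene/index/
```

A throttled copy takes longer overall, but the searcher's query latency stays flat instead of spiking for the duration of the transfer.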

-J

On Mon, Aug 3, 2009 at 12:59 PM, Mark Miller<markrmiller@gmail.com> wrote:
> Michael McCandless wrote:
>
> After reading that, I played with some sorting code I had and did a quick
> cheesy test or two - one segment vs. 10 or 20. In that horrible test (based
> on the stress sort code), I don't remember seeing much of a difference. No
> sorting. Very, very unscientific, quick and dirty.
>
> This time I loaded up 1.3 million Wikipedia articles, gave the test 768 MB of
> RAM, warmed the Searcher with lots of searching before each measurement, and
> compared 1 segment vs. 5. The optimized index was 15-20% faster with the
> queries I was using (approx. 100 queries targeted at Wikipedia). It's an odd
> test system - Ubuntu, quad-core laptop with slow laptop drives and 4 GB of
> RAM. Still not very scientific, but better than before.
>
>
> Here is the benchmark I was using in various forms:
>
> { "Rounds"
>
>   ResetSystemErase
>
>   { "Populate"
>       -CreateIndex
>       { "MAddDocs" AddDoc > : 15000
>       -CloseIndex
>   }
>   { "test"
>       OpenReader
>       { "WarmRdrDocs" Warm > : 50
>       { "WarmRdr" Search > : 5000
>       { "SearchSameRdr" Search > : 50000
>       CloseReader
>       OpenIndex
>       PrintSegmentCount
>       Optimize
>       CloseIndex
>       NewRound
>   } : 2
> }
>
> RepSumByName
> RepSumByPrefRound SearchSameRdr
>
>
> I also did a quick profile for a 15k-doc index, 1 segment vs. 10 segments. I
> profiled each for approx. 11 million calls of readVInt. The hotspot results
> are below.
>
> http://myhardshadow.com/images/1seg.png
> http://myhardshadow.com/images/10seg.png
>
>
> Just a quick start at looking into this from over the weekend.
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>


