lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Details on setting block parameters for Lucene41PostingsFormat
Date Sun, 11 Jan 2015 02:28:25 GMT
Tom:

I'll be very interested to see your final numbers. I did a worst-case test at
one point and saw a 2/3 reduction, but that was deliberately "worst case": I
used a bunch of string/text types, did some faceting on them, etc. IOW, not
real-world at all. So it'll be cool to see what you come up with.

The other benefit is that you have many, many fewer objects allocated on the
heap; I was seeing two orders of magnitude fewer. That's right, a 99%
reduction. Again, though, I was deliberately doing really bad stuff....

Best,
Erick

On Sat, Jan 10, 2015 at 4:58 PM, Tom Burton-West <tburtonw@umich.edu> wrote:
> Thanks Mike,
>
> We run our Solr 3.x indexing with 10GB/shard.  I've been testing Solr 4
> with 4, 6, and 8GB for heap.  As of Friday night when the indexes were about
> half done (about 400GB on disk) only the 4GB had issues.  I'll find out on
> Monday if the other runs had issues.  If we can go from 10GB in Solr 3.x to
> 6GB with Solr 4.x, that will be a significant change.
>
> With TermIndexInterval we traded off less memory use for increased chance
> of disk seeks and more data to be read per seek (and if I remember right,
> that more data was scanned sequentially rather than binary searched.)
> What is the trade-off when increasing the block size?
>
> Tom
>
> On Sat, Jan 10, 2015 at 4:46 AM, Michael McCandless <lucene@mikemccandless.com> wrote:
>
>> The first int to Lucene41PostingsFormat is the min block size (default
>> 25) and the second is the max (default 48) for the block tree terms
>> dict.
>>
>> The max must be >= 2*(min-1).
>>
>> Since you were using 8X the default before, maybe try min=200 and
>> max=398?  However, block tree should have been more RAM efficient than
>> 3.x's terms index... if you run CheckIndex with -verbose it will print
>> additional details about the block structure of your terms indices...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Jan 9, 2015 at 4:15 PM, Tom Burton-West <tburtonw@umich.edu> wrote:
>> > Hello all,
>> >
>> > We have over 3 billion unique terms in our indexes and with Solr 3.x we set
>> > the TermIndexInterval to about 8 times its default value in order to index
>> > without OOMs.  (
>> > http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again)
>> >
>> > We are now working with Solr 4 and running into memory issues and are
>> > wondering if we need to do something analogous for Solr 4.
>> >
>> > The javadoc for IndexWriterConfig (
>> > http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
>> > )
>> > indicates that the Lucene 4.1 postings format has some parameters which may
>> > be set:
>> > "...To configure its parameters (the minimum and maximum size for a block),
>> > you would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int, int)
>> > <https://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Lucene41PostingsFormat%28int,%20int%29>
>> > "
>> >
>> > Is there documentation or discussion somewhere about how to determine
>> > appropriate parameters or some detail about what setting the maxBlockSize
>> > and minBlockSize does?
>> >
>> > Tom Burton-West
>> > http://www.hathitrust.org/blogs/large-scale-search
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
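
Editor's sketch of the block-size arithmetic discussed in the thread: Mike gives the defaults (min 25, max 48), the rule that max must be >= 2*(min-1), and suggests min=200 and max=398 for an 8X scale-up. The small Java class below only reproduces that arithmetic. The class name BlockSizes and its methods are hypothetical helpers, not part of Lucene, and the min >= 2 guard is an extra sanity check that the thread does not state.

```java
// Hypothetical helper (not a Lucene class): checks the block-size rule
// quoted in the thread and derives scaled values from the defaults.
final class BlockSizes {
    // Defaults quoted in the thread for the block tree terms dict.
    static final int DEFAULT_MIN = 25;
    static final int DEFAULT_MAX = 48;

    // Smallest max permitted for a given min, per "max >= 2*(min-1)".
    static int smallestMaxFor(int min) {
        return 2 * (min - 1);
    }

    // True when the pair satisfies the rule from the thread.
    // The min >= 2 guard is an added assumption, not from the thread.
    static boolean isValid(int min, int max) {
        return min >= 2 && max >= smallestMaxFor(min);
    }

    // Scale the default min by `factor` and pair it with the smallest
    // max the rule allows. Returns {min, max}.
    static int[] scaled(int factor) {
        int min = DEFAULT_MIN * factor;
        return new int[] { min, smallestMaxFor(min) };
    }

    public static void main(String[] args) {
        int[] s = scaled(8);
        System.out.println("min=" + s[0] + " max=" + s[1]);
    }
}
```

With a factor of 8 this yields the same pair Mike suggests (200 and 398), and the defaults themselves sit exactly on the boundary, since 2*(25-1) = 48. The resulting values would then be the two ints passed to Lucene41PostingsFormat(int, int).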
