lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: For indexing: how to estimate needed memory?
Date Thu, 03 May 2007 21:06:03 GMT
Coincidentally, I'm hacking at this very problem....

First, are you sure your free-memory calculation is OK? Why not
just use freeMemory()? Perhaps also call the gc if the available
memory isn't enough. Although I confess I don't know the innards of
how the various memory amounts interact.....

The approach I've been using is to gather some data as I'm indexing
to decide whether to flush the IndexWriter or not. That is, record what
ramSizeInBytes() returns before I start to index a document, record it
again after, and keep the worst ratio around. This got easier when I
subclassed IndexWriter and overrode the add methods.

But it does require that you call into your writer before you start
adding fields to a document to record the start size......

Then I'm requiring that I have 2X the worst-case ratio I've seen, applied
to the size of the incoming document, and flushing (perhaps gc-ing) if I
don't have enough.
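
Here's a rough sketch of what I mean (the class and method names are mine,
error handling is stripped out, and I'm overloading addDocument() so the
caller can pass in the raw doc size):

  import java.io.IOException;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;

  public class RatioTrackingIndexWriter extends IndexWriter {

      // Worst (largest) ratio of RAM growth to raw document size seen so far.
      private double worstRatio = 1.0;

      public RatioTrackingIndexWriter(Directory dir, Analyzer analyzer, boolean create)
              throws IOException {
          super(dir, analyzer, create);
      }

      // Record ramSizeInBytes() before and after adding, and keep the worst ratio.
      public void addDocument(Document doc, long rawDocSizeInBytes) throws IOException {
          long before = ramSizeInBytes();
          super.addDocument(doc);
          long after = ramSizeInBytes();
          double ratio = (double) (after - before) / Math.max(1, rawDocSizeInBytes);
          if (ratio > worstRatio) {
              worstRatio = ratio;
          }
      }

      // Require 2X the worst case seen so far before taking the next document.
      public boolean enoughHeadroomFor(long rawDocSizeInBytes) {
          Runtime rt = Runtime.getRuntime();
          long avail = rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
          return avail > 2L * (long) (worstRatio * rawDocSizeInBytes);
      }
  }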

Mostly, this is to keep from having to experiment with each different
data set we get to find the right merge factor and so on; I'm not
actually sure that this is giving me any measurable performance gains.

And I think that this is "good enough". What it allows (as does your
approach) is letting the usual case of documents much smaller than 20M+
accumulate and flush reasonably efficiently, without penalizing
my speed by, say, always keeping 250M free or some such. Again,
the critical thing is that you have to call in here *before* you index.

Curiously, I also got ratios of around 7X, so there's a lot going on.

Keep me posted if you come up with anything really cool!

Best
Erick


On 5/3/07, david m <dmataxo@gmail.com> wrote:
>
> Our application includes an indexing server that writes to multiple
> indexes in parallel (each thread writes to a single index). In order
> to avoid an OutOfMemoryError, each request to index a document is
> checked to see if the JVM has enough memory available to index the
> document.
>
> I know that IndexWriter.ramSizeInBytes() can be used to determine how
> much memory was consumed at the conclusion of indexing a document, but
> is there a way to know (or estimate) the peak memory consumed while
> indexing a document?
>
> For example, in a test set I have a 22 MB document where nearly every
> "word" is unique. It has text like this:
>
> 'DestAddrType' bin: 00 0D
> AttributeCustomerID 'Resources'
> AttributeDNIS '7730'
> AttributeUserData [295] 00 0E 00 00..
> 'DNIS_DATA' '323,000,TM,SDM1K5,AAR,,,'
> 'ENV_FLAG' 'P'
> 'T_APP_CODE' 'TM'
> 'TelephoneLine' '8'
> 'C_CALL_DATE' '01/19/06'
> 'C_START_TIME' '145650'
> 'C_END_TIME' '145710'
> AttributeCallType 2
>
> and so on...
>
> We are indexing a handful of fields for document meta-data - but they
> are tiny compared to the body of the document. Eight of those fields are
> stored (like a messageid, posteddate, typecode).
>
> The body is indexed into a single field. Our Analyzer splits tokens
> based on Character.isLetterOrDigit() and, when a term is uppercase,
> indexes a lowercase version of it.
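>
> (For reference, the tokenizer is essentially this - the class name is
> made up, and the filter that also indexes a lowercase form of
> uppercase terms isn't shown:)
>
>   import java.io.Reader;
>   import org.apache.lucene.analysis.CharTokenizer;
>
>   // Tokens are maximal runs of characters where isLetterOrDigit() is true.
>   public class LetterOrDigitTokenizer extends CharTokenizer {
>       public LetterOrDigitTokenizer(Reader input) {
>           super(input);
>       }
>
>       protected boolean isTokenChar(char c) {
>           return Character.isLetterOrDigit(c);
>       }
>   }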
>
> After indexing that single document ramSizeInBytes() returns 15.7 MB.
> That seems ok to me.
>
> But for this particular document I found (via trial and error) that
> at -Xmx165m Lucene throws an OutOfMemoryError.
>
> At -Xmx170m it indexes successfully.
>
> Just before calling addDoc() I see maximum available memory of: 160.5 MB
>
> The 160.5 MB is from this calc:
>
>   Runtime rt = Runtime.getRuntime();
>   long maxAvail = rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
>
> So it would appear that for this particular document, to avoid an
> OutOfMemoryError I'd need to be certain of having available memory
> approx 7x the doc size.
>
> I could require 7x the doc size in available memory for each doc (on
> the assumption my test document is at the extreme), but for more
> typical documents I'd be over-reserving memory, with the result of
> reduced throughput (as docs were forced to wait for available memory
> they likely don't need).
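>
> (For concreteness, that 7x check would look something like this - the
> constant and method name are made up for illustration:)
>
>   // Crude pre-check: require REQUIRED_FACTOR times the raw document
>   // size to be available before handing the doc to an indexing thread.
>   private static final int REQUIRED_FACTOR = 7;
>
>   private static boolean enoughMemoryToIndex(long docSizeInBytes) {
>       Runtime rt = Runtime.getRuntime();
>       long maxAvail = rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
>       return maxAvail >= REQUIRED_FACTOR * docSizeInBytes;
>   }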
>
> Instead I'm wondering: is there a better way for the index server to
> know (or guesstimate) what the memory requirement will be for each
> document, so that it doesn't start indexing more documents in
> parallel than available memory can support?
>
> Thanks,
> david.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
