lucene-java-user mailing list archives

From Tim Sturge <tstu...@metaweb.com>
Subject Re: java gc with a frequently changing index?
Date Mon, 30 Jul 2007 20:29:18 GMT
Thanks for the reply Erick,

I believe it is the gc for four reasons:

- I've tried the "warmup" approach already and it didn't change the 
situation.

- The server completely pauses for several seconds. I run jstack to find 
out where the pause is, and jstack itself also pauses for several seconds 
before telling me the server is doing something perfectly innocuous. If I 
were stuck in some search overhead, I would expect jstack to tell me where 
(and I would expect the where to be somewhere interesting and vaguely 
repeatable).

- The impact is very uneven. Over 50000 sequential queries I get 
49500 at 3 msec, 450 at 300 msec and 50 at 3 sec or more (ouch). I 
really would be much happier with a consistent 10 msec (which adds up to 
about the same amount of time in total) or even 25 msec.

- "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" changes the pauses: I get 
100 msec and 1 sec pauses instead, but 5x as many, for a slower overall 
time, and 1 sec is still far too slow.
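
As a quick check of the arithmetic in the third point, here is a minimal sketch that computes the total and mean latency from the distribution quoted above. The class and method names are made up for illustration, and the "3 sec or more" bucket is treated as exactly 3000 msec, so the result is a lower bound.

```java
import java.util.Locale;

// Illustrative only: names are hypothetical, figures come from the message above.
final class LatencyMean {
    /** Total time in msec: the sum of counts[i] * latenciesMs[i]. */
    static long totalMillis(long[] counts, long[] latenciesMs) {
        if (counts.length != latenciesMs.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            total += counts[i] * latenciesMs[i];
        }
        return total;
    }

    /** Mean latency in msec over all queries. */
    static double meanMillis(long[] counts, long[] latenciesMs) {
        long queries = 0;
        for (long c : counts) {
            queries += c;
        }
        return (double) totalMillis(counts, latenciesMs) / queries;
    }

    public static void main(String[] args) {
        long[] counts = {49_500, 450, 50};
        long[] latenciesMs = {3, 300, 3_000}; // "3 sec or more" taken as 3000 msec
        System.out.printf(Locale.ROOT, "total = %d msec, mean = %.2f msec%n",
                totalMillis(counts, latenciesMs), meanMillis(counts, latenciesMs));
        // prints: total = 433500 msec, mean = 8.67 msec
    }
}
```

The mean comes out just under 9 msec, which is why a consistent 10 msec per query would cost roughly the same total time.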

Your solution looks possible, but seems really too complex for what I am 
trying to do (which is basic incremental update). What I really am 
looking for is a way to avoid reopening the first segment of my FSDir. I 
have a single 6G segment, and then another 20-50 segments with updates, 
but they are <100M in total size. So if I could have Lucene open just 
the segments file and the new or changed *.del and *.cfs files (without 
reopening the unchanged *.cfs files), I think that would be a huge win 
for me.

It strikes me this should be possible with a thin but complex layer 
between the SegmentReader and MultiReader, and perhaps a way to get 
SegmentReader to update what *.del file it is using. I'm just curious 
why this doesn't already exist.
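
To make the idea concrete, here is a minimal sketch of such a layer. It is not Lucene code: SegmentReuseSketch, SegmentHandle and the "deletion generation" number are all hypothetical stand-ins, and the sketch assumes the segments file can be read as a map from segment name to the generation of its current *.del file. On each refresh it keeps handles for unchanged segments, reloads only deletions where the generation moved, and opens only segments it has not seen before.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these types are Lucene classes.
final class SegmentReuseSketch {
    /** Stand-in for an expensive per-segment reader. */
    static final class SegmentHandle {
        final String name;
        long delGen; // generation of the deletions file currently applied

        SegmentHandle(String name, long delGen) {
            this.name = name;
            this.delGen = delGen;
        }
    }

    private Map<String, SegmentHandle> open = new LinkedHashMap<>();
    int opens = 0;      // how many expensive segment opens were needed
    int delReloads = 0; // how many cheap deletion reloads were needed

    /** @param segments segment name -> current deletion generation (assumed format) */
    void refresh(Map<String, Long> segments) {
        Map<String, SegmentHandle> next = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : segments.entrySet()) {
            SegmentHandle h = open.get(e.getKey());
            if (h == null) {
                // New segment: the only case that pays the full open cost.
                h = new SegmentHandle(e.getKey(), e.getValue());
                opens++;
            } else if (h.delGen != e.getValue()) {
                // Same segment data, newer deletions: reload deletions only.
                // A real version must not mutate a handle that in-flight
                // searches are still using; this sketch ignores that.
                h.delGen = e.getValue();
                delReloads++;
            }
            next.put(e.getKey(), h);
        }
        // Handles for segments that were merged away are simply dropped here;
        // a real version would close them.
        open = next;
    }

    int size() {
        return open.size();
    }

    public static void main(String[] args) {
        SegmentReuseSketch s = new SegmentReuseSketch();
        Map<String, Long> v1 = new LinkedHashMap<>();
        v1.put("_big", 1L);
        v1.put("_a", 0L);
        s.refresh(v1);

        Map<String, Long> v2 = new LinkedHashMap<>(v1);
        v2.put("_big", 2L); // only the deletions of the large segment changed
        v2.put("_b", 0L);   // one new small segment
        s.refresh(v2);
        System.out.println("opens=" + s.opens + " delReloads=" + s.delReloads);
    }
}
```

In the usage in main, the second refresh never reopens the large segment: it costs one small segment open and one deletion reload.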

Tim

Erick Erickson wrote:
> Why do you believe that it's the gc? I admit I just scanned your
> e-mail, but I *do* know that the first search (especially sorts) on
> a newly-opened IndexReader incurs a bunch of overhead. Could
> that be what you're seeing?
>
> I'm not sure there is a "best practice", but I have seen two
> solutions mentioned, both more complex than opening/closing
> the reader.
> 1> open the reader in the background, fire a few "warmup" queries
> at it, then switch it with the one you actually use to answer queries.
>
> 2> Use a RAMDirectory to hold your new entries for some period
> of time. You'd have to do some fancy dancing to keep this straight
> since you're updating documents, but it might be viable. The scheme
> is something like
> Open your FSDIR
> Open a RAMdir.
>
> Add all new documents to BOTH of them. When servicing a query,
> look in both indexes, but you only open/close the RAMdir for
> every query. Note that since, when you open a reader, it
> takes a snapshot of the index, these two views will be disjoint. When you
> get your results back, you'll have to do something about the documents
> from the FSdir that have been replaced in the RAMdir, which is where
> the fancy dancing part comes in. But I leave that as an exercise for
> the reader.
>
> Periodically, shut everything down and repeat. The point here is that
> you can (probably) close/open your RAMdir with very small costs and
> have the whole thing be up to date.
>
> There'll be some coordination issues, and you'll have to cope with data
> integrity if your process barfs before you've closed your FSDir....
>
> Or, you could ask whether 5 seconds is really necessary. I've seen a lot
> of times when "real time" could be 5 minutes and nobody would really
> complain, and other times when it really is critical. But that's between you
> and your Product Manager....
>
> Hope this helps
> Erick
>
> On 7/25/07, Tim Sturge <tsturge@metaweb.com> wrote:
>   
>> Hi,
>>
>> I am indexing a set of constantly changing documents. The change rate is
>> moderate (about 10 docs/sec over a 10M document collection with a 6G
>> total size) but I want to be right up to date (ideally within a second
>> but within 5 seconds is acceptable) with the index.
>>
>> Right now I have code that adds new documents to the index and deletes
>> old ones using updateDocument() in the 2.1 IndexWriter. In order to see
>> the changes, I need to recreate the IndexReader/IndexSearcher every
>> second or so. I am not calling optimize() on the index in the writer,
>> and the mergeFactor is 10.
>>
>> The problem I am facing is that java gc is terrible at collecting the
>> IndexSearchers I am discarding. I usually have a 3msec query time, but I
>> get gc pauses of 300msec to 3 sec (I assume it is collecting the
>> "tenured" generation in these pauses, which is my old IndexSearcher)
>>
>> I've tried "-Xincgc", "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" and
>> calling System.gc() right after I close the old index without much luck
>> (I get the pauses down to 1sec, but get 3x as many. I want < 25 msec
>> pauses). So my question is, should I be avoiding reloading my index in
>> this way? Should I keep a separate IndexReader (which only deletes old
>> documents) and one for new documents? Is there a standard technique for
>> a quickly changing index?
>>
>> Thanks,
>>
>> Tim
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
>   



