lucene-java-user mailing list archives

From Tim Sturge <tstu...@metaweb.com>
Subject Re: java gc with a frequently changing index?
Date Mon, 30 Jul 2007 21:16:26 GMT
Oh, yeah, I know now :-).  But I really do have a requirement to show 
search results from items that came in 5 seconds ago. We have an 
application where a common usage pattern is

  1. add an item
  2. navigate to another item
  3. search for the first item (to associate it with the second item)

and the gap between step 1 and step 3 is not very long.

Right now, I get a notification within a few hundred msec that the item 
has been added. I just don't see why it is hard (in theory anyway, 
lucene's implementation notwithstanding) to put that on the end of the 
index I'm currently searching. I have lots of CPU available.
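
To make the idea concrete, here is a minimal sketch in plain Java of searching a large, rarely reopened base together with a small live tail of recent items. It is a toy model under stated assumptions, not Lucene code: the class name OverlaySearch, the method addOrUpdate, and the id-to-text maps are all hypothetical stand-ins, and substring matching stands in for a real query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical, not Lucene API): a big base that is never reopened,
// plus a small overlay holding items added or replaced since the base was built.
class OverlaySearch {
    private final Map<String, String> base;                            // id -> text, stands in for the large on-disk index
    private final Map<String, String> recent = new LinkedHashMap<>();  // id -> text, stands in for the small live tail

    OverlaySearch(Map<String, String> base) {
        this.base = new LinkedHashMap<>(base); // defensive copy; the base is treated as read-only
    }

    // New or changed items go only into the overlay, so they are visible immediately.
    synchronized void addOrUpdate(String id, String text) {
        recent.put(id, text);
    }

    // Returns the ids of matching items; an overlay entry hides the base entry with the same id.
    synchronized List<String> search(String term) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> e : base.entrySet()) {
            if (!recent.containsKey(e.getKey()) && e.getValue().contains(term)) {
                ids.add(e.getKey());
            }
        }
        for (Map.Entry<String, String> e : recent.entrySet()) {
            if (e.getValue().contains(term)) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }
}
```

In this sketch an item passed to addOrUpdate is searchable on the very next call to search, and the base is never rebuilt. Folding the overlay back into the base periodically is left out.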

Can you tell me the JIRA issue? What kind of patch would lucene devs be 
likely to accept (do I need to get it 100% done, or is 80% of the way 
interesting?)

Tim

Mark Miller wrote:
> And by the way, I cannot see it ever making sense to keep reopening an index
> reader every second or so. It has to be MUCH more efficient to wait even
> 2 or 4 seconds... even that is going to be pretty nasty, but you have
> to allow for a bit of batching, man. You will waste so much time opening those
> readers that it's not going to be real-time anyway. You are just going to be
> in a world of slow.
>
> On 7/30/07, Mark Miller <markrmiller@gmail.com> wrote:
>   
>> I believe there is an issue in JIRA that handles reopening an IndexReader
>> without reopening segments that have not changed.
>>
>> On 7/30/07, Tim Sturge < tsturge@metaweb.com> wrote:
>>     
>>> Thanks for the reply Erick,
>>>
>>> I believe it is the gc for four reasons:
>>>
>>> - I've tried the "warmup" approach already and it didn't change the
>>> situation.
>>>
>>> - The server completely pauses for several seconds. I run jstack to find
>>> out where the pause is, and it also pauses for several seconds before
>>> telling me the server is doing something perfectly innocuous. If I was
>>> stuck in some search overhead, I would expect jstack to tell me where
>>> (and I would expect the where to be somewhere interesting and vaguely
>>> repeatable)
>>>
>>> - The impact is very uneven. Over 50000 queries (sequentially) I get
>>> 49500 at 3 msec, 450 at 300 msec and 50 at 3 sec or more (ouch). I
>>> really would be much happier with a consistent 10msec (which adds up to
>>> the same amount of time in total) or even 25msec
>>>
>>> - "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" changes the pauses (I get
>>> 100 msec and 1 sec pauses instead, but 5x as many for slower overall
>>> time; 1 sec is far too slow)
>>>
>>> Your solution looks possible, but seems really too complex for what I am
>>> trying to do (which is basic incremental update). What I really am
>>> looking for is a way to avoid reopening the first segment of my FSDir. I
>>> have a single 6G segment, and then another 20-50 segments with updates,
>>> but they are <100M in total size. So if I could have lucene open just
>>> the segments file and the new or changed *.del and *.cfs files (without
>>> reopening the unchanged *.cfs files) that would be a huge win for me I
>>> think.
>>>
>>> It strikes me this should be possible with a thin but complex layer
>>> between the SegmentReader and MultiReader, and perhaps a way to get
>>> SegmentReader to update what *.del file it is using. I'm just curious
>>> why this doesn't already exist.
>>>
>>> Tim
>>>
>>> Erick Erickson wrote:
>>>       
>>>> Why do you believe that it's the gc? I admit I just scanned your
>>>> e-mail, but I *do* know that the first search (especially sorts) on
>>>> a newly-opened IndexReader incurs a bunch of overhead. Could
>>>> that be what you're seeing?
>>>>
>>>> I'm not sure there is a "best practice", but I have seen two
>>>> solutions mentioned, both more complex than opening/closing
>>>> the reader.
>>>> 1> open the reader in the background, fire a few "warmup" queries
>>>> at it, then switch it with the one you actually use to answer queries.
>>>>         
>>>> 2> Use a RAMDirectory to hold your new entries for some period
>>>> of time. You'd have to do some fancy dancing to keep this straight
>>>> since you're updating documents, but it might be viable. The scheme
>>>> is something like
>>>> Open your FSDIR
>>>> Open a RAMdir.
>>>>
>>>> Add all new documents to BOTH of them. When servicing a query,
>>>> look in both indexes, but you only open/close the RAMdir for
>>>> every query. Note that since, when you open a reader, it
>>>> takes a snapshot of the index, these two views will be disjoint. When
>>>> you get your results back, you'll have to do something about the documents
>>>> from the FSdir that have been replaced in the RAMdir, which is where
>>>> the fancy dancing part comes in. But I leave that as an exercise for
>>>> the reader.
>>>>
>>>> Periodically, shut everything down and repeat. The point here is that
>>>> you can (probably) close/open your RAMdir with very small costs and
>>>> have the whole thing be up to date.
>>>>
>>>> There'll be some coordination issues, and you'll have to cope with data
>>>> integrity if your process barfs before you've closed your FSDir....
>>>>
>>>> Or, you could ask whether 5 seconds is really necessary. I've seen a lot
>>>> of times when "real time" could be 5 minutes and nobody would really
>>>> complain, and other times when it really is critical. But that's between you
>>>> and your Product Manager....
>>>>
>>>> Hope this helps
>>>> Erick
>>>>
>>>> On 7/25/07, Tim Sturge <tsturge@metaweb.com> wrote:
>>>>
>>>>         
>>>>> Hi,
>>>>>
>>>>> I am indexing a set of constantly changing documents. The change rate is
>>>>> moderate (about 10 docs/sec over a 10M document collection with a 6G
>>>>> total size) but I want to be right up to date (ideally within a second
>>>>> but within 5 seconds is acceptable) with the index.
>>>>>
>>>>> Right now I have code that adds new documents to the index and deletes
>>>>> old ones using updateDocument() in the 2.1 IndexWriter. In order to see
>>>>> the changes, I need to recreate the IndexReader/IndexSearcher every
>>>>> second or so. I am not calling optimize() on the index in the writer,
>>>>> and the mergeFactor is 10.
>>>>>
>>>>> The problem I am facing is that java gc is terrible at collecting the
>>>>> IndexSearchers I am discarding. I usually have a 3msec query time, but I
>>>>> get gc pauses of 300msec to 3 sec (I assume it is collecting the
>>>>> "tenured" generation in these pauses, which is my old IndexSearcher)
>>>>>
>>>>> I've tried "-Xincgc", "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" and
>>>>> calling System.gc() right after I close the old index without much luck
>>>>> (I get the pauses down to 1sec, but get 3x as many. I want < 25 msec
>>>>> pauses). So my question is, should I be avoiding reloading my index in
>>>>> this way? Should I keep a separate IndexReader (which only deletes old
>>>>> documents) and one for new documents? Is there a standard technique for
>>>>> a quickly changing index?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>         
>>>
>>>
>>>       
>
>   



