lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mark Miller" <markrmil...@gmail.com>
Subject Re: java gc with a frequently changing index?
Date Mon, 30 Jul 2007 20:23:32 GMT
And by the way, I cannot see it ever making sense to keep reopening an index
reader every second or so. It has to be MUCH more efficient to even wait
every 2 or 4 seconds...even that is going to be pretty nasty, but you have
to allow for a bit of batch man. You will waste so much time opening those
readers that its not going to be real-time anyway. You are just going to be
in a world of slow.

On 7/30/07, Mark Miller <markrmiller@gmail.com> wrote:
>
> I believe there is an issue in JIRA that handles reopening an IndexReader
> without reopening segments that have not changed.
>
> On 7/30/07, Tim Sturge < tsturge@metaweb.com> wrote:
> >
> > Thanks for the reply Erick,
> >
> > I believe it is the gc for four reasons:
> >
> > - I've tried the "warmup" approach alredy and it didn't change the
> > situation.
> >
> > - The server completely pauses for several seconds. I run jstack to find
> > out where the pause is, and it also pauses for several seconds before
> > telling me the server is doing something perfectly innocuous. If I was
> > stuck in some search overhead, I would expect jstack to tell me where
> > (and I would expect the where to be somewhere interesting and vaguely
> > repeatable)
> >
> > - The impact is very uneven. Over 50000 queries (sequentially) I get
> > 49500 at 3 msec, 450 at 300 msec and 50 at 3 sec or more (ouch). I
> > really would be much happier with a consistent 10msec (which adds up to
> > the same amount of time in total) or even 25msec
> >
> > - "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" changes the pauses (I get
> > 100 msec and 1 sec pauses instead, but 5x as many for slower overall
> > time; 1 sec is far too slow)
> >
> > Your solution looks possible, but seems really too complex for what I am
> > trying to do (which is basic incremental update). What I really am
> > looking for is a way to avoid reopening the first segment of my FSDir. I
> >
> > have a single 6G segment, and then another 20-50 segments with updates,
> > but they are <100M in total size. So if I could have lucene open just
> > the segments file and the new or changed *.del and *.cfs files (without
> > reopening the unchanged *.cfs files) that would be a huge win for me I
> > think.
> >
> > It strikes me this should be possible with a thin but complex layer
> > between the SegmentReader and MultiReader, and perhaps a way to get
> > SegmentReader to update what *.del file it is using. I'm just curious
> > why this doesn't already exist.
> >
> > Tim
> >
> > Erick Erickson wrote:
> > > Why do you believe that it's the gc? I admit i just scanned your
> > > e-mail, but I *do* know that the first search (especially sorts) on
> > > a newly-opened IndexReader incure a bunch of overhead. Could
> > > that be what you're seeing?
> > >
> > > I'm not sure there is a "best practice", but I have seen two
> > > solutions mentioned, both more complex than opening/closing
> > > the reader.
> > > 1> open the reader in the background, fire a few "warmup" queries
> > > at it, then switch it with the one you actually use to answer queries.
> >
> > >
> > > 2> Use a RAMDirectory to hold your new entries for some period
> > > of time. You'd have to do some fancy dancing to keep this straight
> > > since you're updating documents, but it might be viable. The scheme
> > > is something like
> > > Open your FSDIR
> > > Open a RAMdir.
> > >
> > > Add all new documents to BOTH of them. When servicing a query,
> > > look in both indexes, but you only open/close the RAMdir for
> > > every query. Note that since, when you open a reader, it
> > > takes a snapshot of the index, these two views will be disjoint. When
> > you
> > > get your results back, you'll have to do something about the documents
> >
> > > from the FSdir that have been replaced in the RAMdir, which is where
> > > the fancy dancing part comes in. But I leave that as an exercise for
> > > the reader.
> > >
> > > Periodically, shut everything down and repeat. The point here is that
> > > you can (probably) close/open your RAMdir with very small costs and
> > > have the whole thing be up to date.
> > >
> > > There'll be some coordination issues, and you'll have to cope with
> > data
> > > integrity if your process barfs before you've closed your FSDir....
> > >
> > > Or, you could ask whether 5 seconds is really necessary.I've seen a
> > lot
> > > of times when "real time" could be 5 minutes and nobody would really
> > > complain, and other times when it really is critical. But that's
> > between you
> > > and our Product Manager....
> > >
> > > Hope this helps
> > > Erick
> > >
> > > On 7/25/07, Tim Sturge <tsturge@metaweb.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I am indexing a set of constantly changing documents. The change rate
> > is
> > >> moderate (about 10 docs/sec over a 10M document collection with a 6G
> > >> total size) but I want to be  right up to date (ideally within a
> > second
> > >> but within 5 seconds is acceptable) with the index.
> > >>
> > >> Right now I have code that adds new documents to the index and
> > deletes
> > >> old ones using updateDocument() in the 2.1 IndexWriter. In order to
> > see
> > >> the changes, I need to recreate the IndexReader/IndexSearcher every
> > >> second or so. I am not calling optimize() on the index in the writer,
> > >> and the mergeFactor is 10.
> > >>
> > >> The problem I am facing is that java gc is terrible at collecting the
> >
> > >> IndexSearchers I am discarding. I usually have a 3msec query time,
> > but I
> > >> get gc pauses of 300msec to 3 sec (I assume is is collecting the
> > >> "tenured" generation in these pauses, which is my old IndexSearcher)
> > >>
> > >> I've tried "-Xincgc", "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" and
> > >> calling System.gc() right after I close the old index without much
> > luck
> > >> (I get the pauses down to 1sec, but get 3x as many. I want < 25 msec
> > >> pauses). So my question is, should I be avoiding reloading my index
> > in
> > >> this way? Should I keep a separate IndexReader (which only deletes
> > old
> > >> documents) and one for new documents? Is there a standard technique
> > for
> > >> a quickly changing index?
> > >>
> > >> Thanks,
> > >>
> > >> Tim
> > >>
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>
> > >>
> > >>
> > >
> > >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message