lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless" <luc...@mikemccandless.com>
Subject Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Fri, 23 Mar 2007 16:00:41 GMT

"Yonik Seeley" <yonik@apache.org> wrote:
> On 3/22/07, Michael McCandless <lucene@mikemccandless.com> wrote:
> > Merging is costly because you read all data in then write all data
> > out, so, you want to minimize for byte of data in the index in the
> > index how many times it will be "serviced" (read in, written out) as
> > part of a merge.
> 
> Avoiding the re-writing of stored fields might be nice:
> http://www.nabble.com/Re%3A--jira--Commented%3A-%28LUCENE-565%29-Supporting-deleteDocuments-in-IndexWriter-%28Code-and-Performance-Results-Provided%29-p6177280.html

That's exactly the approach I'm taking in LUCENE-843: stored fields and term
vectors are immediately written to disk.  Only frq, prx and tis use up
memory.  This greatly extends how many docs you can buffer before
having to flush (assuming your docs have stored fields and term
vectors).

When memory is full, I either flush a segment to disk (when writer is
in autoCommit=true mode), else I flush the data to tmp files which are
finally merged into a segment when the writer is closed.  This merging
is less costly because the bytes in/out are just frq, prx and tis, so
this improves performance of autoCommit=false mode vs autoCommit=true
mode.

But, this is only for the segment created from buffered docs (ie the
segment created by a "flush").  Subsequent merges still must copy
bytes in/out and in LUCENE-843 I haven't changed anything about how
segments are merged.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message