lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Lucene update performance
Date Wed, 10 May 2017 17:30:52 GMT
IndexWriter simply buffers that Query you passed to deleteDocuments, so
that's very fast.

Only later will it (lazily) resolve that Query to the docIDs to delete,
which is the costly part; that resolution happens when a merge wants to
kick off, on a refresh, or on a commit.

What Query are you using to identify documents to delete?

Mike McCandless

http://blog.mikemccandless.com
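The updateDocument-vs-addDocument point in the quoted discussion below can
be sketched with a toy index (plain Java, not Lucene code; the counter and
internals are invented for illustration): an update must first do a
per-document key lookup and delete before adding, while a from-scratch add
skips that step.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of why updateDocument costs more than addDocument:
// update = delete-by-key, then add; add alone does no key lookup.
class ToyIndexSketch {
    private final Map<String, String> docs = new HashMap<>();
    int keyLookups = 0; // stands in for the per-document PK-lookup cost

    // From-scratch indexing: just add, no existence check.
    void addDocument(String key, String doc) {
        docs.put(key, doc);
    }

    // Update semantics: delete any doc with the same key, then add.
    void updateDocument(String key, String doc) {
        keyLookups++;        // the lookup that add avoids
        docs.remove(key);    // the delete step
        docs.put(key, doc);  // then the add
    }

    int size() {
        return docs.size();
    }
}
```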

On Tue, May 9, 2017 at 1:13 PM, Kudrettin Güleryüz <kudrettin@gmail.com>
wrote:

> Fair enough, however, I see this:
> $ cat log
> Tue May  9 07:19:45 EDT 2017: Indexing starts
> Tue May  9 07:32:33 EDT 2017: Deletion starts with a list of 1278635 files
> Tue May  9 07:49:47 EDT 2017: Deletion complete, Addition starts with
> 1272334 files
>
> $ date
> Tue May  9 13:12:58 EDT 2017
>
> I am using the two-phase commit model. The deletion logic above uses
> writer.deleteDocuments(query), and the addition uses
> writer.addDocument(doc). Judging simply from this log, deletion doesn't
> seem to be taking long. What am I missing?
>
>
> On Tue, May 9, 2017 at 10:23 AM Adrien Grand <jpountz@gmail.com> wrote:
>
> > addDocument can be a significant gain compared to updateDocument, as
> > doing a PK lookup on a unique field has a cost that is not negligible
> > compared to indexing a document, especially if the indexing chain is
> > simple (no large text fields with complex analyzers). Reindexing in
> > place will also cause more merging. Overall I find the 3x factor a bit
> > high, but not too surprising if documents and the analysis chain are
> > simple, and/or if storage is slow.
> >
> > On Tue, May 9, 2017 at 4:06 PM, Rob Audenaerde <rob.audenaerde@gmail.com>
> > wrote:
> >
> > > As far as I know, the updateDocument method on the IndexWriter does a
> > > delete and an add. See also the javadoc:
> > >
> > > [..] Updates a document by first deleting the document(s)
> > >     containing term and then adding the new
> > >     document.  The delete and then add are atomic as seen
> > >     by a reader on the same index (flush may happen only after
> > >     the add). [..]
> > >
> > >
> > > On Tue, May 9, 2017 at 3:37 PM, Kudrettin Güleryüz <kudrettin@gmail.com>
> > > wrote:
> > >
> > > > I do update the entire document each time. Furthermore, this
> > > > sometimes means deleting compressed archives, which are stored as
> > > > multiple documents per compressed archive file, and re-adding them.
> > > >
> > > > Is there an update method, and does it perform better than remove
> > > > then add? I was simply removing modified files from the index (which
> > > > doesn't seem to take long) and re-adding them.
> > > >
> > > > On Tue, May 9, 2017 at 9:33 AM, Rob Audenaerde <rob.audenaerde@gmail.com>
> > > > wrote:
> > > >
> > > > > Do you update each entire document? (vs. updating numeric
> > > > > docvalues?)
> > > > >
> > > > > That is implemented as 'delete and add', so I guess that will be
> > > > > slower than clean-sheet indexing. Not sure it is 3x slower, though;
> > > > > that seems a bit much.
> > > > >
> > > > > On Tue, May 9, 2017 at 3:24 PM, Kudrettin Güleryüz <kudrettin@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > For a 5.2.1 index that contains around 1.2 million documents,
> > > > > > updating the index with 1.3 million files seems to take 3x longer
> > > > > > than doing a scratch indexing. (Files are crawled over NFS;
> > > > > > indexes are stored on a mechanical disk locally (Btrfs).)
> > > > > >
> > > > > > Is this expected from Lucene's update-index logic, or should I
> > > > > > further debug my part of the code for update performance?
> > > > > >
> > > > > > Thank you,
> > > > > > Kudret
> > > > > >
> > > > >
> > > >
> > >
> >
>
