lucene-java-user mailing list archives

From Kudrettin Güleryüz <kudret...@gmail.com>
Subject Re: Lucene update performance
Date Wed, 10 May 2017 18:20:27 GMT
I see, that makes more sense now.

The query is a BooleanQuery. Here is what I do:
https://gist.github.com/Kudret/56879bf30fa129e752895305e1db5a80

On Wed, May 10, 2017 at 1:31 PM Michael McCandless <lucene@mikemccandless.com> wrote:

> IndexWriter simply buffers that Query you passed to deleteDocuments, so
> that's very fast.
>
> Only later on will it (lazily) resolve that Query to the docIDs to delete,
> which is the costly part, when a merge wants to kick off, or a refresh, or
> a commit.
>
> What Query are you using to identify documents to delete?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, May 9, 2017 at 1:13 PM, Kudrettin Güleryüz <kudrettin@gmail.com>
> wrote:
>
>> Fair enough, however, I see this:
>> $ cat log
>> Tue May  9 07:19:45 EDT 2017: Indexing starts
>> Tue May  9 07:32:33 EDT 2017: Deletion starts with a list of 1278635 files
>> Tue May  9 07:49:47 EDT 2017: Deletion complete, Addition starts with 1272334 files
>>
>> $ date
>> Tue May  9 13:12:58 EDT 2017
>>
>> I am using a two-phase commit model. The deletion logic above uses
>> writer.deleteDocuments(query), and addition uses
>> writer.addDocument(doc). Judging simply from this log, deletion doesn't
>> seem to be taking long. What am I missing?
>>
>>
>> On Tue, May 9, 2017 at 10:23 AM Adrien Grand <jpountz@gmail.com> wrote:
>>
>> > addDocument can be a significant gain compared to updateDocument, as doing a
>> > PK lookup on a unique field has a cost that is not negligible compared to
>> > indexing a document, especially if the indexing chain is simple (no large
>> > text fields with complex analyzers). Reindexing in place will also cause
>> > more merging. Overall I find the 3x factor a bit high, but not too
>> > surprising if documents and the analysis chain are simple, and/or if
>> > storage is slow.
>> >
>> > On Tue, May 9, 2017 at 4:06 PM, Rob Audenaerde <rob.audenaerde@gmail.com> wrote:
>> >
>> > > As far as I know, the updateDocument method on IndexWriter deletes and
>> > > then adds. See also the javadoc:
>> > >
>> > > [..] Updates a document by first deleting the document(s)
>> > >     containing term and then adding the new
>> > >     document.  The delete and then add are atomic as seen
>> > >     by a reader on the same index (flush may happen only after
>> > >     the add). [..]
>> > >
>> > >
>> > > On Tue, May 9, 2017 at 3:37 PM, Kudrettin Güleryüz <kudrettin@gmail.com>
>> > > wrote:
>> > >
>> > > > I do update the entire document each time. Furthermore, this sometimes
>> > > > means deleting compressed archives, which are stored as multiple
>> > > > documents per archive file, and re-adding them.
>> > > >
>> > > > Is there an update method, and does it perform better than remove then
>> > > > add? I was simply removing modified files from the index (which doesn't
>> > > > seem to take long) and re-adding them.
>> > > >
>> > > > On Tue, May 9, 2017 at 9:33 AM Rob Audenaerde <rob.audenaerde@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Do you update each entire document (vs. updating numeric docvalues)?
>> > > > >
>> > > > > That is implemented as 'delete and add', so I guess it will be slower
>> > > > > than clean-sheet indexing. Not sure if it is 3x slower; that seems a
>> > > > > bit much?
>> > > > >
>> > > > > On Tue, May 9, 2017 at 3:24 PM, Kudrettin Güleryüz <kudrettin@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi,
>> > > > > >
>> > > > > > For a 5.2.1 index that contains around 1.2 million documents,
>> > > > > > updating the index with 1.3 million files seems to take 3X longer
>> > > > > > than indexing from scratch. (Files are crawled over NFS; indexes
>> > > > > > are stored locally on a mechanical disk (Btrfs).)
>> > > > > >
>> > > > > > Is this expected for Lucene's update index logic, or should I
>> > > > > > further debug my part of the code for update performance?
>> > > > > >
>> > > > > > Thank you,
>> > > > > > Kudret
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
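
Mike's point above (deleteDocuments only buffers the Query; the matching docIDs are resolved later, when a merge, refresh, or commit happens) explains why the log shows deletion finishing quickly. Below is a toy model in plain Java, as a sketch only. It is not Lucene code: the class BufferedDeleteModel, its String "documents", and its Predicate "queries" are hypothetical stand-ins that borrow only the method names addDocument, deleteDocuments, and commit. It also assumes that a buffered delete applies only to documents added before it, so a delete followed by a re-add keeps the new copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Toy model (NOT Lucene): deletes are buffered and only resolved at commit. */
class BufferedDeleteModel {

    /** A buffered delete: the query plus how many docs existed when it was issued. */
    private static final class PendingDelete {
        final Predicate<String> query;
        final int upto; // applies only to docs at positions < upto

        PendingDelete(Predicate<String> query, int upto) {
            this.query = query;
            this.upto = upto;
        }
    }

    private final List<String> docs = new ArrayList<>();
    private final List<PendingDelete> pending = new ArrayList<>();

    void addDocument(String doc) {
        docs.add(doc);
    }

    /** Cheap: only remembers the query; no document is examined here. */
    void deleteDocuments(Predicate<String> query) {
        pending.add(new PendingDelete(query, docs.size()));
    }

    int docCount() {
        return docs.size();
    }

    int pendingDeleteCount() {
        return pending.size();
    }

    /** Costly: every buffered query is matched against documents here. */
    int commit() {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            String doc = docs.get(i);
            boolean deleted = false;
            for (PendingDelete d : pending) {
                if (i < d.upto && d.query.test(doc)) {
                    deleted = true;
                    break;
                }
            }
            if (!deleted) {
                kept.add(doc);
            }
        }
        int removed = docs.size() - kept.size();
        docs.clear();
        docs.addAll(kept);
        pending.clear();
        return removed;
    }

    public static void main(String[] args) {
        BufferedDeleteModel writer = new BufferedDeleteModel();
        writer.addDocument("a.txt v1");
        writer.addDocument("b.txt v1");

        // Returns immediately; the old copy is still counted until commit.
        writer.deleteDocuments(doc -> doc.startsWith("a.txt"));
        System.out.println("docs after delete call: " + writer.docCount());

        writer.addDocument("a.txt v2"); // re-add the modified file

        // The buffered delete is resolved here, not in deleteDocuments.
        System.out.println("removed at commit: " + writer.commit());
        System.out.println("docs after commit: " + writer.docCount());
    }
}
```

In this model, timing the span between "Deletion starts" and "Deletion complete" would measure only the cost of buffering the queries. The matching work shows up later, inside commit(), which in the log above falls in the addition and commit phase.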
