lucene-dev mailing list archives

From "Michael Busch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1879) Parallel incremental indexing
Date Fri, 06 Nov 2009 17:57:32 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774338#action_12774338 ]

Michael Busch commented on LUCENE-1879:
---------------------------------------

{quote}
Can you elaborate on this? How is addIndexes* term-at-a-time?
{quote}

Let's say we have an index 1 with two fields, a and b, and you want to create a new parallel
index 2 into which you copy all posting lists of field b. You can achieve this with addDocument():
iterate over all posting lists in 1b in parallel and, for each document in 1, create a
corresponding document in 2 that contains the terms whose posting lists in 1b have a posting
for the current document. This is what I called the "document-at-a-time approach".

However, this is terribly slow (I tried it out), because you perform I/O on all of these posting
lists in parallel. It's far more efficient to copy each posting list from 1b to 2 in its entirety,
because then you only perform sequential I/O. And if you use 2.addIndexes(IndexReader(1b)),
exactly this happens, because addIndexes(IndexReader) uses the SegmentMerger to add the
index. The SegmentMerger iterates over the dictionary and consumes the posting lists sequentially.
That's why I called this the "term-at-a-time approach". In my experience, for a use case similar
to the one I described here, it is orders of magnitude more efficient. My doc-at-a-time
algorithm ran ~20 hours, the term-at-a-time one 8 minutes! The resulting indexes were identical.
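
To make the difference between the two access patterns concrete, here is a toy sketch in plain
Java. It is not Lucene code and does not use the Lucene API: a field is modeled as a sorted map
from term to ascending docIDs, and the class name, the method names and the listSwitches counter
are all made up for illustration. The counter only records how often a copy loop moves to a
different source posting list, as a rough stand-in for non-sequential I/O. It says nothing about
real timings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model (hypothetical, not Lucene): a field is a sorted map term -> ascending docIDs. */
final class PostingCopySketch {

    /** How often the most recent copy moved to a different source posting list. */
    static long listSwitches;

    /** Document-at-a-time: for every docID, probe every posting list via its own cursor. */
    static Map<String, List<Integer>> copyDocAtATime(Map<String, int[]> source, int maxDoc) {
        listSwitches = 0;
        List<String> terms = new ArrayList<>(source.keySet());
        int[] cursor = new int[terms.size()];            // one open cursor per posting list
        Map<String, List<Integer>> target = new TreeMap<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            for (int t = 0; t < terms.size(); t++) {
                listSwitches++;                           // we touch list t once per document
                int[] postings = source.get(terms.get(t));
                if (cursor[t] < postings.length && postings[cursor[t]] == doc) {
                    target.computeIfAbsent(terms.get(t), k -> new ArrayList<>()).add(doc);
                    cursor[t]++;
                }
            }
        }
        return target;
    }

    /** Term-at-a-time: walk the dictionary once and copy each posting list in one pass. */
    static Map<String, List<Integer>> copyTermAtATime(Map<String, int[]> source) {
        listSwitches = 0;
        Map<String, List<Integer>> target = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : source.entrySet()) {
            listSwitches++;                               // each list is visited exactly once
            if (entry.getValue().length == 0) {
                continue;                                 // a term without postings is skipped
            }
            List<Integer> copy = new ArrayList<>(entry.getValue().length);
            for (int doc : entry.getValue()) {
                copy.add(doc);
            }
            target.put(entry.getKey(), copy);
        }
        return target;
    }

    /** Made-up sample data standing in for field b of index 1 (docIDs 0..3). */
    static Map<String, int[]> sampleField() {
        Map<String, int[]> field = new TreeMap<>();
        field.put("apple", new int[] {0, 2, 3});
        field.put("banana", new int[] {1, 2});
        field.put("cherry", new int[] {3});
        return field;
    }

    public static void main(String[] args) {
        Map<String, int[]> source = sampleField();
        Map<String, List<Integer>> byDoc = copyDocAtATime(source, 4);
        long docSwitches = listSwitches;
        Map<String, List<Integer>> byTerm = copyTermAtATime(source);
        long termSwitches = listSwitches;
        System.out.println("identical result: " + byDoc.equals(byTerm));
        System.out.println("list switches, doc-at-a-time: " + docSwitches);
        System.out.println("list switches, term-at-a-time: " + termSwitches);
    }
}
```

In this model the document-at-a-time loop touches a posting list once per (document, term) pair,
while the term-at-a-time loop touches each list once, and both produce the same target.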


> Parallel incremental indexing
> -----------------------------
>
>                 Key: LUCENE-1879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1879
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>             Fix For: 3.1
>
>         Attachments: parallel_incremental_indexing.tar
>
>
> A new feature that allows building parallel indexes and keeping them in sync on a docID
> level, independent of the choice of the MergePolicy/MergeScheduler.
> Find details on the wiki page for this feature:
> http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing 
> Discussion on java-dev:
> http://markmail.org/thread/ql3oxzkob7aqf3jd
