lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4752) Merge segments to sort them
Date Mon, 04 Mar 2013 10:45:36 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592107#comment-13592107 ]

Adrien Grand commented on LUCENE-4752:
--------------------------------------

I think a very simple first step could be to have an experimental IndexWriterConfig option that
tells IndexWriter to pass SegmentMerger an atomic sorted view of the segments to merge (easy once
LUCENE-3918 is committed) instead of the segments themselves (see the calls to SegmentMerger.add(SegmentReader)
in IndexWriter.mergeMiddle). Initial segments would remain unsorted, but the big ones, those
that matter for both index compression and early query termination, would be sorted.

Sorting segments over and over may seem inefficient, but I don't think we should worry too
much:
 - if we are merging "initial" segments (those created by IndexWriter.flush), they would
be small so sorting/merging them would be fast?
 - if we are merging big segments, I think the following could make the merge slower
than a regular merge:
   1. computing the new doc ID mapping,
   2. random I/O access,
   3. not being able to use the specialized codec merging methods.

But if the segments to merge are already sorted, computing the new doc ID mapping could actually be
fast (some sorting algorithms such as [TimSort|http://en.wikipedia.org/wiki/Timsort] run
in close to O(n) when the input is a succession of a few sorted sequences), and the access patterns to the
individual segments would be I/O cache-friendly (because each segment would be read sequentially).
So I think this approach could be fast enough?
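
To make the doc ID mapping argument concrete, here is a minimal sketch in plain Java. It does not use any Lucene API: the class name SortedMergeSketch, the method docIdMapping and the use of one long sort key per document are all assumptions made for illustration. It relies on the fact that Arrays.sort on an object array is a TimSort variant in Java 7 and later, so concatenated, already-sorted segments are handled as a few long runs.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; none of these names exist in Lucene.
final class SortedMergeSketch {

    /**
     * Old doc IDs are assigned by concatenating the segments in order.
     * Returns oldToNew, where oldToNew[oldDocId] is the doc ID in the
     * merged segment once documents are ordered by their sort key.
     * Each inner array holds the (already sorted) keys of one segment.
     */
    static int[] docIdMapping(long[][] segmentKeys) {
        int total = 0;
        for (long[] segment : segmentKeys) {
            total += segment.length;
        }

        long[] keys = new long[total];       // sort key of each old doc ID
        Integer[] order = new Integer[total]; // old doc IDs, to be sorted by key
        int pos = 0;
        for (long[] segment : segmentKeys) {
            for (long key : segment) {
                keys[pos] = key;
                order[pos] = pos;
                pos++;
            }
        }

        // Stable TimSort: each sorted segment is one natural run, so the
        // work is mostly merging a few runs rather than a full sort.
        Arrays.sort(order, Comparator.comparingLong(i -> keys[i]));

        // Invert the permutation: order[newId] is the old doc ID.
        int[] oldToNew = new int[total];
        for (int newId = 0; newId < total; newId++) {
            oldToNew[order[newId]] = newId;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        long[][] segments = { {1, 4, 9}, {2, 3, 10} };
        // prints [0, 3, 4, 1, 2, 5]
        System.out.println(Arrays.toString(docIdMapping(segments)));
    }
}
```

Because the sort is stable, documents with equal keys keep their original relative order, which keeps the mapping deterministic across merges.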
                
> Merge segments to sort them
> ---------------------------
>
>                 Key: LUCENE-4752
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4752
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: David Smiley
>            Assignee: Adrien Grand
>
> It would be awesome if Lucene could write the documents out in a segment based on a configurable
order.  This of course applies to merging segments too. The benefit is increased locality on
disk of documents that are likely to be accessed together.  This often applies to documents
near each other in time, but also spatially.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

