lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges
Date Tue, 21 Jul 2009 20:57:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733822#action_12733822
] 

Michael McCandless commented on LUCENE-1076:
--------------------------------------------

bq.  if I merge two consecutive segments, how come I don't merge their doc stores

Today, multiple segments can share a single set of doc-store files
(stored fields & term vectors).  This only happens when multiple
segments are written in a single IndexWriter session with
autoCommit=false.

E.g., if you open a writer, index all of Wikipedia with
autoCommit=false, and close it, you'll see a single large set of doc
store files (e.g. _0.fdt, _0.fdx, _0.tvf, _0.tvd, _0.tvx).

Whenever RAM is full (with postings & norms data), a new segment is
flushed, but the doc store files are kept open & shared with further
flushed segments.

Each segment then refers to the shared doc stores, but records its
"offset" within them.

So, when we merge contiguous segments, because the resulting docs are
also contiguous in the doc stores, we are able to store a single doc
store offset in the merged segment, referencing the original doc
store, and it works fine.

But if we merge non-contiguous segments, we must then pull out & merge
the "slices" from the doc stores into a new [private to the new
segment] set of doc store files.
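
To make the contiguous vs. non-contiguous distinction concrete, here is
a minimal sketch in Java.  It is not IndexWriter's actual code: the
class DocStoreSketch, the SegmentRef type and the tryShareDocStore
method are hypothetical names invented for illustration, and the model
ignores deletions and every other condition the real merge logic
checks.  It only shows that a merged segment can keep pointing at the
shared doc store when its inputs form one unbroken run inside it.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model (not Lucene API) of segments sharing one doc store. */
public class DocStoreSketch {

    /** A flushed segment: which doc store it uses, where it starts, how many docs. */
    static final class SegmentRef {
        final String docStoreName; // e.g. "_0" for _0.fdt, _0.fdx, ...
        final int docStoreOffset;  // first doc of this segment inside the store
        final int docCount;

        SegmentRef(String docStoreName, int docStoreOffset, int docCount) {
            this.docStoreName = docStoreName;
            this.docStoreOffset = docStoreOffset;
            this.docCount = docCount;
        }
    }

    /**
     * Returns a merged reference into the existing doc store when the inputs
     * form one contiguous run in the same store.  Returns null otherwise,
     * meaning the slices would have to be copied into a new, private store.
     */
    static SegmentRef tryShareDocStore(List<SegmentRef> toMerge) {
        SegmentRef first = toMerge.get(0);
        int expectedOffset = first.docStoreOffset;
        int totalDocs = 0;
        for (SegmentRef seg : toMerge) {
            boolean sameStore = seg.docStoreName.equals(first.docStoreName);
            if (!sameStore || seg.docStoreOffset != expectedOffset) {
                return null; // gap or different store: cannot use a single offset
            }
            expectedOffset += seg.docCount;
            totalDocs += seg.docCount;
        }
        return new SegmentRef(first.docStoreName, first.docStoreOffset, totalDocs);
    }

    public static void main(String[] args) {
        SegmentRef a = new SegmentRef("_0", 0, 100);
        SegmentRef b = new SegmentRef("_0", 100, 50);
        SegmentRef c = new SegmentRef("_0", 150, 75);

        // a and b are adjacent in the store, so one offset covers both.
        System.out.println(tryShareDocStore(Arrays.asList(a, b)) != null);
        // a and c skip over b, so the doc-store slices would need merging.
        System.out.println(tryShareDocStore(Arrays.asList(a, c)) != null);
    }
}
```

In this model, merging a and c is the expensive case: the stored
fields and term vectors for both slices would have to be read and
rewritten into new files.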

For apps that store term vectors w/ positions & offsets, and have many
stored fields, and have heterogeneous field name -> number assignments
(see LUCENE-1737 to fix that), the merging of doc stores can easily
dominate the merge cost.


> Allow MergePolicy to select non-contiguous merges
> -------------------------------------------------
>
>                 Key: LUCENE-1076
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1076
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1076.patch
>
>
> I started work on this but with LUCENE-1044 I won't make much progress
> on it for a while, so I want to checkpoint my current state/patch.
> For backwards compatibility we must leave the default MergePolicy as
> selecting contiguous merges.  This is necessary because some
> applications rely on "temporal monotonicity" of doc IDs, which means
> even though merges can re-number documents, the renumbering will
> always reflect the order in which the documents were added to the
> index.
> Still, for those apps that do not rely on this, we should offer a
> MergePolicy that is free to select the best merges regardless of
> whether they are contiguous.  This requires fixing IndexWriter to
> accept such a merge, and, fixing LogMergePolicy to optionally allow
> it the freedom to do so.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

