lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8004) IndexUpgraderTool should rewrite segments rather than forceMerge
Date Sat, 12 May 2018 22:04:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473283#comment-16473283
] 

Erick Erickson commented on LUCENE-8004:
----------------------------------------

Marking as blocker since I think there are some observations here that should be discussed
and resolved before releasing 7.4. Related to LUCENE-7976.

1> By default it uses TMP and merges down to one segment. It's ironic that we say "don't
optimize", then provide a tool that does exactly that.

2> if we are respecting max segment size (the new default), TMP will not attempt to merge
segments that are >= the max segment size _and_ have no deleted docs.

3> <1> and <2> mean that the bit at the end of {{UpgradeIndexMergePolicy.findForcedMerges}}
where it collects all the old segments not returned by {{in.findForcedMerges}} into one segment
has the potential to create a very large segments, which is trappy.

Straw-man proposal:

I really dislike constructs like

{{ if (instanceof TieredMergePolicy) { do one thing} else { do something else} }}

Seems like that knowledge should be built into the classes themselves. WDYT about a new method
or two in MergePolicy? They'd be no-ops for everything except TMP at this point. We need a
couple of things:

1> hard-coding {{w.forceMerge(1);}} in IndexUpgrader is evil so I'm thinking of something
like {{w.forceMerge(iwc.getMergePolicy().getDefaultForceMergeSegmentCount());}} instead. It
would return 1 by default. Return default in MergePolicy and override only in TMP? Make it
abstract and override everywhere?

2a> We need a way to tell TMP.findForcedMerges to return segments even if they're large
and have no deletes in this case. Another (expert) method on MergePolicy set/getUsingForUpgrade?
I don't particularly like that at all

2b.1> change UpgradeIndexMergePolicy.findForcedMerges to singleton-merge all leftover segments
rather than merge them all into a single segment. I'd argue that this is "more correct"; Anything
findForcedMerges leaves un-merged was determined by the merge policy NOT to be "cheap" so
we should respect that.

2.b.2> change UpgradeIndexMergePolicy to "somehow" respect max segment size when it gathers
the segments into at the end of findForcedMerges(). This pollutes MergePolicy some more.

2.b.3> Add a new method to MergePolicy {{findRewriteAllSegments}} or something. It'd look
just like {{findForcedMerges}} except it would "understand" that every segment would need
to be rewritten, even large ones with no deleted documents. This would make <1> unnecessary.
For everything except TMP it would be a passthrough to {{findForcedMerges}} at this point.

I don't much like changing MergePolicy, it's nice and clean and adding methods to get specifics
from each subclass pollutes it a bit. So <2.b.2> at least doesn't require that the caller
understand anything about the guts of the class.

Also keep in mind that distributed systems have a hard time using the IndexUpgrader on every
node in the system, so we need to have something that is easy to invoke in that environment,
see the discussion at LUCENE-8264.

I'm totally open to any better ideas, haven't started work on this yet.

> IndexUpgraderTool should rewrite segments rather than forceMerge
> ----------------------------------------------------------------
>
>                 Key: LUCENE-8004
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8004
>             Project: Lucene - Core
>          Issue Type: Improvement
>         Environment: Marking as blocker since I think there are some observations here
that should be discussed and resolved before releasing 7.4. Related to LUCENE-7976.
> 1> By default it uses TMP and merges down to one segment. It's ironic that we say
"don't optimize", then provide a tool that does exactly that.
> 2> if we are respecting max segment size (the new default), TMP will not attempt to
merge segments that are >= the max segment size _and_ have no deleted docs.
> 3> <1> and <2> mean that the bit at the end of {{UpgradeIndexMergePolicy.findForcedMerges}}
where it collects all the old segments not returned by {{in.findForcedMerges}} into one segment
has the potential to create a very large segments, which is trappy.
> Straw-man proposal:
> I really dislike constructs like
> {{ if (instanceof TieredMergePolicy) { do one thing} else { do something else} }}
> Seems like that knowledge should be built into the classes themselves. WDYT about a new
method or two in MergePolicy? They'd be no-ops for everything except TMP at this point. We
need a couple of things:
> 1> hard-coding {{w.forceMerge(1);}} in IndexUpgrader is evil so I'm thinking of something
like {{w.forceMerge(iwc.getMergePolicy().getDefaultForceMergeSegmentCount());}} instead. It
would return 1 by default. Return default in MergePolicy and override only in TMP? Make it
abstract and override everywhere?
> 2a> We need a way to tell TMP.findForcedMerges to return segments even if they're
large and have no deletes in this case. Another (expert) method on MergePolicy set/getUsingForUpgrade?
I don't particularly like that at all
> 2b.1> change UpgradeIndexMergePolicy.findForcedMerges to singleton-merge all leftover
segments rather than merge them all into a single segment. I'd argue that this is "more correct";
Anything findForcedMerges leaves un-merged was determined by the merge policy NOT to be "cheap"
so we should respect that.
> 2.b.2> change UpgradeIndexMergePolicy to "somehow" respect max segment size when it
gathers the segments into at the end of findForcedMerges(). This pollutes MergePolicy some
more.
> 2.b.3> Add a new method to MergePolicy {{findRewriteAllSegments}} or something. It'd
look just like {{findForcedMerges}} except it would "understand" that every segment would
need to be rewritten, even large ones with no deleted documents. This would make <1>
unnecessary. For everything except TMP it would be a passthrough to {{findForcedMerges}} at
this point.
> I don't much like changing MergePolicy, it's nice and clean and adding methods to get
specifics from each subclass pollutes it a bit. So <2.b.2> at least doesn't require
that the caller understand anything about the guts of the class.
> Also keep in mind that distributed systems have a hard time using the IndexUpgrader on
every node in the system, so we need to have something that is easy to invoke in that environment,
see the discussion at LUCENE-8264.
> I'm totally open to any better ideas, haven't started work on this yet.
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Blocker
>
> Spinoff from LUCENE-7976. We help users get themselves into a corner by using forceMerge
on an index to rewrite all segments in the current Lucene format. We should rewrite each individual
segment instead. This would also help with upgrading X-2->X-1, then X-1->X.
> Of course the preferred method is to re-index from scratch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message