lucene-dev mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Can we change forceMerge to not need as much disk space?
Date Fri, 13 Dec 2019 22:06:07 GMT
Coming back to this after a while.

Opening a new searcher is a sticky wicket. Say you’re merging segments 1, 2, 3 into 4.
Readers have handles to segments 1, 2, 3, so even after 4 is created and 1, 2, and 3 are
deleted, their disk files hang around until the searcher is closed.
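As a toy sketch of that mechanism (plain Java, not Lucene's actual IndexFileDeleter; all names here are invented for illustration): a merged-away segment's disk space is only reclaimed once the writer has deleted it _and_ the last reader holding it lets go.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Invented toy model, not Lucene code: segment files stay on disk while
// any open reader still references them, even after the writer "deletes" them.
class SegmentDisk {
    private final Set<String> onDisk = new HashSet<>();
    private final Map<String, Integer> readerRefs = new HashMap<>();
    private final Set<String> pendingDelete = new HashSet<>();

    void create(String seg) { onDisk.add(seg); }

    void openReader(String seg) { readerRefs.merge(seg, 1, Integer::sum); }

    void closeReader(String seg) {
        if (readerRefs.merge(seg, -1, Integer::sum) == 0 && pendingDelete.remove(seg)) {
            onDisk.remove(seg); // last handle released: space actually reclaimed
        }
    }

    // The writer deletes a segment after merging it away.
    void delete(String seg) {
        if (readerRefs.getOrDefault(seg, 0) > 0) {
            pendingDelete.add(seg); // still pinned by an open reader
        } else {
            onDisk.remove(seg);
        }
    }

    boolean usesDisk(String seg) { return onDisk.contains(seg); }

    public static void main(String[] args) {
        SegmentDisk d = new SegmentDisk();
        for (String s : new String[] {"_1", "_2", "_3"}) { d.create(s); d.openReader(s); }
        d.create("_4");                                  // merge wrote segment 4
        d.delete("_1"); d.delete("_2"); d.delete("_3");  // writer drops 1, 2, 3
        System.out.println("_1 on disk after delete: " + d.usesDisk("_1"));  // still true
        for (String s : new String[] {"_1", "_2", "_3"}) d.closeReader(s);   // searcher closes
        System.out.println("_1 on disk after close:  " + d.usesDisk("_1"));  // now false
    }
}
```

Until the readers close, the index transiently needs room for 1, 2, 3, and 4 at once, which is exactly the disk-space problem under discussion.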

That said, say we start the forceMerge at time T. As it stands, forceMerge only operates
on already-closed segments. So theoretically, opening and closing readers as new segments
were created would change nothing in terms of search results, _assuming_ no indexing was
happening: all the docs that were visible in 1, 2, and 3 would be exactly the docs visible
in 4.
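Concretely (a toy check in plain Java, nothing Lucene-specific): a merge only re-packages the live documents, so the set of docs a freshly opened reader sees is identical before and after.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy illustration, not Lucene code: with no concurrent indexing, the docs
// visible through segments 1, 2, 3 equal the docs visible through merged segment 4.
class MergeVisibility {
    public static void main(String[] args) {
        // Live docs per segment (deleted docs already excluded).
        Set<String> seg1 = Set.of("docA", "docB");
        Set<String> seg2 = Set.of("docC");
        Set<String> seg3 = Set.of("docD", "docE");

        // What a reader over segments 1, 2, 3 can see.
        Set<String> visibleBefore = new TreeSet<>();
        visibleBefore.addAll(seg1);
        visibleBefore.addAll(seg2);
        visibleBefore.addAll(seg3);

        // The merge writes segment 4 from the same live docs.
        Set<String> seg4 = new TreeSet<>();
        seg4.addAll(seg1);
        seg4.addAll(seg2);
        seg4.addAll(seg3);

        // A reader reopened over segment 4 alone sees the same docs.
        System.out.println(visibleBefore.equals(seg4)); // true
    }
}
```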

The only case where users would be surprised is if there was ongoing indexing and they
weren’t opening searchers, just committing with openSearcher=false.

Thinking about this some more, I think it’s reasonable to say “if you forceMerge while
indexing is happening, new documents will appear even if you don’t do an explicit commit”.
From Solr’s perspective, it’s something of an anti-pattern to index for a long time without
opening a new searcher anyway, since the internal structures that support Real Time Get
grow until a new searcher is opened.
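A toy model of that trade-off (invented Java, not Solr's actual update log or RTG code): committing without opening a searcher keeps new docs out of query results, while a side structure tracking them keeps growing until a searcher opens.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Invented toy model, not Solr code: indexing with openSearcher=false keeps
// new docs out of query results, while an in-memory id->doc map (standing in
// for the structures behind Real Time Get) grows until a searcher is opened.
class SearcherVisibility {
    private final List<String> committed = new ArrayList<>();   // docs in the index
    private List<String> searcherView = new ArrayList<>();      // what queries see
    private final Map<String, String> rtgMap = new HashMap<>(); // grows between searchers

    void indexAndCommit(String id, String doc) {                // commit, openSearcher=false
        committed.add(doc);
        rtgMap.put(id, doc);                                    // RTG must still find it
    }

    void openSearcher() {
        searcherView = new ArrayList<>(committed);              // docs become visible
        rtgMap.clear();                                         // and the side structure shrinks
    }

    List<String> search() { return searcherView; }
    int rtgSize() { return rtgMap.size(); }

    public static void main(String[] args) {
        SearcherVisibility s = new SearcherVisibility();
        s.indexAndCommit("1", "docA");
        s.indexAndCommit("2", "docB");
        System.out.println(s.search().size() + " visible, rtg=" + s.rtgSize()); // 0 visible, rtg=2
        s.openSearcher(); // what a forceMerge-driven reopen would also do
        System.out.println(s.search().size() + " visible, rtg=" + s.rtgSize()); // 2 visible, rtg=0
    }
}
```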

And since we discourage forceMerge in the first place, I could live with that.

FWIW.

> On Sep 13, 2019, at 3:58 PM, Shawn Heisey <apache@elyograg.org> wrote:
> 
> On 9/2/2019 9:19 AM, Erick Erickson wrote:
>> Anyway, it occurred to me that once a max-sized segment is created, _if_ we write
>> the segments_n file out with the current state of the index, we could freely delete
>> the segments that were merged into the new one. With 300G indexes (which I see
>> regularly in the field, even multiple ones per node that size), this could result
>> in substantial disk savings.
> 
> <snip>
> 
>> Off the top of my head, I can see some concerns:
>> 1> we’d have to open new searchers every time we wrote the segments_n file to
>> release file handles on the old segments
> 
> How would that interact with user applications that normally handle opening new
> searchers (such as Solr)?  When users want there to be no new searchers until they
> issue an explicit commit, I think they're going to be a little irritated if Lucene
> decides to open a new searcher on its own.  Maybe we'd need to advise people to turn
> off their indexing anytime they're doing a forceMerge/optimize.  That's generally a
> good idea anyway, and pretty much required if deleteByQuery is being used.
> 
>> 2> coordinating multiple merge threads
> 
> I would think the scheduler already handles that ... thinking about all this makes
> my brain hurt ... if I have to think about the scheduler too, there might be
> implosions. :)
> 
>> 3> maxMergeAtOnceExplicit could mean unnecessary thrashing/opening searchers
>> (could this be deprecated?)
> 
> It has always bothered me that when I looked for info about changing the policy
> settings and set the two "main" parts of the policy to 35 (instead of the default 10),
> the info I found never mentioned maxMergeAtOnceExplicit.  I also needed to set that
> value (to 105) to have an optimize work the way I expected.  Without it, a lot more
> merging occurred than was necessary when I did an optimize.  This was on a really old
> version of Solr, either 1.4.x or 3.2.x, back when it was relatively new.
> 
> The maxMergeAtOnceExplicit setting is not even mentioned in the Solr ref guide page
> about IndexConfig.  I got the information for that setting from solr-user, when I
> asked why an optimize with values increased from 10 to 35 was doing more merge passes
> than I thought it needed.  I think either that parameter needs to go away or the docs
> need improvement.
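For reference, those settings would look something like this in solrconfig.xml (a sketch only; this is the Solr 7+ `<mergePolicyFactory>` syntax, which may differ on the old versions mentioned above, so check your release's docs):

```xml
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicyFactory>
</indexConfig>
```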
> 
>> 4> Don’t quite know what to do if maxSegments is 1 (or other very low number).
> 
> I don't think anything can be done about disk usage for that.  Just the nature of
> the beast.
> 
>> Something like this would also pave the way for “background optimizing”. Instead of
>> a monolithic forceMerge, I can envision a process whereby we create a low-level task
>> that merges one max-sized segment at a time, comes up for air and reopens searchers,
>> then goes back in and merges the next one. It has its own problems around
>> coordinating ongoing updates, but that’s another discussion ;).
> 
> As mentioned above, I worry about low-level code opening new searchers because lots
> of users want that to be completely under their control.  Maybe TMP needs another
> setting to tell it whether or not it's allowed to open searchers, with documentation
> saying that less disk space might be required if it is allowed.
> 
> It would be awesome to eliminate the huge forceMerge disk requirement for most
> users, so I think it's worth exploring.  Can the stuff with readers that Mike
> mentioned happen without opening a new searcher at the app level?  My knowledge of
> Lucene internals is unfortunately too vague to answer my own question.
> 
> Thanks,
> Shawn
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 



