lucene-solr-user mailing list archives

From Christopher Schultz <ch...@christopherschultz.net>
Subject Re: Period on-line index optimization
Date Tue, 27 Nov 2018 17:04:06 GMT
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Shawn,

On 11/27/18 11:01, Shawn Heisey wrote:
> On 11/27/2018 7:47 AM, Christopher Schultz wrote:
>> I've got a single-core Solr instance with something like 1M small
>> documents in it. It contains user information for fast-lookups,
>> and it gets updated any time relevant user-info changes.
>> 
>> Here's the basic info from the Core Dashboard:
> 
> <snip>
> 
>> I'm wondering how often it makes sense to "optimize" my index, 
>> because there is plenty of turnover of existing documents. That 
>> is, plenty of existing users update their info and therefore the 
>> Lucene index is being updated as well -- causing a 
>> document-delete and document-add operation to occur. My 
>> understanding is that leaves a lot of dead space over time, and 
>> I'm assuming that it might even slow things down as the ratio of 
>> useful data to total data is reduced.
> 
> The percentage of deleted documents here is fairly low. About 7.6 
> percent.  Doing an optimize with deleted percentage that low may 
> not be worthwhile.
> 
> On the other hand, it *would* improve performance by a little bit 
> to optimize.  For the index with the stats you mentioned, you'd be 
> going from 15 segments to one segment.  And with an index size of 
> under 300 MB, the optimize operation would complete pretty quickly 
> - likely a few minutes, maybe even less than one minute.
Okay. What I really don't want to do is interrupt normal operation.
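
For reference, the "percentage of deleted documents" quoted above can be derived from the two document counts on the Core Dashboard. The following is a minimal sketch, assuming the percentage is taken relative to maxDoc (live plus deleted documents); the counts used are hypothetical, since the real dashboard values were snipped from this thread.

```python
def deleted_percentage(max_doc: int, num_docs: int) -> float:
    """Return the share of index entries that are deleted documents, in percent.

    max_doc counts live plus deleted documents; num_docs counts live ones only.
    """
    if max_doc <= 0:
        return 0.0
    return 100.0 * (max_doc - num_docs) / max_doc


# Hypothetical counts, NOT the snipped dashboard values from this thread.
print(round(deleted_percentage(1_000_000, 924_000), 1))
```

A check like this could gate a scheduled optimize so that it only runs once the deleted share crosses a threshold of your choosing.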

>> Presumably, optimizing more often will reduce the time to
>> perform a single optimization operation, yes?
> 
> No, not really.  It depends on what documents are in the index,
> not so much on whether an optimization was done previously. 
> Subsequent optimizes will take about as long as the previous 
> optimize did.

So, it's pretty much like GC promotion: the number of live objects is
really the only thing that matters?

>> Anyhow, I'd like to know a few things:
>> 
>> 1. Is manually-triggered optimization even worth doing at all?
> 
> Maybe.  See how long it takes, how much impact it has on 
> performance while it's happening, and see if you can get an 
> estimate of how much extra performance you get from it once it's 
> done.  If the impact is low and/or the benefit is high, then by
> all means, optimize regularly.
> 
>> 2. If so, how often? Or, maybe not "how often [in 
>> hours/days/months]" but maybe "how often [in deletes, etc.]"?
> 
> For an index that size, I would say you should aim for an interval
>  between once an hour and once every 24 hours.  Set up this timing 
> based on what kind of impact the optimize operation has on 
> performance while it's occurring.  Might be best to do it once a 
> day at a low activity time, perhaps 03:00.  With indexes slightly 
> bigger than that, I was doing an optimize once an hour. And for
> the bigger indexes, once a day.

I was thinking once per day. AFAIK, this index hasn't been optimized
since it was first built which was a few months ago.

>> 3. During the optimization operation, can clients still issue 
>> (read) queries? If so, will they wait until the optimization 
>> operation has completed?
> 
> Yes.  And as long as you don't use deleteByQuery, you can even 
> update the index while it's optimizing.  The deleteByQuery 
> operation will cause problems, especially when the index gets 
> large.  With your small index size, you might not even notice the 
> problems that mixing optimize and deleteByQuery will cause. 
> Replacing deleteByQuery with a standard query to retrieve ID
> values and then doing a deleteById will get rid of the problems
> that DBQ causes with optimize.

We aren't explicitly deleting anything, ever. The only deletes
occurring should be when we perform an update() on a document, and
Solr/Lucene automatically deletes the existing document with the same
id.
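
For setups that do rely on deleteByQuery, the replacement suggested in the quoted text (query for the ID values, then delete by ID) might look roughly like the sketch below. It only builds the request body and parses a response; the field name "id" and the sample response are assumptions for illustration, and sending the requests is left out.

```python
import json


def ids_from_select_response(response: dict, id_field: str = "id") -> list:
    """Pull the unique-key values out of a parsed JSON query response."""
    return [doc[id_field] for doc in response["response"]["docs"]]


def delete_by_id_payload(ids) -> str:
    """Build a JSON update body that deletes documents by unique key."""
    return json.dumps({"delete": list(ids)})


# Hypothetical query result standing in for a real /select response.
sample = {"response": {"numFound": 2, "docs": [{"id": "u1"}, {"id": "u2"}]}}
print(delete_by_id_payload(ids_from_select_response(sample)))
```

The resulting body would be posted to the core's update handler in place of a delete-by-query request.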

>> 5. Is it possible to abort an optimization operation if it's 
>> taking too long, and simply discard the new data -- basically, 
>> fall-back to the previously-existing index data?
> 
> I am not aware of a way to abort an optimize.  I suppose there 
> might be one ... but in general it doesn't sound like a good idea 
> to me, even if it's possible.
> 
>> 6. What's a good way to trigger an optimization operation? I 
>> didn't see anything directly in the web UI, but there is an 
>> "optimize" method in the Solr/J client. If I can fire-off a 
>> fire-and-forget "optimize" request via e.g. curl or similar tool 
>> rather than writing a Java client, that would be slightly more 
>> convenient for me.
> 
> Removal of the optimize button from the admin UI was completely 
> intentional.  It's such a tempting button ... there's a tendency 
> for people to say to themselves "of COURSE I want to optimize my 
> index, and make that indicator green!"  But optimizing a 50GB 
> index will quite literally take HOURS ... and will dramatically 
> impact overall performance for that whole time.  So we have
> removed the temptation.  We haven't removed the ability to
> optimize, just the button in the UI.

Ack.

> You can use the optimize method in the SolrJ client if your setup 
> is already using SolrJ.  Doing the optimize with something like 
> curl is typically a little bit easier, and won't present a
> problem. Either way, I would arrange for it to happen in the
> background -- a separate thread in a SolrJ program, or the &
> character on the commandline or in a script when using something
> like curl.  Setting the "wait" options on the optimize request to
> false didn't seem to actually lead to an immediate return on the
> request and background operation on the server.  Been wondering if
> I should file a bug on that problem, if I can reproduce it with
> latest Solr.

I'd want to schedule this thing with cron, so curl is better for me.
"nohup optimize &" is fine with me, especially if it will give me
stats on how long the optimization actually took.
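
A rough sketch of what such a scheduled job could call, written in Python rather than curl so that it can report the elapsed time itself. The host, port, and core name ("users") are assumptions; the optimize, maxSegments, and waitSearcher parameters are passed to the core's update handler. Nothing is sent until run_optimize() is called.

```python
import time
import urllib.request
from urllib.parse import urlencode


def optimize_url(base_url: str, core: str, max_segments: int = 1,
                 wait_searcher: bool = True) -> str:
    """Build the update-handler URL that requests an optimize."""
    params = urlencode({
        "optimize": "true",
        "maxSegments": max_segments,
        "waitSearcher": str(wait_searcher).lower(),
    })
    return f"{base_url.rstrip('/')}/{core}/update?{params}"


def run_optimize(url: str, timeout: float = 3600.0) -> float:
    """Send the request and return how many seconds it took to come back."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # drain the response body before stopping the clock
    return time.monotonic() - start


# Hypothetical host and core name; adjust for the real instance.
print(optimize_url("http://localhost:8983/solr", "users"))
```

Calling run_optimize() on that URL from a script started by cron at a quiet hour would log the duration of each run, which is the number needed to decide between hourly and daily scheduling.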

I have dev and test environments so I have plenty of places to
play-around. I can even load my production index into dev to see how
long the whole 1M document index will take to optimize, though the
number of segments in the index will be different, unless I just
straight-up copy the index files from the disk. I probably won't do
that because I'd prefer not to take-down the index long enough to take
a copy.

> If deleteByQuery is an essential part of your indexing process, 
> then it would be prudent to avoid indexing while an optimize is 
> underway.  If you do a deleteByQuery during an optimize, then all 
> indexing from that point on will wait until the optimize is done. 
> On a big index, that could be hours.

I'm assuming this isn't going to take very long to optimize, but we'll
see.

You skipped question 4 which was "can I update my index during an
optimization", but you did mention in your answer to question 3 ("can
I still query during optimize?") that I "should" be able to update the
index (e.g. add/update). Can you clarify why you said "should" instead
of "can"?

Thanks again,
- -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAlv9eQUACgkQHPApP6U8
pFiCKQ/+PdXLrf5uK+vdmEh4JuyG2qRbdcLUCykkp9sJEXtAK4sIEWQIa4EK9loZ
Lnf7MuZJCmW59nJzbjHZyvL+kOCl/237tDX0Ovmr39s3bZj/mg6MyXHeoCuwX2zb
60MYqEEaxL/Y8v0pssiGi4gy6xygnk0cVh4Pk8OgkUHbRqKsbZ8CcO7B7wEKnwRX
vNw6d1bNm0JbxFyNGks0gWjlbLL7rNMFpdb2WDnZ92qcGMJS9ffycuD3AZGa//Q2
hC/0cUlWhWX0mkRXFi+1YsXDtUVyEB3DzRhihT7DEnILlji7CYMuqVOs5vms4Z8B
wvcOmrzgYyqlxtpryiEsq/riTAT4C58n/gtY9h1UOs72xemBHBK3QXf+XYceyx8D
iZ/1qlrXzHIYbyQPL9hBvKTTqmAhgv2Uxrx3DwLOI/3aVCuGKx0DPNu7T4Uab+kO
xwaiN0EEQIO30xv3+8+bKUeImz4AvvuEqT9jdLDHyU14cCIQmOpbCmWG1DjakNPC
fEm2ghnNt+9oaL/00zG/kJf26y2mJI0UCo1L5tgzgmGpwIiZ82DenReug9373wpw
1a1cTOfJnecC4fakOHQ4xg3+9iZoVgV6q/ZoOfFx53400eJ1hI82nY39Mb5h0X8l
S9bTPBIfOnPynIdfcbS2Mgiq3Wr9jwd3Lwii/qC4MDv1tTaXBCE=
=Aeq9
-----END PGP SIGNATURE-----