lucene-solr-user mailing list archives

From Michael Joyner <mich...@newsrx.com>
Subject Re: Shard size variation
Date Thu, 03 May 2018 14:02:09 GMT
We generally try not to change defaults when possible. It sounds like there
will be new default settings for the segment sizes and the merge policy?

Am I right in thinking that expungeDeletes will (in theory) be a 7.4-onwards
option?


On 05/02/2018 01:29 PM, Erick Erickson wrote:
> You can always increase the maximum segment size. For large indexes
> that should reduce the number of segments. But watch your indexing
> stats; I can't predict the consequences of bumping it to 100G, for
> instance. I'd _expect_ bursty I/O when those large segments started
> to be created or merged....
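>
> For example, raising the max segment size is normally a solrconfig.xml
> change along these lines; a minimal sketch assuming the default
> TieredMergePolicy (the 20480, i.e. 20G, value is only illustrative):
>
>   <indexConfig>
>     <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>       <!-- the default is 5000 MB (5G); larger values allow bigger merged segments -->
>       <double name="maxMergedSegmentMB">20480</double>
>     </mergePolicyFactory>
>   </indexConfig>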
>
> You'll be interested in LUCENE-7976 (Solr 7.4?), especially (probably)
> the idea of increasing the segment sizes and/or a related JIRA that
> allows you to tweak how aggressively solr merges segments that have
> deleted docs.
>
> NOTE: that JIRA has the consequence that, _by default_, an optimize
> with no parameters respects the maximum segment size, which is a
> change from the current behavior.
>
> Finally, expungeDeletes may be useful as that too will respect max
> segment size, again after LUCENE-7976 is committed.
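>
> For reference, expungeDeletes is passed along with a commit to the update
> handler; a minimal sketch (host, port, and collection name are
> illustrative):
>
>   curl 'http://localhost:8983/solr/yourCollection/update?commit=true&expungeDeletes=true'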
>
> Best,
> Erick
>
> On Wed, May 2, 2018 at 9:22 AM, Michael Joyner <michael@newsrx.com> wrote:
>> The main reason we go this route is that after a while (with default
>> settings) we end up with hundreds of segments, and performance of course
>> drops abysmally as a result. By using a stepped optimize a) we don't run
>> into the 3x+ head room issue, and b) the performance penalty during the
>> optimize is less than the penalty of leaving hundreds of segments
>> unoptimized.
>>
>> BTW, as we use a batched insert/update cycle [once daily], we only
>> optimize down to a single segment after a complete batch has been run,
>> though during the batch we reduce segment counts down to a max of 16 every
>> 250K inserts/updates to prevent the large-segment-count performance
>> penalty.
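>>
>> (For illustration, that mid-batch trim is just a SolrJ optimize call with a
>> max segment count; the client and counter names below are a sketch, not our
>> exact code.)
>>
>>     // every ~250K adds, cap the segment count at 16 to limit the
>>     // many-small-segments search penalty
>>     if (docsSinceLastTrim >= 250_000) {
>>         solrClient.optimize(true, true, 16); // waitFlush, waitSearcher, maxSegments
>>         docsSinceLastTrim = 0;
>>     }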
>>
>>
>> On 04/30/2018 07:10 PM, Erick Erickson wrote:
>>> There's really no good way to purge deleted documents from the index
>>> other than to wait until merging happens.
>>>
>>> Optimize/forceMerge and expungeDeletes both suffer from the problem
>>> that they create massive segments that then stick around for a very
>>> long time, see:
>>>
>>> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner <michael@newsrx.com>
>>> wrote:
>>>> Based on experience, 2x head room is not always enough, sometimes not
>>>> even 3x, if you are optimizing from many segments down to 1 segment in a
>>>> single go.
>>>>
>>>> We have however figured out a way that can work with as little as 51%
>>>> free space via the following iteration cycle:
>>>>
>>>> public void solrOptimize() {
>>>>     int initialMaxSegments = 256;
>>>>     int finalMaxSegments = 1;
>>>>     if (isShowSegmentCounter()) {
>>>>         log.info("Optimizing ...");
>>>>     }
>>>>     try (SolrClient solrServerInstance = getSolrClientInstance()) {
>>>>         for (int segments = initialMaxSegments;
>>>>                 segments >= finalMaxSegments; segments--) {
>>>>             if (isShowSegmentCounter()) {
>>>>                 System.out.println("Optimizing to a max of " + segments
>>>>                         + " segments.");
>>>>             }
>>>>             solrServerInstance.optimize(true, true, segments);
>>>>         }
>>>>     } catch (SolrServerException | IOException e) {
>>>>         throw new RuntimeException(e);
>>>>     }
>>>> }
>>>>
>>>>
>>>> On 04/30/2018 04:23 PM, Walter Underwood wrote:
>>>>> You need 2X the minimum index size in disk space anyway, so don’t worry
>>>>> about keeping the indexes as small as possible. Worry about having
>>>>> enough
>>>>> headroom.
>>>>>
>>>>> If your indexes are 250 GB, you need 250 GB of free space.
>>>>>
>>>>> wunder
>>>>> Walter Underwood
>>>>> wunder@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>
>>>>>> On Apr 30, 2018, at 1:13 PM, Antony A <antonyaugustus@gmail.com> wrote:
>>>>>>
>>>>>> Thanks Erick/Deepak.
>>>>>>
>>>>>> The cloud is running on bare metal (128 GB RAM / 24 CPUs).
>>>>>>
>>>>>> Is there an option to run a compaction on the data files to make the
>>>>>> size equal on both clouds? I am trying to find all the options before I
>>>>>> add the new fields into the production cloud.
>>>>>>
>>>>>> Thanks
>>>>>> AA
>>>>>>
>>>>>> On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson
>>>>>> <erickerickson@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Antony:
>>>>>>>
>>>>>>> You are probably seeing the results of removing deleted documents from
>>>>>>> the shards as they're merged. Even on replicas in the same _shard_,
>>>>>>> the size of the index on disk won't necessarily be identical. This has
>>>>>>> to do with which segments are selected for merging, which are not
>>>>>>> necessarily coordinated across replicas.
>>>>>>>
>>>>>>> The test is whether the number of docs in each collection is the
>>>>>>> same. If it is, then don't worry about index sizes.
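>>>>>>>
>>>>>>> For example, a quick check is a match-all query with rows=0 against
>>>>>>> each cloud, comparing numFound; in SolrJ, something like this (client
>>>>>>> and collection name are illustrative):
>>>>>>>
>>>>>>>     SolrQuery q = new SolrQuery("*:*");
>>>>>>>     q.setRows(0);
>>>>>>>     // compare this value between the two clouds
>>>>>>>     long numFound = client.query("yourCollection", q)
>>>>>>>             .getResults().getNumFound();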
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>> On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel <deicool@gmail.com>
>>>>>>> wrote:
>>>>>>>> Could you please also give the machine details of the two clouds you
>>>>>>>> are running?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Deepak
>>>>>>>> "The greatness of a nation can be judged by the way its animals
are
>>>>>>>> treated. Please stop cruelty to Animals, become a Vegan"
>>>>>>>>
>>>>>>>> +91 73500 12833
>>>>>>>> deicool@gmail.com
>>>>>>>>
>>>>>>>> Facebook: https://www.facebook.com/deicool
>>>>>>>> LinkedIn: www.linkedin.com/in/deicool
>>>>>>>>
>>>>>>>> "Plant a Tree, Go Green"
>>>>>>>>
>>>>>>>> Make In India : http://www.makeinindia.com/home
>>>>>>>>
>>>>>>>> On Mon, Apr 30, 2018 at 9:51 PM, Antony A <antonyaugustus@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Hi Shawn,
>>>>>>>>>
>>>>>>>>> The cloud is running version 6.2.1 with ClassicIndexSchemaFactory.
>>>>>>>>>
>>>>>>>>> The sum of sizes from the admin UI across all the shards is around
>>>>>>>>> 265 G vs 224 G between the two clouds.
>>>>>>>>>
>>>>>>>>> I created the collection using "numShards", so it uses the
>>>>>>>>> compositeId router.
>>>>>>>>>
>>>>>>>>> If you need more information, please let me know.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> AA
>>>>>>>>>
>>>>>>>>> On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey <apache@elyograg.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> On 4/30/2018 9:51 AM, Antony A wrote:
>>>>>>>>>>
>>>>>>>>>>> I am running two separate solr clouds. I have 8 shards in each,
>>>>>>>>>>> with a total of 300 million documents. Both the clouds are
>>>>>>>>>>> indexing the documents from the same source/configuration.
>>>>>>>>>>>
>>>>>>>>>>> I am noticing there is a difference in the size of the collection
>>>>>>>>>>> between them. I am planning to add more shards to see if that
>>>>>>>>>>> helps solve the issue. Has anyone come across a similar issue?
>>>>>>>>>>>
>>>>>>>>>> There's no information here about exactly what you are seeing, what
>>>>>>>>>> you are expecting to see, and why you believe that what you are
>>>>>>>>>> seeing is wrong.  You did say that there is "a difference in size".
>>>>>>>>>> That is a very vague problem description.
>>>>>>>>>>
>>>>>>>>>> FYI, unless a SolrCloud collection is using the implicit router,
>>>>>>>>>> you cannot add shards.  And if it *IS* using the implicit router,
>>>>>>>>>> then you are 100% in control of document routing -- Solr cannot
>>>>>>>>>> influence that at all.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shawn
>>>>>>>>>>
>>>>>>>>>>

