From: Gus Heck
Date: Fri, 18 Jan 2019 14:27:07 -0500
Subject: Autoscaling in 8.0
To: dev@lucene.apache.org

I'm a little worried about the state of Autoscaling. It looks like it has the potential to create bad first experiences. Granted, 8.0 isn't supposed to be stable yet, but I'm seeing things that were documented for 7.6 not working in 8.x.
TL;DR:
  1. Default settings didn't distribute cores evenly across a brand-new 50-node cluster.
  2. I can't seem to write rules that produce suggestions to distribute them evenly.
  3. Suggestions are made that then fail, despite a quiet cluster with no changes.
Long version:

My client and I did something that seems very vanilla, but it didn't work out well, and the observed behavior contradicts what's published in https://lucene.apache.org/solr/guide/7_6/solr-upgrade-notes.html#solr-7-6 with respect to default core placement.

The cluster is a 50-node AWS cluster that was freshly set up by the client to test out 8.0.0 (8.0.0-SNAPSHOT 69cbe29e78c400db22aab2f918405ce627d2d65d - solr - 2019-01-11 15:41:35).
They created a collection (A) with 50 shards, one replica each (a total of 50 cores). They specified maxShardsPerNode=1 and nothing relating to autoscaling. They indexed a small amount of data (33,438,861 docs is small for them) for initial testing. They then handed it over to me, and, not yet noticing anything wrong with it, I added a second collection (B), similarly configured but with schema changes, for comparison.
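For reference, the creation was nothing more exotic than the stock Collections API CREATE; roughly this sketch (the hostname here is a placeholder, not the client's real one):

    curl "http://solr-1.example.com:8983/solr/admin/collections?action=CREATE&name=A&numShards=50&replicationFactor=1&maxShardsPerNode=1"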

However, I noticed at that point that the nodes page was showing a very strange result for this seemingly vanilla set of steps. Most nodes got one core from each collection, but not all:

Node 1 got 2 cores from A
Node 2 got 0 cores
Node 8 got 3 cores from B
Node 21 got 2 cores from A and 1 from B

I've spent all morning fiddling with rules to try to get a configuration that provides suggestions via /api/cluster/autoscaling/suggestions to equalize things, and I just can't do it. In particular, I can't ever get any suggestion to move anything to node 2. It's as if autoscaling is missing node 2, or unable to see it. A couple of times I got suggestions with green buttons in the UI (mostly I'm using Postman, however)... when I clicked the green button, it errored out saying no-node can satisfy... Nothing's changing and no data is incoming, so why is it suggesting things that don't work?
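For the record, what I've been trying are variations on the documented cluster-policy rules, roughly along these lines (a sketch, not my exact payloads; the hostname is a placeholder):

    # keep fewer than 2 replicas of any one shard on any one node
    curl -X POST "http://solr-1.example.com:8983/api/cluster/autoscaling" \
      -H 'Content-Type: application/json' \
      -d '{"set-cluster-policy": [{"replica": "<2", "shard": "#EACH", "node": "#ANY"}]}'

    # then ask what the framework would move
    curl "http://solr-1.example.com:8983/api/cluster/autoscaling/suggestions"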

When I look at /autoscaling/diagnostics I get this seemingly impossible result:
            {
                "node": "solr-2.customer.redacted.com:8983_solr",
                "isLive": true,
                "cores": 2,
                "freedisk": 140.03918838500977,
                "totaldisk": 147.5209503173828,
                "replicas": {}
            },

2 cores but no replicas? I looked on disk, and there's no data there representing a core.
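For anyone who wants to cross-check on their own cluster, a CoreAdmin STATUS call against the node (the hostname below is the redacted one from the diagnostics output above) should list whatever cores the node actually hosts:

    curl "http://solr-2.customer.redacted.com:8983/solr/admin/cores?action=STATUS&wt=json"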

-Gus

--
http://www.the111shift.com
