Date: Fri, 10 Mar 2017 11:19:04 +0000 (UTC)
From: "Chuan Jin (JIRA)"
To: issues@ambari.apache.org
Subject: [jira] [Updated] (AMBARI-20392) Get aggregate metric records from HBase encounters performance issues

     [ https://issues.apache.org/jira/browse/AMBARI-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chuan Jin updated AMBARI-20392:
-------------------------------
    Description: 
I have a mini cluster (~6 nodes) managed by Ambari and a distributed HBase (~3 nodes) that holds the metrics collected from these nodes. After I deploy the YARN service, I notice that some widgets (Cluster Memory, Cluster Disk, ...) do not display properly on the YARN service dashboard page, and the Ambari Server keeps throwing timeout exceptions, complaining that it cannot get timeline metrics because the connection was refused.
I see the corresponding request looks like this:

/api/v1/clusters/bj_cluster1/services/YARN/components/NODEMANAGER?fields=metrics/yarn/ContainersFailed._rate[1489113738,1489117338,15],metrics/yarn/ContainersCompleted._rate[1489113738,1489117338,15],metrics/yarn/ContainersLaunched._rate[1489113738,1489117338,15],metrics/yarn/ContainersIniting._sum[1489113738,1489117338,15],metrics/yarn/ContainersKilled._rate[1489113738,1489117338,15],metrics/yarn/ContainersRunning._sum[1489113738,1489117338,15],metrics/memory/mem_total._avg[1489113738,1489117338,15],metrics/memory/mem_free._avg[1489113738,1489117338,15],metrics/disk/read_bps._sum[1489113738,1489117338,15],metrics/disk/write_bps._sum[1489113738,1489117338,15],metrics/network/pkts_in._avg[1489113738,1489117338,15],metrics/network/pkts_out._avg[1489113738,1489117338,15],metrics/cpu/cpu_system._sum[1489113738,1489117338,15],metrics/cpu/cpu_user._sum[1489113738,1489117338,15],metrics/cpu/cpu_nice._sum[1489113738,1489117338,15],metrics/cpu/cpu_idle._sum[1489113738,1489117338,15],metrics/cpu/cpu_wio._sum[1489113738,1489117338,15]&format=null_padding&_=1489117333815

In the AMS collector, such a request is transformed into a query like the following (logged for a different but equivalent request):

2017-03-10 16:03:56,178 DEBUG [1537616305@qtp-1324937403-125 - /ws/v1/timeline/metrics?metricNames=cpu_idle._sum%2Cyarn.NodeManagerMetrics.ContainersCompleted._rate%2Cmem_free._avg%2Cpkts_in._avg%2Cyarn.NodeManagerMetrics.ContainersLaunched._rate%2Cyarn.NodeManagerMetrics.ContainersKilled._rate%2Ccpu_wio._sum%2Cyarn.NodeManagerMetrics.ContainersIniting._sum%2Ccpu_system._sum%2Ccpu_user._sum%2Ccpu_nice._sum%2Cyarn.NodeManagerMetrics.ContainersFailed._rate%2Cmem_total._avg%2Cpkts_out._avg%2Cyarn.NodeManagerMetrics.ContainersRunning._sum&appId=NODEMANAGER&startTime=1489129435&endTime=1489133035] PhoenixTransactSQL:682 - SQL => SELECT /*+ NATIVE_TIME_RANGE(1489129315000) */ METRIC_NAME, APP_ID, INSTANCE_ID, SERVER_TIME, UNITS, METRIC_SUM, HOSTS_COUNT, METRIC_MAX, METRIC_MIN FROM METRIC_AGGREGATE WHERE (METRIC_NAME IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)) AND APP_ID = ? AND SERVER_TIME >= ? AND SERVER_TIME < ? ORDER BY METRIC_NAME, SERVER_TIME LIMIT 15840, condition => Condition{metricNames=[pkts_out, cpu_wio, cpu_idle, yarn.NodeManagerMetrics.ContainersCompleted, mem_total, cpu_nice, yarn.NodeManagerMetrics.ContainersRunning, pkts_in, yarn.NodeManagerMetrics.ContainersFailed, yarn.NodeManagerMetrics.ContainersLaunched, mem_free, cpu_user, yarn.NodeManagerMetrics.ContainersKilled, yarn.NodeManagerMetrics.ContainersIniting, cpu_system], hostnames='null', appId='NODEMANAGER', instanceId='null', startTime=1489129435, endTime=1489133035, limit=null, grouped=true, orderBy=[], noLimit=false}

The request timeout parameter is 5s, which means the query that fetches these metrics from HBase takes longer than that. I then log in with the Phoenix shell and run the same query against HBase directly, and it takes nearly 30s to finish. But if I split the big query into smaller pieces, i.e. put fewer values into the METRIC_NAME IN (...) clause of the WHERE condition, each of the small queries returns in about 1s.

Query performance in HBase depends heavily on the row key design and on using it properly. When fetching aggregate metrics, the AMS collector queries the METRIC_AGGREGATE table in a way that may force the coprocessor to scan several regions across different region servers. If we add more metrics to the service dashboard, this situation will only get worse.
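Below is a minimal sketch of the splitting workaround described above: it issues the aggregate query in small batches of metric names over a plain Phoenix JDBC connection. This is not AMS code; the JDBC URL, the batch size, the BatchedAggregateQuery class name, and the conversion of the request timestamps to milliseconds are assumptions for illustration, while the table and column names are taken from the logged SQL.

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: run the METRIC_AGGREGATE query in small batches of
// metric names instead of one query with 15 names in the IN clause.
public class BatchedAggregateQuery {

    private static final int BATCH_SIZE = 4; // assumed value; tune per cluster

    public static void main(String[] args) throws Exception {
        List<String> metricNames = Arrays.asList(
            "cpu_idle", "cpu_user", "cpu_system", "cpu_nice", "cpu_wio",
            "mem_total", "mem_free", "pkts_in", "pkts_out",
            "yarn.NodeManagerMetrics.ContainersCompleted",
            "yarn.NodeManagerMetrics.ContainersLaunched",
            "yarn.NodeManagerMetrics.ContainersKilled",
            "yarn.NodeManagerMetrics.ContainersIniting",
            "yarn.NodeManagerMetrics.ContainersFailed",
            "yarn.NodeManagerMetrics.ContainersRunning");

        // Timestamps in the logged request are in seconds; the NATIVE_TIME_RANGE
        // hint in the logged SQL suggests SERVER_TIME is stored in milliseconds.
        long startTime = 1489129435L * 1000;
        long endTime = 1489133035L * 1000;

        // Assumed JDBC URL for the AMS-embedded Phoenix/HBase instance.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
            for (int i = 0; i < metricNames.size(); i += BATCH_SIZE) {
                List<String> batch =
                    metricNames.subList(i, Math.min(i + BATCH_SIZE, metricNames.size()));
                String placeholders = String.join(", ", Collections.nCopies(batch.size(), "?"));
                String sql = "SELECT METRIC_NAME, APP_ID, SERVER_TIME, METRIC_SUM, "
                    + "HOSTS_COUNT, METRIC_MAX, METRIC_MIN FROM METRIC_AGGREGATE "
                    + "WHERE METRIC_NAME IN (" + placeholders + ") AND APP_ID = ? "
                    + "AND SERVER_TIME >= ? AND SERVER_TIME < ? "
                    + "ORDER BY METRIC_NAME, SERVER_TIME";
                try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                    int p = 1;
                    for (String name : batch) {
                        stmt.setString(p++, name);
                    }
                    stmt.setString(p++, "NODEMANAGER");
                    stmt.setLong(p++, startTime);
                    stmt.setLong(p, endTime);
                    try (ResultSet rs = stmt.executeQuery()) {
                        while (rs.next()) {
                            // Collect rows per metric; printing is enough for the sketch.
                            System.out.println(rs.getString("METRIC_NAME")
                                + " @ " + rs.getLong("SERVER_TIME")
                                + " sum=" + rs.getDouble("METRIC_SUM"));
                        }
                    }
                }
            }
        }
    }
}
{code}

Keeping the IN list short should make each query touch only a few row-key ranges instead of fanning out across many regions, which is consistent with the small queries returning in about 1s in the Phoenix shell.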

> Get aggregate metric records from HBase encounters performance issues
> ---------------------------------------------------------------------
>
>                 Key: AMBARI-20392
>                 URL: https://issues.apache.org/jira/browse/AMBARI-20392
>             Project: Ambari
>          Issue Type: Improvement
>          Components: ambari-metrics
>    Affects Versions: 2.4.2
>            Reporter: Chuan Jin

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)