Date: Fri, 10 Mar 2017 11:19:04 +0000 (UTC)
From: "Chuan Jin (JIRA)"
To: issues@ambari.apache.org
Subject: [jira] [Updated] (AMBARI-20392) Get aggregate metric records from HBase encounters performance issues

     [ https://issues.apache.org/jira/browse/AMBARI-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chuan Jin updated AMBARI-20392:
-------------------------------
    Description: 
I have a mini cluster (~6 nodes) managed by Ambari and a distributed HBase (~3 nodes) that holds the metrics collected from these nodes. After I deploy the YARN service, I notice that some widgets (Cluster Memory, Cluster Disk, ...) do not display properly on the YARN service dashboard page, and the Ambari Server keeps throwing timeout exceptions, complaining that it cannot get timeline metrics because the connection was refused.
I see the corresponding request looks like this:

/api/v1/clusters/bj_cluster1/services/YARN/components/NODEMANAGER?fields=metrics/yarn/ContainersFailed._rate[1489113738,1489117338,15],metrics/yarn/ContainersCompleted._rate[1489113738,1489117338,15],metrics/yarn/ContainersLaunched._rate[1489113738,1489117338,15],metrics/yarn/ContainersIniting._sum[1489113738,1489117338,15],metrics/yarn/ContainersKilled._rate[1489113738,1489117338,15],metrics/yarn/ContainersRunning._sum[1489113738,1489117338,15],metrics/memory/mem_total._avg[1489113738,1489117338,15],metrics/memory/mem_free._avg[1489113738,1489117338,15],metrics/disk/read_bps._sum[1489113738,1489117338,15],metrics/disk/write_bps._sum[1489113738,1489117338,15],metrics/network/pkts_in._avg[1489113738,1489117338,15],metrics/network/pkts_out._avg[1489113738,1489117338,15],metrics/cpu/cpu_system._sum[1489113738,1489117338,15],metrics/cpu/cpu_user._sum[1489113738,1489117338,15],metrics/cpu/cpu_nice._sum[1489113738,1489117338,15],metrics/cpu/cpu_idle._sum[1489113738,1489117338,15],metrics/cpu/cpu_wio._sum[1489113738,1489117338,15]&format=null_padding&_=1489117333815

In the AMS collector, such a request is transformed into a query like the following (logged for a different but equivalent request):

2017-03-10 16:03:56,178 DEBUG [1537616305@qtp-1324937403-125 - /ws/v1/timeline/metrics?metricNames=cpu_idle._sum%2Cyarn.NodeManagerMetrics.ContainersCompleted._rate%2Cmem_free._avg%2Cpkts_in._avg%2Cyarn.NodeManagerMetrics.ContainersLaunched._rate%2Cyarn.NodeManagerMetrics.ContainersKilled._rate%2Ccpu_wio._sum%2Cyarn.NodeManagerMetrics.ContainersIniting._sum%2Ccpu_system._sum%2Ccpu_user._sum%2Ccpu_nice._sum%2Cyarn.NodeManagerMetrics.ContainersFailed._rate%2Cmem_total._avg%2Cpkts_out._avg%2Cyarn.NodeManagerMetrics.ContainersRunning._sum&appId=NODEMANAGER&startTime=1489129435&endTime=1489133035] PhoenixTransactSQL:682 - SQL => SELECT /*+ NATIVE_TIME_RANGE(1489129315000) */ METRIC_NAME, APP_ID, INSTANCE_ID, SERVER_TIME, UNITS, METRIC_SUM, HOSTS_COUNT, METRIC_MAX, METRIC_MIN FROM METRIC_AGGREGATE WHERE (METRIC_NAME IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)) AND APP_ID = ? AND SERVER_TIME >= ? AND SERVER_TIME < ? ORDER BY METRIC_NAME, SERVER_TIME LIMIT 15840, condition => Condition{metricNames=[pkts_out, cpu_wio, cpu_idle, yarn.NodeManagerMetrics.ContainersCompleted, mem_total, cpu_nice, yarn.NodeManagerMetrics.ContainersRunning, pkts_in, yarn.NodeManagerMetrics.ContainersFailed, yarn.NodeManagerMetrics.ContainersLaunched, mem_free, cpu_user, yarn.NodeManagerMetrics.ContainersKilled, yarn.NodeManagerMetrics.ContainersIniting, cpu_system], hostnames='null', appId='NODEMANAGER', instanceId='null', startTime=1489129435, endTime=1489133035, limit=null, grouped=true, orderBy=[], noLimit=false}

The request timeout parameter is 5s, which means the query that fetches these metrics from HBase takes longer than that. I then log in with the Phoenix shell and run the same query against HBase directly, and it takes nearly 30s to finish. But if I split the big query into smaller pieces, i.e. put fewer values into the METRIC_NAME IN (...) clause of the WHERE condition, each of the small queries returns in about 1s.

Query performance in HBase depends heavily on the row key design and on using it properly. When fetching aggregate metrics, the AMS collector queries the METRIC_AGGREGATE table in a way that may force the coprocessor to scan several regions across different region servers. If we add more metrics to the service dashboard, this situation will only get worse.
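Below is a minimal sketch of the splitting workaround described above: it issues the aggregate query in small batches of metric names over a plain Phoenix JDBC connection. This is not AMS code; the JDBC URL, the batch size, the BatchedAggregateQuery class name, and the conversion of the request timestamps to milliseconds are assumptions for illustration, while the table and column names are taken from the logged SQL.

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: run the METRIC_AGGREGATE query in small batches of
// metric names instead of one query with 15 names in the IN clause.
public class BatchedAggregateQuery {

    private static final int BATCH_SIZE = 4; // assumed value; tune per cluster

    public static void main(String[] args) throws Exception {
        List<String> metricNames = Arrays.asList(
            "cpu_idle", "cpu_user", "cpu_system", "cpu_nice", "cpu_wio",
            "mem_total", "mem_free", "pkts_in", "pkts_out",
            "yarn.NodeManagerMetrics.ContainersCompleted",
            "yarn.NodeManagerMetrics.ContainersLaunched",
            "yarn.NodeManagerMetrics.ContainersKilled",
            "yarn.NodeManagerMetrics.ContainersIniting",
            "yarn.NodeManagerMetrics.ContainersFailed",
            "yarn.NodeManagerMetrics.ContainersRunning");

        // Timestamps in the logged request are in seconds; the NATIVE_TIME_RANGE
        // hint in the logged SQL suggests SERVER_TIME is stored in milliseconds.
        long startTime = 1489129435L * 1000;
        long endTime = 1489133035L * 1000;

        // Assumed JDBC URL for the AMS-embedded Phoenix/HBase instance.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
            for (int i = 0; i < metricNames.size(); i += BATCH_SIZE) {
                List<String> batch =
                    metricNames.subList(i, Math.min(i + BATCH_SIZE, metricNames.size()));
                String placeholders = String.join(", ", Collections.nCopies(batch.size(), "?"));
                String sql = "SELECT METRIC_NAME, APP_ID, SERVER_TIME, METRIC_SUM, "
                    + "HOSTS_COUNT, METRIC_MAX, METRIC_MIN FROM METRIC_AGGREGATE "
                    + "WHERE METRIC_NAME IN (" + placeholders + ") AND APP_ID = ? "
                    + "AND SERVER_TIME >= ? AND SERVER_TIME < ? "
                    + "ORDER BY METRIC_NAME, SERVER_TIME";
                try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                    int p = 1;
                    for (String name : batch) {
                        stmt.setString(p++, name);
                    }
                    stmt.setString(p++, "NODEMANAGER");
                    stmt.setLong(p++, startTime);
                    stmt.setLong(p, endTime);
                    try (ResultSet rs = stmt.executeQuery()) {
                        while (rs.next()) {
                            // Collect rows per metric; printing is enough for the sketch.
                            System.out.println(rs.getString("METRIC_NAME")
                                + " @ " + rs.getLong("SERVER_TIME")
                                + " sum=" + rs.getDouble("METRIC_SUM"));
                        }
                    }
                }
            }
        }
    }
}
{code}

Keeping the IN list short should make each query touch only a few row-key ranges instead of fanning out across many regions, which is consistent with the small queries returning in about 1s in the Phoenix shell.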

> Get aggregate metric records from HBase encounters performance issues
> ---------------------------------------------------------------------
>
>                 Key: AMBARI-20392
>                 URL: https://issues.apache.org/jira/browse/AMBARI-20392
>             Project: Ambari
>          Issue Type: Improvement
>          Components: ambari-metrics
>    Affects Versions: 2.4.2
>            Reporter: Chuan Jin

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)