kylin-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zhong Yanghong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KYLIN-3672) Performance is poor when multiple queries occur in short period
Date Fri, 09 Nov 2018 02:42:01 GMT

    [ https://issues.apache.org/jira/browse/KYLIN-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680763#comment-16680763
] 

Zhong Yanghong commented on KYLIN-3672:
---------------------------------------

awesome analysis

> Performance is poor when multiple queries occur in short period
> ---------------------------------------------------------------
>
>                 Key: KYLIN-3672
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3672
>             Project: Kylin
>          Issue Type: Bug
>          Components: Query Engine
>    Affects Versions: v2.5.0
>         Environment: CentOS 6.7, HBase 1.2.0+cdh5.14.2+456
>            Reporter: Zongwei Li
>            Assignee: Zongwei Li
>            Priority: Critical
>              Labels: patch, performance
>             Fix For: v2.6.0
>
>         Attachments: KYLIN-3672.master.001.patch, TrendChartAfterFix.png, TrendChartBeforeFix.png,
codeChangedCausedThisBug.png, jstackBeforeBugFix.log
>
>
> Hi, Kylin Team
> We found one Kylin performance bug during performance tuning for our BI report integrate
with Kylin.
>  
> +Background+
> Our BI report show customer usage report to enterprise customers, provide 15 usage charts
in report page.
> Each chart need send API request to Kylin with different SQLs. So it means for one user,
it will trigger 15 API calls(by JDBC) to Kylin.
> For our product scale, we need support at least 20 users to review the report at same
time for each Kylin query node.
> So it means each Kylin node should be able to handle 15 * 20 = 300 queries  per second.
>  
> +Performance Report+
> To reduce the network impact. We built up Kylin cluster and testing machine in the same
network with Hadoop system.
> We use gatling and Jmeter tools to do several round testing, result as follow.
>  
> |Thread|Handled Queries (in 60 seconds)|Handled Queries (per second)|Mean Response Time
>  (ms)|
> |1|773|13|77|
> |15|3245|54|279|
> |25|3844|64|390|
> |50|4912|82|612|
> |75|5405|90|841|
> |100|5436|91|1108|
> |150|5434|91|1688|
>  
> And draw the trend chart as follow:
> !TrendChartBeforeFix.png!
>  
> +Conclusion+
> From the trend, when the thread count reach 75, the handled queries per second reaches
peak data 90, and cannot improved by increase the thread count.
> Each Kylin query engine can handle 90 queries per second, it means only support 90/15
= 6 users to review report page at same time.
> Even we setup 3 query nodes, can extend to 18 users at same time, this performance capacity
cannot meet our business requirement.
>  
> +Analyze+
> From test result, response for one thread is fast, but as the thread increase, throughput
of Kylin not increased as we expected.
> We have full code review for Kylin query engine, and use Jstack and JProfile to do analyze,
found the root cause for this performance bottleneck.
> This is one regression bug introduced by new feature involved one year before.
> With bug fixing, one Kylin node can handle 350+ queries per second. Submit this bug for
contribute patch to Kylin.
> +Jstack Log Analyze+
> We use Jstack to capture thread info during performance testing. Already attach one of
them 'jstackBeforeBugFix.log'.
> From the log, we can found that 
> One thread locked at sun.misc.URLClassPath.getNextLoader. TID is {color:#ff0000}*0x000000048007a180*{color}
>  
>  {{"Query e9c44a2d-6226-ff3b-f984-ce8489107d79-3425"}} {{#}}{{3425}} {{daemon prio=}}{{5}}
{{os_prio=}}{{0}} {{tid=}}{{0x000000000472b000}} {{nid=}}{{0x1433}} {{waiting }}{{for}} {{monitor
entry [}}\\{{0x00007f272e40d000}}{{]}}
>   
>  {{   }}{{java.lang.Thread.State: BLOCKED (on object monitor)}}
>   
>  {{    }}{{at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:}}{{469}}{{)}}
>   
>  {{    }}{{- locked <}}{{0x000000048007a180}}{{> (a sun.misc.URLClassPath)}}
>   
>  {{    }}{{at sun.misc.URLClassPath.findResource(URLClassPath.java:}}{{214}}{{)}}
>   
>  {{    }}{{at java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{569}}{{)}}
>   
>  {{    }}{{at java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{567}}{{)}}
>   
>  {{    }}{{at java.security.AccessController.doPrivileged(Native Method)}}
>   
>  {{    }}{{at java.net.URLClassLoader.findResource(URLClassLoader.java:}}{{566}}{{)}}
>   
>  {{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1096}}{{)}}
>   
>  {{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1091}}{{)}}
>   
>  {{    }}{{at org.apache.catalina.loader.WebappClassLoaderBase.getResource(WebappClassLoaderBase.java:}}{{1666}}{{)}}
>   
>  {{    }}{{at org.apache.kylin.common.KylinConfig.buildSiteOrderedProps(KylinConfig.java:}}{{338}}{{)}}
>   
>  43 threads waiting to lock <{color:#ff0000}*0x000000048007a180*{color}> 
>   
>  {{"Query f1f0bbec-a3f7-04b2-1ac6-fd3e03a0232d-4002"}} {{#}}{{4002}} {{daemon prio=}}{{5}}
{{os_prio=}}{{0}} {{tid=}}{{0x00007f27e71e7800}} {{nid=}}{{0x1676}} {{waiting }}{{for}} {{monitor
entry [}}\\{{0x00007f279f503000}}{{]}}
>  {{   }}{{java.lang.Thread.State: BLOCKED (on object monitor)}}
>  {{    }}{{at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:}}{{469}}{{)}}
>  {{    }}{{- waiting to lock <}}{{0x000000048007a180}}{{> (a sun.misc.URLClassPath)}}
>  {{    }}{{at sun.misc.URLClassPath.findResource(URLClassPath.java:}}{{214}}{{)}}
>  {{    }}{{at java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{569}}{{)}}
>  {{    }}{{at java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{567}}{{)}}
>  {{    }}{{at java.security.AccessController.doPrivileged(Native Method)}}
>  {{    }}{{at java.net.URLClassLoader.findResource(URLClassLoader.java:}}{{566}}{{)}}
>  {{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1096}}{{)}}
>  {{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1091}}{{)}}
>  {{    }}{{at org.apache.catalina.loader.WebappClassLoaderBase.getResource(WebappClassLoaderBase.java:}}{{1666}}{{)}}
>  {{    }}{{at org.apache.kylin.common.KylinConfig.buildSiteOrderedProps(KylinConfig.java:}}{{338}}{{)}}
>  
> After dig into code, we can found that when query engine build request to Hbase CoProcessor,
it will export Kylin properties as String, this issue caused by relative logic.
> +Code Analyze+
> In KylinConfig.class
> function private static OrderedProperties buildSiteOrderedProps() 
> 1.  For each thread, will call getResouce to get "kylin-defaults.properties"
> {panel}
> // 1. load default configurations from classpath.
>  // we have a kylin-defaults.properties in kylin/core-common/src/main/resources
>  URL resource = Thread.currentThread().getContextClassLoader().getResource("kylin-defaults.properties");
>  Preconditions.checkNotNull(resource);
>  logger.info("Loading kylin-defaults.properties from {}", resource.getPath());
>  OrderedProperties orderedProperties = new OrderedProperties();
>  loadPropertiesFromInputStream(resource.openStream(), orderedProperties);
> {panel}
> 2. Exist unused logics to call 10 times to getResouce for  "kylin-defaults" + (i) +
".properties", thread LOCKED occurs here.
> {panel}
> for (int i = 0; i < 10; i++) {
>      String fileName = "kylin-defaults" + (i) + ".properties";
>   URL additionalResource = Thread.currentThread().getContextClassLoader().getResource(fileName);
>   if (additionalResource != null) {
>          logger.info("Loading {} from {} ", fileName, additionalResource.getPath());
>   loadPropertiesFromInputStream(additionalResource.openStream(), orderedProperties);
>   }
>  }
> {panel}
> Those logics involved in 2017/6/7
> !codeChangedCausedThisBug.png!
> +What's the solution+
> 1. Because kylin-defaults.properties is built in kylin-core-common-xxxx.jar, no need
to getResource for it every time.
> Move it to static block. Only need load once the class initialized
> {panel}
> buildDefaultOrderedProperties();
> {panel}
> 2. Suggest remove unused logic to getResource for "kylin-defaults" + (i) + ".properties".
To reduce the risk, just keep those logic in the static block, we'd better have a discuss
whether we need this logic, because there is no ' kylin-defaults" + (i) + ".properties' exist
from those logic checked in.
> +Performance Testing Result After Bug Fix+
> |Users|Handled Queries (in 60 seconds)|Handled Queries (per second)|Mean Response Time
> (ms)|
> |1|2451|41|12|
> |15|12422|207|37|
> |25|15600|260|56|
> |50|18481|308|129|
> |75|21055|351|136|
> |100|24036|400|251|
> |150|28014|467|277|
> Trend Chart
> !TrendChartAfterFix.png!
> +Kylin Server Info+
> |*Role*|*vCPU*|*Memory(GB)*|*Volume(GB)*|
> |Query Engine|16 (2.4G)|128|1024|
>  
> +Kylin Package+
> apache-kylin-2.5.0-bin-cdh57.tar.gz (release package)
>  
> +Query SQL+
> SQL with PreparedStatement cache enabled. (New feature in Kylin 2.5.0. If no PreparedStatement
cache, performance will be more worse). Filter will hit all 6 segments.
>  
> +Cube Info+
> Segment Number: 6 Total Size: 47 MB
>  
> Segment: 20180101000000_20181011000000
> Start Time: 2018-01-01 00:00:00
> End Time: 2018-10-11 00:00:00
> Source Count: 351934019
> HBase Table: KYLIN_69Q9A850DZ
> Region Count: 1
> Size: 47 MB
>  
> Segment: 20181011000000_20181012000000
> Start Time: 2018-10-11 00:00:00
> End Time: 2018-10-12 00:00:00
> Source Count: 7085485
> HBase Table: KYLIN_ZCT39S8FUA
> Region Count: 1
> Size: less than 1 MB
>  
>  
> Segment: 20181012000000_20181013000000
> Start Time: 2018-10-12 00:00:00
> End Time: 2018-10-13 00:00:00
> Source Count: 5534968
> HBase Table: KYLIN_RKRRLA958T
> Region Count: 1
> Size: less than 1 MB
>  
> Segment: 20181013000000_20181014000000
> Start Time: 2018-10-13 00:00:00
> End Time: 2018-10-14 00:00:00
> Source Count: 242856
> HBase Table: KYLIN_Q6DKCONN81
> Region Count: 1
> Size: less than 1 MB
>  
> Segment: 20181014000000_20181015000000
> Start Time: 2018-10-14 00:00:00
> End Time: 2018-10-15 00:00:00
> Source Count: 236122
> HBase Table: KYLIN_JY4WQD2MJH
> Region Count: 1
> Size: less than 1 MB
>  
> Segment: 20181015000000_20181016000000
> Start Time: 2018-10-15 00:00:00
> End Time: 2018-10-16 00:00:00
> Source Count: 6172353
> HBase Table: KYLIN_E2ELLINV22
> Region Count: 1
> Size: less than 1 MB
>  
> +HBase Region Server+ 
> Count: 6
> hbase.regionserver.handler.count: 120.
> Not blocked in CoProcessor RPC call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message