kylin-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shaofeng SHI (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KYLIN-3672) Performance is poor when multiple queries occur in short period
Date Tue, 20 Nov 2018 09:57:00 GMT

     [ https://issues.apache.org/jira/browse/KYLIN-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shaofeng SHI updated KYLIN-3672:
--------------------------------
    Fix Version/s:     (was: v2.6.0)
                   v2.5.2

> Performance is poor when multiple queries occur in short period
> ---------------------------------------------------------------
>
>                 Key: KYLIN-3672
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3672
>             Project: Kylin
>          Issue Type: Bug
>          Components: Query Engine
>    Affects Versions: v2.5.0
>         Environment: CentOS 6.7, HBase 1.2.0+cdh5.14.2+456
>            Reporter: Zongwei Li
>            Assignee: Zongwei Li
>            Priority: Critical
>              Labels: patch, performance
>             Fix For: v2.5.2
>
>         Attachments: KYLIN-3672.master.002.patch, TrendChartAfterFix.png, TrendChartBeforeFix.png,
codeChangedCausedThisBug.png, jstackBeforeBugFix.log
>
>
> Hi, Kylin Team
> We found one Kylin performance bug during performance tuning for our BI report integrate
with Kylin.
>  
> +Background+
> Our BI report show customer usage report to enterprise customers, provide 15 usage charts
in report page.
> Each chart need send API request to Kylin with different SQLs. So it means for one user,
it will trigger 15 API calls(by JDBC) to Kylin.
> For our product scale, we need support at least 20 users to review the report at same
time for each Kylin query node.
> So it means each Kylin node should be able to handle 15 * 20 = 300 queries  per second.
>  
> +Performance Report+
> To reduce the network impact. We built up Kylin cluster and testing machine in the same
network with Hadoop system.
> We use gatling and Jmeter tools to do several round testing, result as follow.
>  
> |Thread|Handled Queries (in 60 seconds)|Handled Queries (per second)|Mean Response Time
>  (ms)|
> |1|773|13|77|
> |15|3245|54|279|
> |25|3844|64|390|
> |50|4912|82|612|
> |75|5405|90|841|
> |100|5436|91|1108|
> |150|5434|91|1688|
>  
> And draw the trend chart as follow:
> !TrendChartBeforeFix.png!
>  
> +Conclusion+
> From the trend, when the thread count reach 75, the handled queries per second reaches
peak data 90, and cannot improved by increase the thread count.
> Each Kylin query engine can handle 90 queries per second, it means only support 90/15
= 6 users to review report page at same time.
> Even we setup 3 query nodes, can extend to 18 users at same time, this performance capacity
cannot meet our business requirement.
>  
> +Analyze+
> From test result, response for one thread is fast, but as the thread increase, throughput
of Kylin not increased as we expected.
> We have full code review for Kylin query engine, and use Jstack and JProfile to do analyze,
found the root cause for this performance bottleneck.
> This is one regression bug introduced by new feature involved one year before.
> With bug fixing, one Kylin node can handle 350+ queries per second. Submit this bug for
contribute patch to Kylin.
> +Jstack Log Analyze+
> We use Jstack to capture thread info during performance testing. Already attach one of
them 'jstackBeforeBugFix.log'.
> From the log, we can found that 
> One thread locked at sun.misc.URLClassPath.getNextLoader. TID is {color:#ff0000}*0x000000048007a180*{color}
>  
>  {{"Query e9c44a2d-6226-ff3b-f984-ce8489107d79-3425"}} {{#}}{{3425}} {{daemon prio=}}{{5}}
{{os_prio=}}{{0}} {{tid=}}{{0x000000000472b000}} {{nid=}}{{0x1433}} {{waiting }}{{for}} {{monitor
entry [}}\\\{{0x00007f272e40d000}}{{]}}
>   
>  {{   }}{{java.lang.Thread.State: BLOCKED (on object monitor)}}
>   
>  {{    }}{{at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:}}{{469}}{{)}}
>   
>  {{    }}{{- locked <}}{{0x000000048007a180}}{{> (a sun.misc.URLClassPath)}}
>   
>  {{    }}{{at sun.misc.URLClassPath.findResource(URLClassPath.java:}}{{214}}{{)}}
>   
>  {{    }}{{at java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{569}}{{)}}
>   
>  {{    }}{{at java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{567}}{{)}}
>   
>  {{    }}{{at java.security.AccessController.doPrivileged(Native Method)}}
>   
>  {{    }}{{at java.net.URLClassLoader.findResource(URLClassLoader.java:}}{{566}}{{)}}
>   
>  {{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1096}}{{)}}
>   
>  {{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1091}}{{)}}
>   
>  {{    }}{{at org.apache.catalina.loader.WebappClassLoaderBase.getResource(WebappClassLoaderBase.java:}}{{1666}}{{)}}
>   
>  {{    }}{{at org.apache.kylin.common.KylinConfig.buildSiteOrderedProps(KylinConfig.java:}}{{338}}{{)}}
>   
>  43 threads waiting to lock <{color:#ff0000}*0x000000048007a180*{color}> 
>   
>  {{"Query f1f0bbec-a3f7-04b2-1ac6-fd3e03a0232d-4002"}} {{#}}{{4002}} {{daemon prio=}}{{5}}
{{os_prio=}}{{0}} {{tid=}}{{0x00007f27e71e7800}} {{nid=}}{{0x1676}} {{waiting }}{{for}} {{monitor
entry [}}\\\{{0x00007f279f503000}}{{]}}
>  {{   }}{{java.lang.Thread.State: BLOCKED (on object monitor)}}
>  {{    }}{{at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:}}{{469}}{{)}}
>  {{    }}{{- waiting to lock <}}{{0x000000048007a180}}{{> (a sun.misc.URLClassPath)}}
>  {{    }}{{at sun.misc.URLClassPath.findResource(URLClassPath.java:}}{{214}}{{)}}
>  {{    }}{{at java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{569}}{{)}}
>  {{    }}{{at java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{567}}{{)}}
>  {{    }}{{at java.security.AccessController.doPrivileged(Native Method)}}
>  {{    }}{{at java.net.URLClassLoader.findResource(URLClassLoader.java:}}{{566}}{{)}}
>  {{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1096}}{{)}}
>  {{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1091}}{{)}}
>  {{    }}{{at org.apache.catalina.loader.WebappClassLoaderBase.getResource(WebappClassLoaderBase.java:}}{{1666}}{{)}}
>  {{    }}{{at org.apache.kylin.common.KylinConfig.buildSiteOrderedProps(KylinConfig.java:}}{{338}}{{)}}
>  
> After dig into code, we can found that when query engine build request to Hbase CoProcessor,
it will export Kylin properties as String, this issue caused by relative logic.
> +Code Analyze+
> In KylinConfig.class
> function private static OrderedProperties buildSiteOrderedProps() 
> 1.  For each thread, will call getResouce to get "kylin-defaults.properties"
> {panel}
> // 1. load default configurations from classpath.
>  // we have a kylin-defaults.properties in kylin/core-common/src/main/resources
>  URL resource = Thread.currentThread().getContextClassLoader().getResource("kylin-defaults.properties");
>  Preconditions.checkNotNull(resource);
>  logger.info("Loading kylin-defaults.properties from {}", resource.getPath());
>  OrderedProperties orderedProperties = new OrderedProperties();
>  loadPropertiesFromInputStream(resource.openStream(), orderedProperties);
> {panel}
> 2. Exist unused logics to call 10 times to getResouce for  "kylin-defaults" + (i) +
".properties", thread LOCKED occurs here.
> {panel}
> for (int i = 0; i < 10; i++) {
>      String fileName = "kylin-defaults" + (i) + ".properties";
>   URL additionalResource = Thread.currentThread().getContextClassLoader().getResource(fileName);
>   if (additionalResource != null) {
>          logger.info("Loading {} from {} ", fileName, additionalResource.getPath());
>   loadPropertiesFromInputStream(additionalResource.openStream(), orderedProperties);
>   }
>  }
> {panel}
> Those logics involved in 2017/6/7
> !codeChangedCausedThisBug.png!
> +What's the solution+
> 1. Because kylin-defaults.properties is built in kylin-core-common-xxxx.jar, no need
to getResource for it every time.
> Move it into method getInstanceFromEnv(). Because it only need load once.
> And not to impact CoProcessor logic. In CoProcessor class CubeVisitService, it will
use KylinConfig as util class to generate config object from String, it's dangerous to involve
any logic of load properties file with CoProcessor due to there is no Kylin.properties in
its package, so it will not call  KylinConfig.getInstanceFromEnv().
> {panel}
> buildDefaultOrderedProperties();
> {panel}
> 2. Suggest remove unused logic to getResource for "kylin-defaults" + ( i ) + ".properties".
To reduce the risk, just keep those logic in the static block, we'd better have a discuss
whether we need this logic, because there is no ' kylin-defaults" + ( i ) + ".properties'
exist from those logic checked in.
> +Performance Testing Result After Bug Fix+
> |Users|Handled Queries (in 60 seconds)|Handled Queries (per second)|Mean Response Time
>  (ms)|
> |1|2451|41|12|
> |15|12422|207|37|
> |25|15600|260|56|
> |50|18481|308|129|
> |75|21055|351|136|
> |100|24036|400|251|
> |150|28014|467|277|
> Trend Chart
> !TrendChartAfterFix.png!
> +Kylin Server Info+
> |*Role*|*vCPU*|*Memory(GB)*|*Volume(GB)*|
> |Query Engine|16 (2.4G)|128|1024|
>  
> +Kylin Package+
> apache-kylin-2.5.0-bin-cdh57.tar.gz (release package)
>  
> +Query SQL+
> SQL with PreparedStatement cache enabled. (New feature in Kylin 2.5.0. If no PreparedStatement
cache, performance will be more worse). Filter will hit all 6 segments.
>  
> +Cube Info+
> Segment Number: 6 Total Size: 47 MB
>  
> Segment: 20180101000000_20181011000000
> Start Time: 2018-01-01 00:00:00
> End Time: 2018-10-11 00:00:00
> Source Count: 351934019
> HBase Table: KYLIN_69Q9A850DZ
> Region Count: 1
> Size: 47 MB
>  
> Segment: 20181011000000_20181012000000
> Start Time: 2018-10-11 00:00:00
> End Time: 2018-10-12 00:00:00
> Source Count: 7085485
> HBase Table: KYLIN_ZCT39S8FUA
> Region Count: 1
> Size: less than 1 MB
>  
>  
> Segment: 20181012000000_20181013000000
> Start Time: 2018-10-12 00:00:00
> End Time: 2018-10-13 00:00:00
> Source Count: 5534968
> HBase Table: KYLIN_RKRRLA958T
> Region Count: 1
> Size: less than 1 MB
>  
> Segment: 20181013000000_20181014000000
> Start Time: 2018-10-13 00:00:00
> End Time: 2018-10-14 00:00:00
> Source Count: 242856
> HBase Table: KYLIN_Q6DKCONN81
> Region Count: 1
> Size: less than 1 MB
>  
> Segment: 20181014000000_20181015000000
> Start Time: 2018-10-14 00:00:00
> End Time: 2018-10-15 00:00:00
> Source Count: 236122
> HBase Table: KYLIN_JY4WQD2MJH
> Region Count: 1
> Size: less than 1 MB
>  
> Segment: 20181015000000_20181016000000
> Start Time: 2018-10-15 00:00:00
> End Time: 2018-10-16 00:00:00
> Source Count: 6172353
> HBase Table: KYLIN_E2ELLINV22
> Region Count: 1
> Size: less than 1 MB
>  
> +HBase Region Server+ 
> Count: 6
> hbase.regionserver.handler.count: 120.
> Not blocked in CoProcessor RPC call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message