Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 3B1F2200C67 for ; Mon, 15 May 2017 10:45:18 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 39C45160BC2; Mon, 15 May 2017 08:45:18 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 3C4E2160BC1 for ; Mon, 15 May 2017 10:45:16 +0200 (CEST) Received: (qmail 26838 invoked by uid 500); 15 May 2017 08:45:15 -0000 Mailing-List: contact user-help@kylin.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@kylin.apache.org Delivered-To: mailing list user@kylin.apache.org Received: (qmail 26828 invoked by uid 99); 15 May 2017 08:45:15 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 15 May 2017 08:45:15 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id E07271800A8 for ; Mon, 15 May 2017 08:45:14 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.148 X-Spam-Level: ** X-Spam-Status: No, score=2.148 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=msn.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id 65o98j2xh5KS for ; Mon, 15 May 2017 08:45:05 +0000 (UTC) Received: from NAM02-SN1-obe.outbound.protection.outlook.com (mail-oln040092005068.outbound.protection.outlook.com [40.92.5.68]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 892635FDBC for ; Mon, 15 May 2017 08:45:04 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=msn.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version; bh=H7e8w38ALniRmc2MNVJyeB6d4v8Tl36dVcQfci8M6+k=; b=JutpzaQBSMYX34zd2DEHs6JIzwgRGw8iZ83pY1Hzu/CvN8bk775kiKqoduRdiIcIEWkEd7dsw8ZxPgSV9l5mH4wVdRZ/kkPUZUyrQTdJhWd3S2IgngzEBJ1lkbXh0TLHUsDUFNZtX7ySRT6WBwr8NsHO+z3fwkuT38QYFtJA8sHdD2PwThqrRTUl+6f/d9mfl+8Sekpj+/FCwk6ocqB4T3OqDDcKTXjO/pok+UwGLl3ll4Bya4goWUMH2/cZk0T8Q1MxYyjIBvuGVBgheDaI1cFLIUvg7ThfIrvPAq/zJR1QRRaCnL4xN9AYkMLCCrc/Duww55X9I27TtVQZOUAVdg== Received: from SN1NAM02FT032.eop-nam02.prod.protection.outlook.com (10.152.72.55) by SN1NAM02HT186.eop-nam02.prod.protection.outlook.com (10.152.72.136) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P384) id 15.1.1075.5; Mon, 15 May 2017 08:44:58 +0000 Received: from BN3PR16MB0868.namprd16.prod.outlook.com (10.152.72.56) by SN1NAM02FT032.mail.protection.outlook.com (10.152.72.126) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.1075.5 via Frontend Transport; Mon, 15 May 2017 08:44:58 +0000 Received: from BN3PR16MB0868.namprd16.prod.outlook.com ([10.165.81.146]) by BN3PR16MB0868.namprd16.prod.outlook.com ([10.165.81.146]) with mapi id 15.01.1084.029; Mon, 15 May 2017 08:44:58 +0000 From: Tingmao Lin To: "user@kylin.apache.org" Subject: Re: Questions about SUM behavior when rewritten as TOPN Thread-Topic: Questions about SUM behavior when rewritten as TOPN Thread-Index: AQHSyy1zRw1uvbb85UGoDgSXBcLC3aHy+diAgACSuY2AAFWegIAAMPGkgACOKICAAFzg9A== Date: Mon, 15 May 2017 08:44:57 +0000 Message-ID: References: , In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: authentication-results: kylin.apache.org; dkim=none (message not signed) header.d=none;kylin.apache.org; dmarc=none action=none header.from=msn.com; x-incomingtopheadermarker: OriginalChecksum:CFE51ADE5D64F798E748EFD30BE088F77AF615AFE09101F0A0D72CBF95204B68;UpperCasedChecksum:CB0704021CD7D8E375AA3DBC4AF1ACBE3C987FEEE3E4190C752A53D4290E6DF9;SizeAsReceived:8575;Count:44 x-tmn: [VoUNQ3J2sKUxoWbY8WRU7am2Ow099MRR] x-ms-publictraffictype: Email x-microsoft-exchange-diagnostics: 1;SN1NAM02HT186;5:Pd2+hMPN2OokZ4zeD7yD/YIiszaLtI2nA/s92GG2eLcfZo0m5JPVI99gmLyKWcUalf2cW8jj0ZjTgVcvLGK0QqTgAIFkEsz/D34/jDtdoVLK7eo9GJ5mfgJjYU0dN2w3nHl81pWQJtU5O/3sA6NFYA==;24:2G0AeKnu1Q86XXKBIDbc0Xo9X7d6X2MAXrA9NUpQUE//8ZdGzGKisM+GMAU/ksCBNZLRR3Y0Ix5q8nQMWpiONgX+RmIfF+S5XQ1LTsRyomw=;7:9DYBWiiNS5a9V4i9k/qCQacUjP9NZGZantgAAa2iS1F6fJkOyc9DrD6zMHEv5Oh51yzH3ySHMbY/fPNzhFD//DQGiNz4YUw3HoYPG99kIij9xJ90XJhKsT2qZeCZxTSmPED27RtaZHzsLCFPzvgJ8uhOrT/hTusBbzsgMCUmymEMSlmNg4+zZ3cIGgB2Es4QXu+JyjQw3mglJWrCB+8+wnMUInxuEV3Pbqyac7t56Pd18ie4UeaPZeMhNfBtBGc2nutGDsoyZff+Zl7mZVZZ+lxHI5POF+/V47URUep0bX4xK9Sy7+f1XpmftlW2UJ2v x-incomingheadercount: 44 x-eopattributedmessage: 0 x-forefront-antispam-report: EFV:NLI;SFV:NSPM;SFS:(7070007)(98901004);DIR:OUT;SFP:1901;SCL:1;SRVR:SN1NAM02HT186;H:BN3PR16MB0868.namprd16.prod.outlook.com;FPR:;SPF:None;LANG:en; x-ms-office365-filtering-correlation-id: 3e397c65-5144-4544-3c98-08d49b6eaae7 x-microsoft-antispam: UriScan:;BCL:0;PCL:0;RULEID:(22001)(201702061074)(5061506573)(5061507331)(1603103135)(2017031320274)(201702181274)(2017031322274)(1601125374)(1603101448)(1701031045);SRVR:SN1NAM02HT186; x-ms-exchange-slblob-mailprops: ERoE7pfUhoaT6oyeTGts1xVZuPg8qI/Rz7ndj735aBRFTFhLJ68kZoF6herAmcDwoAEG3X29xOy3GJkJOCiKGn5dD5HDRG6ldi6mkS6ta5WML4nZnk4EGqQqe8Eg94qQ39By1zGfrlFRX5qWKMyVN4esTlU+dfbFAkvU7b1eAjZQmKkD6K19mYOK/Whm/S8lZWBHfFDGQDYWdrDMMyzwyUNR00Gf18KQvEAqjhx9JSqPJPQ4mHzrhl8dX+LyqKX8gFyNS8IkimSTiIUEPBOF3YI6MGlV8aACuHShyWHPB4VIXbE+PwnOsl/oXYjD4hUvqRIqBwM9UCekf25ua5b8B8wQ6t7movRFYF/JFGFqVI/Y/1e4Ox5WTR5LWZi6zbgNZhCNgUYfxl1SONDztYMbIOqJhK7yNjSorVVbJTaB2v6hqkv1/34QCgV7m3Spqs/F9msVk5t151JjRKBiepxjZTBEtWgakS9z0qedJgTgh0q2F0Ag/ut+EUo62Hmq0rXHs1m+BNGSFDJixwAYoErUxQjzJjFbev0qw/sZxIHoZ1N9S7/7xCjgcHPXyf7EBdYUILRZE9LVm6c2Re80QzB7q00nbj5O6MAX+R3HMFi0vs23Ezp7c3s/Gd+EZVmgF7dQhZfwABlOmaXY8ufHdd5D9hqj86pTIzTPobY9ASYzU8o3459E86O70w== x-exchange-antispam-report-cfa-test: BCL:0;PCL:0;RULEID:(444000031);SRVR:SN1NAM02HT186;BCL:0;PCL:0;RULEID:;SRVR:SN1NAM02HT186; x-forefront-prvs: 0308EE423E spamdiagnosticoutput: 1:99 spamdiagnosticmetadata: NSPM Content-Type: multipart/alternative; boundary="_000_BN3PR16MB0868A1658F4802909D9F24268EE10BN3PR16MB0868namp_" MIME-Version: 1.0 X-OriginatorOrg: outlook.com X-MS-Exchange-CrossTenant-originalarrivaltime: 15 May 2017 08:44:57.9719 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Internet X-MS-Exchange-CrossTenant-id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa X-MS-Exchange-Transport-CrossTenantHeadersStamped: SN1NAM02HT186 archived-at: Mon, 15 May 2017 08:45:18 -0000 --_000_BN3PR16MB0868A1658F4802909D9F24268EE10BN3PR16MB0868namp_ Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable The query log and the cube definition is here: Query without rewrite 2017-05-15 14:59:50,741 INFO [pool-8-thread-1] threadpool.DefaultScheduler= :118 : Job Fetcher: 0 should running, 0 actual running, 0 ready, 577 alread= y succeed, 19 error, 11 discarded, 0 others 2017-05-15 14:59:51,637 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] service.QueryService:336 : Using project: TEST 2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] service.QueryService:337 : The original query: select sum(v),count(v) = from test group by lv2_id 2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] service.QueryService:440 : The corrected query: select sum(v),count(v) = from test group by lv2_id LIMIT 50000 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] routing.QueryRouter:48 : The project manager's reference is org.apache.= kylin.metadata.project.ProjectManager@ac216f8 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and proj= ect=3DTEST : CUBE[name=3DTEST1ROWS1DIM] 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.ro= uting.rules.RemoveBlackoutRealizationsRule, realizations before: [TEST1ROWS= 1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)] 2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.ro= uting.rules.RemoveUncapableRealizationsRule, realizations before: [TEST1ROW= S1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)] 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] rules.RealizationSortRule:40 : CUBE[name=3DTEST1ROWS1DIM] priority 1 co= st 73. 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.ro= uting.rules.RealizationSortRule, realizations before: [TEST1ROWS1DIM(CUBE)]= , realizations after: [TEST1ROWS1DIM(CUBE)] 2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUB= E)] And the final chosen one is the first one 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube=3DTEST1RO= WS1DIM, cuboidId=3D1, groupsD=3D[DEFAULT.TEST.LV2_ID], otherDimsD=3D[] 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[DEFAULT.TEST.LV2_ID] 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[] 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAULT.TEST.LV2= _ID] 2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] gtrecord.GTCubeStorageQueryBase:238 : Does not need storage aggregation 2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] storage.StorageContext:134 : Not enabling limit push down because the l= imit(including offset) 50000 is larger than kylin.query.pushdown.limit.max = 10000 2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment 29= 542_29543 2017-05-15 14:59:51,763 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 311 bytes, ra= wScanBytesString 40 bytes 2017-05-15 14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] v2.CubeHBaseEndpointRPC:141 : The scan 6fdd1fd2 for segment TEST1ROWS1D= IM[29542_29543] is as below with 1 separate raw scans, shard part of start/= end key is set to 0 2017-05-15 14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid exa= ct match, from 1 to 1 Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 (= \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop: \x00\x00\x00\x00\x00\x= 00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x0= 0), No Fuzzy Key 2017-05-15 14:59:51,768 INFO [kylin-coproc--pool14-t14] v2.CubeHBaseEndpoi= ntRPC:200 : Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ Shard= \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x= 31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\= x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63= \x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03.Total scanned row: 1= . Total filtered/aggred row: 0. Time elapsed in EP: 1(ms). Server CPU usage= : 0.04998115611708333, server physical mem left: 8.621008896E10, server swa= p mem left:0.0.Etc message: start latency: 58@0,agg done@1,compress done@1,= server stats done@1, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;.= Normal Complete: true. 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] service.QueryService:576 : Scan count for each storageContext: 1, 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] service.QueryService:359 : Stats of SQL response: isException: false, d= uration: 133, total scan count 1 2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-1= 04] service.QueryService:284 : =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D[QUERY]=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D Query Id: cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20 SQL: select sum(v),count(v) from test group by lv2_id LIMIT 50000 User: ADMIN Success: true Duration: 0.133 Project: TEST Realization Names: [TEST1ROWS1DIM] Cuboid Ids: [1] Total scan count: 1 Result row count: 1 Accept Partial: true Is Partial Result: false Hit Exception Cache: false Storage cache used: false Message: null =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D[QUERY]=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D Query with rewrite: 2017-05-15 14:59:34,156 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] service.QueryService:336 : Using project: TEST 2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] service.QueryService:337 : The original query: select sum(v) from test= group by lv2_id 2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] service.QueryService:440 : The corrected query: select sum(v) from test= group by lv2_id LIMIT 50000 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] routing.QueryRouter:48 : The project manager's reference is org.apache.= kylin.metadata.project.ProjectManager@ac216f8 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and proj= ect=3DTEST : CUBE[name=3DTEST1ROWS1DIM] 2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.ro= uting.rules.RemoveBlackoutRealizationsRule, realizations before: [TEST1ROWS= 1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)] 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] cube.CubeCapabilityChecker:191 : Cube CUBE[name=3DTEST1ROWS1DIM] Capabi= lityInfluences: TOP10@class org.apache.kylin.measure.topn.TopNMeasureType 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.ro= uting.rules.RemoveUncapableRealizationsRule, realizations before: [TEST1ROW= S1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)] 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] rules.RealizationSortRule:40 : CUBE[name=3DTEST1ROWS1DIM] priority 1 co= st 21. 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.ro= uting.rules.RealizationSortRule, realizations before: [TEST1ROWS1DIM(CUBE)]= , realizations after: [TEST1ROWS1DIM(CUBE)] 2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUB= E)] And the final chosen one is the first one 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] topn.TopNMeasureType:338 : Rewrite function FunctionDesc [expression=3D= SUM, parameter=3DParameterDesc [type=3Dcolumn, value=3DV, nextParam=3Dnull]= , returnType=3Dbigint] to FunctionDesc [expression=3DTOP_N, parameter=3DPar= ameterDesc [type=3Dcolumn, value=3DV, nextParam=3DParameterDesc [type=3Dcol= umn, value=3DLV2_ID, nextParam=3Dnull]], returnType=3Dtopn(10)] 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube=3DTEST1RO= WS1DIM, cuboidId=3D1, groupsD=3D[], otherDimsD=3D[] 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[] 2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[] 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAULT.TEST.LV2= _ID] 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] gtrecord.GTCubeStorageQueryBase:241 : Need storage aggregation 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] storage.StorageContext:134 : Not enabling limit push down because the l= imit(including offset) 50000 is larger than kylin.query.pushdown.limit.max = 10000 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] gtrecord.GTCubeStorageQueryBase:334 : Memory budget is set to 805306 ro= ws 2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment 29= 542_29543 2017-05-15 14:59:34,219 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] dict.DictionaryManager:420 : DictionaryManager(1833236614) loading Dict= ionaryInfo(loadDictObj:true) at /dict/DEFAULT.TEST/LV2_ID/7be3d6b1-0ed4-4b2= d-9015-0c92b0757725.dict 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 306 bytes, ra= wScanBytesString 40 bytes 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] v2.CubeHBaseEndpointRPC:141 : The scan 37ca24c9 for segment TEST1ROWS1D= IM[29542_29543] is as below with 1 separate raw scans, shard part of start/= end key is set to 0 2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid req= uire post aggregation, from 0 to 1 Start: \x00\x00\x00\x00\x00\x00\x00\x00\= x00\x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop: \x00\x00\= x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x= 00\x01\xFF\x00), No Fuzzy Key 2017-05-15 14:59:34,256 INFO [kylin-coproc--pool14-t13] v2.CubeHBaseEndpoi= ntRPC:200 : Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ Shard= \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x= 31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\= x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63= \x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03.Total scanned row: 1= . Total filtered/aggred row: 0. Time elapsed in EP: 2(ms). Server CPU usage= : 0.07331964572432702, server physical mem left: 8.619999232E10, server swa= p mem left:0.0.Etc message: start latency: 33@1,agg done@1,compress done@1,= server stats done@2, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;.= Normal Complete: true. 2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] service.QueryService:576 : Scan count for each storageContext: 1, 2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] service.QueryService:359 : Stats of SQL response: isException: false, d= uration: 102, total scan count 1 2017-05-15 14:59:34,260 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-1= 08] service.QueryService:284 : =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D[QUERY]=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D Query Id: 54ac350a-7676-4601-a109-04dec543e56e SQL: select sum(v) from test group by lv2_id LIMIT 50000 User: ADMIN Success: true Duration: 0.102 Project: TEST Realization Names: [TEST1ROWS1DIM] Cuboid Ids: [1] Total scan count: 1 Result row count: 1 Accept Partial: true Is Partial Result: false Hit Exception Cache: false Storage cache used: false Message: null =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D[QUERY]=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D cube definition: { "uuid": "4a7fae8c-b518-4828-8f75-e6cd04689d75", "last_modified": 1494573765171, "version": "1.6.0", "name": "TEST1ROWS1DIM", "model_name": "TEST", "description": "", "null_string": null, "dimensions": [ { "name": "LV2_ID", "table": "DEFAULT.TEST", "column": "LV2_ID", "derived": null }, { "name": "MINUTE_START", "table": "DEFAULT.TEST", "column": "MINUTE_START", "derived": null } ], "measures": [ { "name": "_COUNT_", "function": { "expression": "COUNT", "parameter": { "type": "constant", "value": "1", "next_parameter": null }, "returntype": "bigint" }, "dependent_measure_ref": null }, { "name": "COUNT", "function": { "expression": "SUM", "parameter": { "type": "column", "value": "V", "next_parameter": null }, "returntype": "bigint" }, "dependent_measure_ref": null }, { "name": "TOP10", "function": { "expression": "TOP_N", "parameter": { "type": "column", "value": "V", "next_parameter": { "type": "column", "value": "LV2_ID", "next_parameter": null } }, "returntype": "topn(10)", "configuration": { "topn.encoding.LV2_ID": "dict" } }, "dependent_measure_ref": null } ], "dictionaries": [], "rowkey": { "rowkey_columns": [ { "column": "MINUTE_START", "encoding": "time", "isShardBy": false }, { "column": "LV2_ID", "encoding": "dict", "isShardBy": false } ] }, "hbase_mapping": { "column_family": [ { "name": "F1", "columns": [ { "qualifier": "M", "measure_refs": [ "_COUNT_", "COUNT", "TOP10" ] } ] } ] }, "aggregation_groups": [ { "includes": [ "MINUTE_START", "LV2_ID" ], "select_rule": { "hierarchy_dims": [], "mandatory_dims": [], "joint_dims": [] } } ], "signature": "o75jqwjKPFF859EyU6j9eA=3D=3D", "notify_list": [], "status_need_notify": [ "ERROR", "DISCARDED", "SUCCEED" ], "partition_date_start": 0, "partition_date_end": 3153600000000, "auto_merge_time_ranges": [ 604800000, 2419200000 ], "retention_range": 0, "engine_type": 2, "storage_type": 2, "override_kylin_properties": {} } ________________________________ From: ShaoFeng Shi Sent: Monday, May 15, 2017 1:37 AM To: user Subject: Re: Questions about SUM behavior when rewritten as TOPN Hi Tingmao, Your finding is correct; Kylin didn't check whether there is "order by limi= t" pattern or not. The main reason is, the SQL parser didn't push down the = "order by limit" info here (at least when TopN was developed), so I couldn'= t leverage that. Another reason is usually we won't put the "group and ord= er" column as a normal cube dimension (but Kylin didn't stop you from doing= that). Please feel free to open a JIRA on the checking "order by limit" pattern, I= can check that later. 2017-05-15 2:37 GMT+08:00 Tingmao Lin >: Yeah, we looked into the TopNCounter source code and found that for low car= dinality scenario, in TopNCounter, m1=3Dm2=3D0 so it just sum up the values= . But the result still goes wrong. I will collect more information for inve= stigation ASAP :) I have a question: as Billy Liu said in this thread kylin will check the OR= DER BY clause to determine whether to rewrite. But I didn't find any acces= s to SQLdigest.sortColumns in TopNMeasureType.influenceCapabilityCheck().= Does kylin check it elsewhere ? It seems that if sum(measure) is the = only measure in the query and the group by column matches, then TopNMeasu= reType.isTopNCompatibleSum() will pass and the query get rewritten. This= confuses the user since they may expect a accurate result for every distin= ct value of group by column(s). ________________________________ From: ShaoFeng Shi > Sent: Sunday, May 14, 2017 2:13 PM To: user Subject: Re: Questions about SUM behavior when rewritten as TOPN Agree with Yang's points; When cardinality is small than the TopN counter's= capacity, the result should be accurate. I checked the 1.6 source code and= didn't find clue. Please share more information (cube definition and query= logs) for investigation, thanks! For negtive number in TopN, actually that isn't recommended, as it goes aga= inst TopN's purpose, which is to counting something happened. When merging = two TopN counters, one counter will use another's last element's number to = accumulate (if another is full) on its elements (as a supplement). If the l= ast element is close to 0, its impaction will be minor. But if the last ele= ment is a big negative value, you know it's impaction will be considerable!= It doesn't make sense to reduce existing element's counting value if the = last element's value is negative. So please use it properly in your scenari= o. Ofcourse, I think Kylin should also add more checks there. 2017-05-14 17:18 GMT+08:00 Tingmao Lin >: The SQL in the original email is exactly what we input in the "insight" tab= in kylin admin UI. I do not have access to the host running kylin now ,and I will post the det= ailed log output tomorrow. We reproduced the inaccurate result behavior using a source table with <10 = rows and I should be able to write a reproduce step tomorrow. ________________________________ From: Billy Liu > Sent: Sunday, May 14, 2017 12:21 AM To: user Subject: Re: Questions about SUM behavior when rewritten as TOPN Thanks Tingmao for the report. Could you show us the complete SQL? In your SQL, there is no order by state= ment. If no ORDER BY, the query should not be rewritten into TopN measure. 2017-05-12 23:52 GMT+08:00 Tingmao Lin >: Hi, We found that SUM() query on a cardinality 1 dimension is not accurate (or = "not correct") when automatically rewritten as TOPN. Is that the expected behavior of kylin or there are any other issue? We built a cube on a table ( measure1: bigint, dim1_id:varchar, dim2_id:var= char, ... ) using kylin 1.6.0 (Kafka streaming source) The cube has two measures: SUM(measure1) and TOPN(10,sum-orderby(measure1),= group by dim2_id) . (other measures omitted) and two dimensions dim1_id, dim2_id (other dims omitted) About the source table data: The cardinality of dim1_id is 1 (same dim1_id for all rows in the source t= able) The cardinality of dim2_id is 1 (same dim2_id for all rows in the source t= able) The possible value of measure1 is [1,0,-1] When we query "select SUM(measure1) FROM table GROUP BY dim2_id" = =3D> the result has one row:"sum=3D7", from the kylin logs we found that the query has been automatically r= ewritten as TOPN(measure1,sum-orderby(measure1),group by dim2_id) When we write another query to prevent TOPN rewrite, for example: "select SUM(measure1),count(*) FROM table GROUP BY dim2_id" =3D> o= ne row -- "sum=3D-2,count=3D24576" "select SUM(measure1),count(*) FROM table" = =3D> one row -- "sum=3D-2,count=3D24576" The result is different (7 and -2) when rewritting to TOPN or not. My question is: are the following behavior "works as expected" ,or TOPN alg= orithm does not support negative counter values very well , or any issue th= ere? 1. SUM() query automatically rewritten as TOPN and gives approximated resu= lt when no TOPN present in the query. 2. When cardinality is 1, TOPN does not give accurate result. Thanks. -- Best regards, Shaofeng Shi ??? -- Best regards, Shaofeng Shi ??? --_000_BN3PR16MB0868A1658F4802909D9F24268EE10BN3PR16MB0868namp_ Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable

The query log and the cube definition is here:



Query without rewrite
2017-05-15 14:59:50,741 INFO  [pool-8-thread-1] threadpool.Defaul= tScheduler:118 : Job Fetcher: 0 should running, 0 actual running, 0 ready, = 577 already succeed, 19 error, 11 discarded, 0 others
2017-05-15 14:59:51,637 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] service.QueryService:336 : Using project: TEST
2017-05-15 14:59:51,638 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] service.QueryService:337 : The original query:  select s= um(v),count(v) from test group by lv2_id
2017-05-15 14:59:51,638 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] service.QueryService:440 : The corrected query: select sum(v)= ,count(v) from test group by lv2_id
LIMIT 50000
2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] routing.QueryRouter:48 : The project manager's reference is o= rg.apache.kylin.metadata.project.ProjectManager@ac216f8
2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] routing.QueryRouter:60 : Find candidates by table DEFAULT.TES= T and project=3DTEST : CUBE[name=3DTEST1ROWS1DIM]
2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kyli= n.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: = [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kyli= n.query.routing.rules.RemoveUncapableRealizationsRule, realizations before:= [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,675 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] rules.RealizationSortRule:40 : CUBE[name=3DTEST1ROWS1DIM] pri= ority 1 cost 73. 
2017-05-15 14:59:51,675 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kyli= n.query.routing.rules.RealizationSortRule, realizations before: [TEST1ROWS1= DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,675 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] routing.QueryRouter:72 : The realizations remaining: [TEST1RO= WS1DIM(CUBE)] And the final chosen one is the first one
2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube= =3DTEST1ROWS1DIM, cuboidId=3D1, groupsD=3D[DEFAULT.TEST.LV2_ID], otherDimsD= =3D[]
2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[DEFAULT.TEST.L= V2_ID]
2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAUL= T.TEST.LV2_ID]
2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] gtrecord.GTCubeStorageQueryBase:238 : Does not need storage a= ggregation
2017-05-15 14:59:51,708 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] storage.StorageContext:134 : Not enabling limit push down bec= ause the limit(including offset) 50000 is larger than kylin.query.pushdown.= limit.max 10000
2017-05-15 14:59:51,708 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for = segment 29542_29543
2017-05-15 14:59:51,763 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 311= bytes, rawScanBytesString 40 bytes
2017-05-15 14:59:51,764 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] v2.CubeHBaseEndpointRPC:141 : The scan 6fdd1fd2 for segment T= EST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard part= of start/end key is set to 0
2017-05-15 14:59:51,764 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: = cuboid exact match, from 1 to 1 Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00= \x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop:  \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00= \x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
2017-05-15 14:59:51,768 INFO  [kylin-coproc--pool14-t14] v2.CubeH= BaseEndpointRPC:200 : <sub-thread for Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20 GTScanRequest 6fdd1fd2>Endpoint RPC returned from HTable KYLIN_= EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x= 4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\= x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64= \x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03.Total scanned row: 1. Total filtered/aggred row: 0. Time= elapsed in EP: 1(ms). Server CPU usage: 0.04998115611708333, server physic= al mem left: 8.621008896E10, server swap mem left:0.0.Etc message: start la= tency: 58@0,agg done@1,compress done@1,server stats done@1, debugGitTag:0fca07178088320fda44513c63a1134b7d= 39427c;.Normal Complete: true.
2017-05-15 14:59:51,771 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] service.QueryService:576 : Scan count for each storageContext= : 1,
2017-05-15 14:59:51,771 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] service.QueryService:359 : Stats of SQL response: isException= : false, duration: 133, total scan count 1
2017-05-15 14:59:51,771 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf= 38ebcd20-104] service.QueryService:284 : 
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D[QUERY]=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Query Id: cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20
SQL: select sum(v),count(v) from test group by lv2_id
LIMIT 50000
User: ADMIN
Success: true
Duration: 0.133
Project: TEST
Realization Names: [TEST1ROWS1DIM]
Cuboid Ids: [1]
Total scan count: 1
Result row count: 1
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Message: null
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D[QUERY]=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D





Query with rewrite:
2017-05-15 14:59:34,156 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] service.QueryService:336 : Using project: TEST
2017-05-15 14:59:34,157 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] service.QueryService:337 : The original query:  select s= um(v) from test group by lv2_id
2017-05-15 14:59:34,157 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] service.QueryService:440 : The corrected query: select sum(v)= from test group by lv2_id
LIMIT 50000
2017-05-15 14:59:34,185 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] routing.QueryRouter:48 : The project manager's reference is o= rg.apache.kylin.metadata.project.ProjectManager@ac216f8
2017-05-15 14:59:34,185 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] routing.QueryRouter:60 : Find candidates by table DEFAULT.TES= T and project=3DTEST : CUBE[name=3DTEST1ROWS1DIM]
2017-05-15 14:59:34,185 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kyli= n.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: = [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] cube.CubeCapabilityChecker:191 : Cube CUBE[name=3DTEST1ROWS1D= IM] CapabilityInfluences: TOP10@class org.apache.kylin.measure.topn.TopNMea= sureType
2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kyli= n.query.routing.rules.RemoveUncapableRealizationsRule, realizations before:= [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] rules.RealizationSortRule:40 : CUBE[name=3DTEST1ROWS1DIM] pri= ority 1 cost 21. 
2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kyli= n.query.routing.rules.RealizationSortRule, realizations before: [TEST1ROWS1= DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] routing.QueryRouter:72 : The realizations remaining: [TEST1RO= WS1DIM(CUBE)] And the final chosen one is the first one
2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] topn.TopNMeasureType:338 : Rewrite function FunctionDesc [exp= ression=3DSUM, parameter=3DParameterDesc [type=3Dcolumn, value=3DV, nextPar= am=3Dnull], returnType=3Dbigint] to FunctionDesc [expression=3DTOP_N, parameter=3DParameterDesc [type=3Dcolumn, value=3DV, = nextParam=3DParameterDesc [type=3Dcolumn, value=3DLV2_ID, nextParam=3Dnull]= ], returnType=3Dtopn(10)]
2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube= =3DTEST1ROWS1DIM, cuboidId=3D1, groupsD=3D[], otherDimsD=3D[]
2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[]
2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAUL= T.TEST.LV2_ID]
2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] gtrecord.GTCubeStorageQueryBase:241 : Need storage aggregatio= n
2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] storage.StorageContext:134 : Not enabling limit push down bec= ause the limit(including offset) 50000 is larger than kylin.query.pushdown.= limit.max 10000
2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] gtrecord.GTCubeStorageQueryBase:334 : Memory budget is set to= 805306 rows
2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for = segment 29542_29543
2017-05-15 14:59:34,219 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] dict.DictionaryManager:420 : DictionaryManager(1833236614) lo= ading DictionaryInfo(loadDictObj:true) at /dict/DEFAULT.TEST/LV2_ID/7be3d6b= 1-0ed4-4b2d-9015-0c92b0757725.dict
2017-05-15 14:59:34,221 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 306= bytes, rawScanBytesString 40 bytes
2017-05-15 14:59:34,221 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] v2.CubeHBaseEndpointRPC:141 : The scan 37ca24c9 for segment T= EST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard part= of start/end key is set to 0
2017-05-15 14:59:34,221 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: = cuboid require post aggregation, from 0 to 1 Start: \x00\x00\x00\x00\x00\x0= 0\x00\x00\x00\x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop:  \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00= \x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
2017-05-15 14:59:34,256 INFO  [kylin-coproc--pool14-t13] v2.CubeH= BaseEndpointRPC:200 : <sub-thread for Query 54ac350a-7676-4601-a109-04de= c543e56e GTScanRequest 37ca24c9>Endpoint RPC returned from HTable KYLIN_= EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x= 4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\= x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64= \x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03.Total scanned row: 1. Total filtered/aggred row: 0. Time= elapsed in EP: 2(ms). Server CPU usage: 0.07331964572432702, server physic= al mem left: 8.619999232E10, server swap mem left:0.0.Etc message: start la= tency: 33@1,agg done@1,compress done@1,server stats done@2, debugGitTag:0fca07178088320fda44513c63a1134b7d= 39427c;.Normal Complete: true.
2017-05-15 14:59:34,259 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] service.QueryService:576 : Scan count for each storageContext= : 1,
2017-05-15 14:59:34,259 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] service.QueryService:359 : Stats of SQL response: isException= : false, duration: 102, total scan count 1
2017-05-15 14:59:34,260 INFO  [Query 54ac350a-7676-4601-a109-04de= c543e56e-108] service.QueryService:284 : 
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D[QUERY]=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Query Id: 54ac350a-7676-4601-a109-04dec543e56e
SQL: select sum(v) from test group by lv2_id
LIMIT 50000
User: ADMIN
Success: true
Duration: 0.102
Project: TEST
Realization Names: [TEST1ROWS1DIM]
Cuboid Ids: [1]
Total scan count: 1
Result row count: 1
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Message: null
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D[QUERY]=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D





cube definition:

{
  "uuid": "4a7fae8c-b518-4828-8f75-e6cd04689d75&qu= ot;,
  "last_modified": 1494573765171,
  "version": "1.6.0",
  "name": "TEST1ROWS1DIM",
  "model_name": "TEST",
  "description": "",
  "null_string": null,
  "dimensions": [
    {
      "name": "LV2_ID",
      "table": "DEFAULT.TEST",
      "column": "LV2_ID",
      "derived": null
    },
    {
      "name": "MINUTE_START",
      "table": "DEFAULT.TEST",
      "column": "MINUTE_START",
      "derived": null
    }
  ],
  "measures": [
    {
      "name": "_COUNT_",
      "function": {
        "expression": "COUNT",=
        "parameter": {
          "type": "constant&qu= ot;,
          "value": "1",
          "next_parameter": null
        },
        "returntype": "bigint"=
      },
      "dependent_measure_ref": null
    },
    {
      "name": "COUNT",
      "function": {
        "expression": "SUM",
        "parameter": {
          "type": "column"= ;,
          "value": "V",
          "next_parameter": null
        },
        "returntype": "bigint"=
      },
      "dependent_measure_ref": null
    },
    {
      "name": "TOP10",
      "function": {
        "expression": "TOP_N",=
        "parameter": {
          "type": "column"= ;,
          "value": "V",
          "next_parameter": {
            "type": "colu= mn",
            "value": "LV2= _ID",
            "next_parameter": = null
          }
        },
        "returntype": "topn(10)&quo= t;,
        "configuration": {
          "topn.encoding.LV2_ID": &= quot;dict"
        }
      },
      "dependent_measure_ref": null
    }
  ],
  "dictionaries": [],
  "rowkey": {
    "rowkey_columns": [
      {
        "column": "MINUTE_START&quo= t;,
        "encoding": "time",
        "isShardBy": false
      },
      {
        "column": "LV2_ID",
        "encoding": "dict",
        "isShardBy": false
      }
    ]
  },
  "hbase_mapping": {
    "column_family": [
      {
        "name": "F1",
        "columns": [
          {
            "qualifier": "= ;M",
            "measure_refs": [<= /div>
              "_COUNT_",<= /div>
              "COUNT",
              "TOP10"
            ]
          }
        ]
      }
    ]
  },
  "aggregation_groups": [
    {
      "includes": [
        "MINUTE_START",
        "LV2_ID"
      ],
      "select_rule": {
        "hierarchy_dims": [],
        "mandatory_dims": [],
        "joint_dims": []
      }
    }
  ],
  "signature": "o75jqwjKPFF859EyU6j9eA=3D=3D"= ,
  "notify_list": [],
  "status_need_notify": [
    "ERROR",
    "DISCARDED",
    "SUCCEED"
  ],
  "partition_date_start": 0,
  "partition_date_end": 3153600000000,
  "auto_merge_time_ranges": [
    604800000,
    2419200000
  ],
  "retention_range": 0,
  "engine_type": 2,
  "storage_type": 2,
  "override_kylin_properties": {}
}




From: ShaoFeng Shi <shao= fengshi@apache.org>
Sent: Monday, May 15, 2017 1:37 AM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN
 
Hi Tingmao,

Your finding is correct; Kylin didn't check whether there is "ord= er by limit" pattern or not. The main reason is, the SQL parser didn't= push down the "order by limit" info here (at least when TopN was= developed), so I couldn't leverage that.  Another reason is usually we won't put the "group and order" column as a normal= cube dimension (but Kylin didn't stop you from doing that). 

Please feel free to open a JIRA on the checking "order by limit&q= uot; pattern, I can check that later. 

  

2017-05-15 2:37 GMT+08:00 Tingmao Lin <Z089@msn.= com>:

Yeah, we looked into the TopNCounter source code and found tha= t for low cardinality scenario, in TopNCounter, m1=3Dm2=3D0 so&n= bsp;it just sum up the values. But the result still goes wrong. I will collect more information for investigation&= nbsp;ASAP :)


I have a question: as Billy Liu s= aid in this thread kylin will check the ORDER BY clause to determine whether to r= ewrite. But  I didn't find any access to SQLdigest.sortColumns <= /span> in  TopNMeasureType.influenceCapabilityCheck().  Does kylin check it elsewhere ?  It seems that  if   sum(measure) = is the only measure in the query  and the group by  column matche= s, then TopNMeasureType.isTopNCompatibleSum()    will pass and the query get rewritten. This confuses the user since they may ex= pect a accurate result for every distinct value of group by column(s). = ;



From:= ShaoFeng Shi <shaofengshi@apache.org>
Sent: Sunday, May 14, 2017 2:13 PM

To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN
 
Agree with Yang's points; When cardinality is small than t= he TopN counter's capacity, the result should be accurate. I checked the 1.= 6 source code and didn't find clue. Please share more information (cube def= inition and query logs) for investigation, thanks!

For negtive number in TopN, actually that isn't recommended, as it goe= s against TopN's purpose, which is to counting something happened. When mer= ging two TopN counters, one counter will use another's last element's numbe= r to accumulate (if another is full) on its elements (as a supplement). If the last element is close to 0, its = impaction will be minor. But if the last element is a big negative value, y= ou know it's impaction will be considerable!  It doesn't make sense to= reduce existing element's counting value if the last element's value is negative. So please use it properly in your= scenario. Ofcourse, I think Kylin should also add more checks there.

2017-05-14 17:18 GMT+08:00 Tingmao Lin <Z089@msn= .com>:

The SQL in the original email is exactly what we input in the "= ;insight" tab in kylin admin UI.

I do not have access to the host running kylin now ,and I will post the = detailed log output tomorrow.


We reproduced the inaccurate result behavior using a source table w= ith <10 rows and  I should be able to write a reproduce step tomorr= ow.



From: Billy Liu <billyliu@apache.org>
Sent: Sunday, May 14, 2017 12:21 AM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN
 
Thanks Tingmao for the report.

Could you show us the complete SQL? In your SQL, there is no order by = statement. If no ORDER BY, the query should not be rewritten into TopN meas= ure. 

2017-05-12 23:52 GMT+08:00 Tingmao Lin <Z089@msn= .com>:

Hi,

We found that SUM() query on a cardinality 1 dimension is not accurate (or "not correct") when automatically  rewritt= en as TOPN.
Is that the expected behavior of kylin or there are any other issue?

We built a cube on a table ( measure1: bigint, dim1_id:varchar, dim2_i= d:varchar, ... ) using kylin 1.6.0 (Kafka streaming source)

The cube has two measures: SUM(measure1) and TOPN(10,sum-orderby(measure1),= group by dim2_id) . (other measures omitted)
and two dimensions  dim1_id, dim2_id   (other dims omitted)<= /div>

About the source table data:  
The cardinality of dim1_id  is 1 (same dim1_id for all rows in the source table<= /span>)=
The cardinality of dim2_id  is 1 (same dim2_id for all rows in th= e source table)
The possible value of measure1 is [1,0,-1]
 
When we query
    "select SUM(measure1) FROM table GROUP BY dim2= _id"                   &n= bsp;=3D>     the result has one row:"sum=3D7",
      from the kylin logs we found that the query has b= een automatically  rewritten as TOPN(measure1,sum-orderby(measur= e1),group by dim2_id)

When we write another query to prevent TOPN rewrite, for example: 

&nbs= p;  "select SUM(measure1),count(*) FROM table GROUP BY dim2_id"     =3D>   one = row -- "sum=3D-2,count=3D24576"

&nbs= p;   one row -- "sum=3D-2,count=3D24576"


The = result is different (7 and -2) when rewritting to TOPN or not.<= /p>


My&n= bsp;question is: are the following behavior "works as expect= ed" ,or TOPN algorithm does not support negative counter values very well , or any issue the= re?


1. SUM() query  automatically rewritten as TOPN and gives = ;approximated result when no TOPN present in the query.

2. When cardinality is 1, TOPN does not give accurate result.




Thanks.







--
Best regards,

Shaofeng Shi 史少锋




--
Best regards,

Shaofeng Shi 史少锋

--_000_BN3PR16MB0868A1658F4802909D9F24268EE10BN3PR16MB0868namp_--