kylin-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tingmao Lin <Z...@msn.com>
Subject Re: Questions about SUM behavior when rewritten as TOPN
Date Mon, 15 May 2017 11:22:10 GMT
Yes, the serializer just don't work with negative counters. That explains why the result is
inaccurate and it's not a bug.


About the SUM auto-rewrite behavior,

I have created jira: https://issues.apache.org/jira/browse/KYLIN-2620

Thank Shi ShaoFeng, Li Yang and Billy Liu for help resolving the questions.

________________________________
From: ShaoFeng Shi <shaofengshi@apache.org>
Sent: Monday, May 15, 2017 10:36 AM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

Ok then it explains.

Top N algorithm is based on the assumption that your data follows Zipf distribution (https://en.wikipedia.org/wiki/Zipf%27s_law).
If not, the result can be very inaccurate. Your sample is such a case, with only 1 key and
negative counters, the error be accumulated and finally see a variation from the real value,
which is not bug I think. Please use SUM measure, instead of abusing TopN.
[https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/Zipf_distribution_PMF.png/325px-Zipf_distribution_PMF.png]<https://en.wikipedia.org/wiki/Zipf%27s_law>

Zipf's law - Wikipedia<https://en.wikipedia.org/wiki/Zipf%27s_law>
en.wikipedia.org
Zipf's law (/ ˈ z ɪ f /) is an empirical law formulated using mathematical statistics that
refers to the fact that many types of data studied in the physical and ...



2017-05-15 16:55 GMT+08:00 Tingmao Lin <Z089@msn.com<mailto:Z089@msn.com>>:

We found something about serialization precision that might be related to the inaccurate result
problem.


the TopNCounterSerializer uses DoubleDeltaSerializer to serialize double array.  The DoubleDeltaSerializer
will write double values as  int(d*1000+0.5)/1000 so for each serialization process, negative
integer value increases by 0.001.


Test case:    (slightly modified from TopNCounterSerializerTest.testSerialization)


@Test
public void testNegativeSerialization() {
    TopNCounter<ByteArray> vs = new TopNCounter<ByteArray>(50);
    Integer[] stream = {1};
    double[] incValues = {-1};
    for (int counter = 0; counter < stream.length; ++counter) {
        Integer i = stream[counter]; double v = incValues[counter];
        vs.offer(new ByteArray(Bytes.toBytes(i)), v);
    }
    vs.sortAndRetain();
    ByteBuffer out = ByteBuffer.allocate(1024);
    serializer.serialize(vs, out);

    byte[] copyBytes = new byte[out.position()];
    System.arraycopy(out.array(), 0, copyBytes, 0, out.position());

    ByteBuffer in = ByteBuffer.wrap(copyBytes);
    TopNCounter<ByteArray> vsNew = serializer.deserialize(in);

    Assert.assertEquals(vs.toString(), vsNew.toString());
}


and result is

org.junit.ComparisonFailure:
Expected :[\x00\x00\x00\x01:-1.0]
Actual   :[\x00\x00\x00\x01:-0.999]



so if we set type of measure1 as double, and build the cube using a single line {measure1
= -1.0}
the "select sum(measure1) from table group by dim2_id"  will return  "-0.994" instead of "-1"


________________________________
From: Tingmao Lin <Z089@msn.com<mailto:Z089@msn.com>>
Sent: Monday, May 15, 2017 8:44 AM
To: user@kylin.apache.org<mailto:user@kylin.apache.org>

Subject: Re: Questions about SUM behavior when rewritten as TOPN


The query log and the cube definition is here:


Query without rewrite
2017-05-15 14:59:50,741 INFO  [pool-8-thread-1] threadpool.DefaultScheduler:118 : Job Fetcher:
0 should running, 0 actual running, 0 ready, 577 already succeed, 19 error, 11 discarded,
0 others
2017-05-15 14:59:51,637 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:336
: Using project: TEST
2017-05-15 14:59:51,638 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:337
: The original query:  select sum(v),count(v) from test group by lv2_id
2017-05-15 14:59:51,638 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:440
: The corrected query: select sum(v),count(v) from test group by lv2_id
LIMIT 50000
2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:48
: The project manager's reference is org.apache.kylin.metadata.project.ProjectManager@ac216f8
2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:60
: Find candidates by table DEFAULT.TEST and project=TEST : CUBE[name=TEST1ROWS1DIM]
2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51
: Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule,
realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51
: Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,675 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] rules.RealizationSortRule:40
: CUBE[name=TEST1ROWS1DIM] priority 1 cost 73.
2017-05-15 14:59:51,675 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51
: Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations
before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,675 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:72
: The realizations remaining: [TEST1ROWS1DIM(CUBE)] And the final chosen one is the first
one
2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:105
: Cuboid identified: cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[DEFAULT.TEST.LV2_ID], otherDimsD=[]
2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:230
: GroupD :[DEFAULT.TEST.LV2_ID]
2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:231
: SingleValueD :[]
2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:232
: Cuboid columns :[DEFAULT.TEST.LV2_ID]
2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:238
: Does not need storage aggregation
2017-05-15 14:59:51,708 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] storage.StorageContext:134
: Not enabling limit push down because the limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max
10000
2017-05-15 14:59:51,708 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.CubeSegmentScanner:57
: Init CubeSegmentScanner for segment 29542_29543
2017-05-15 14:59:51,763 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseEndpointRPC:139
: Serialized scanRequestBytes 311 bytes, rawScanBytesString 40 bytes
2017-05-15 14:59:51,764 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseEndpointRPC:141
: The scan 6fdd1fd2 for segment TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw
scans, shard part of start/end key is set to 0
2017-05-15 14:59:51,764 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseRPC:278
: Visiting hbase table KYLIN_EBM57UYZMQ: cuboid exact match, from 1 to 1 Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00
(\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop:  \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00
(\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
2017-05-15 14:59:51,768 INFO  [kylin-coproc--pool14-t14] v2.CubeHBaseEndpointRPC:200 : <sub-thread
for Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20 GTScanRequest 6fdd1fd2>Endpoint RPC returned
from HTable KYLIN_EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E
on host: W168PC03.Total scanned row: 1. Total filtered/aggred row: 0. Time elapsed in EP:
1(ms). Server CPU usage: 0.04998115611708333, server physical mem left: 8.621008896E10, server
swap mem left:0.0.Etc message: start latency: 58@0,agg done@1,compress done@1,server stats
done@1, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;.Normal Complete: true.
2017-05-15 14:59:51,771 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:576
: Scan count for each storageContext: 1,
2017-05-15 14:59:51,771 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:359
: Stats of SQL response: isException: false, duration: 133, total scan count 1
2017-05-15 14:59:51,771 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:284
:
==========================[QUERY]===============================
Query Id: cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20
SQL: select sum(v),count(v) from test group by lv2_id
LIMIT 50000
User: ADMIN
Success: true
Duration: 0.133
Project: TEST
Realization Names: [TEST1ROWS1DIM]
Cuboid Ids: [1]
Total scan count: 1
Result row count: 1
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Message: null
==========================[QUERY]===============================





Query with rewrite:
2017-05-15 14:59:34,156 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:336
: Using project: TEST
2017-05-15 14:59:34,157 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:337
: The original query:  select sum(v) from test group by lv2_id
2017-05-15 14:59:34,157 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:440
: The corrected query: select sum(v) from test group by lv2_id
LIMIT 50000
2017-05-15 14:59:34,185 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:48
: The project manager's reference is org.apache.kylin.metadata.project.ProjectManager@ac216f8
2017-05-15 14:59:34,185 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:60
: Find candidates by table DEFAULT.TEST and project=TEST : CUBE[name=TEST1ROWS1DIM]
2017-05-15 14:59:34,185 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51
: Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule,
realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] cube.CubeCapabilityChecker:191
: Cube CUBE[name=TEST1ROWS1DIM] CapabilityInfluences: TOP10@class org.apache.kylin.measure.topn.TopNMeasureType
2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51
: Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] rules.RealizationSortRule:40
: CUBE[name=TEST1ROWS1DIM] priority 1 cost 21.
2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51
: Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations
before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:72
: The realizations remaining: [TEST1ROWS1DIM(CUBE)] And the final chosen one is the first
one
2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] topn.TopNMeasureType:338
: Rewrite function FunctionDesc [expression=SUM, parameter=ParameterDesc [type=column, value=V,
nextParam=null], returnType=bigint] to FunctionDesc [expression=TOP_N, parameter=ParameterDesc
[type=column, value=V, nextParam=ParameterDesc [type=column, value=LV2_ID, nextParam=null]],
returnType=topn(10)]
2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:105
: Cuboid identified: cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[], otherDimsD=[]
2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:230
: GroupD :[]
2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:231
: SingleValueD :[]
2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:232
: Cuboid columns :[DEFAULT.TEST.LV2_ID]
2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:241
: Need storage aggregation
2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] storage.StorageContext:134
: Not enabling limit push down because the limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max
10000
2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:334
: Memory budget is set to 805306 rows
2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.CubeSegmentScanner:57
: Init CubeSegmentScanner for segment 29542_29543
2017-05-15 14:59:34,219 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] dict.DictionaryManager:420
: DictionaryManager(1833236614) loading DictionaryInfo(loadDictObj:true) at /dict/DEFAULT.TEST/LV2_ID/7be3d6b1-0ed4-4b2d-9015-0c92b0757725.dict
2017-05-15 14:59:34,221 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseEndpointRPC:139
: Serialized scanRequestBytes 306 bytes, rawScanBytesString 40 bytes
2017-05-15 14:59:34,221 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseEndpointRPC:141
: The scan 37ca24c9 for segment TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw
scans, shard part of start/end key is set to 0
2017-05-15 14:59:34,221 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseRPC:278
: Visiting hbase table KYLIN_EBM57UYZMQ: cuboid require post aggregation, from 0 to 1 Start:
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00)
Stop:  \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00),
No Fuzzy Key
2017-05-15 14:59:34,256 INFO  [kylin-coproc--pool14-t13] v2.CubeHBaseEndpointRPC:200 : <sub-thread
for Query 54ac350a-7676-4601-a109-04dec543e56e GTScanRequest 37ca24c9>Endpoint RPC returned
from HTable KYLIN_EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E
on host: W168PC03.Total scanned row: 1. Total filtered/aggred row: 0. Time elapsed in EP:
2(ms). Server CPU usage: 0.07331964572432702, server physical mem left: 8.619999232E10, server
swap mem left:0.0.Etc message: start latency: 33@1,agg done@1,compress done@1,server stats
done@2, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;.Normal Complete: true.
2017-05-15 14:59:34,259 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:576
: Scan count for each storageContext: 1,
2017-05-15 14:59:34,259 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:359
: Stats of SQL response: isException: false, duration: 102, total scan count 1
2017-05-15 14:59:34,260 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:284
:
==========================[QUERY]===============================
Query Id: 54ac350a-7676-4601-a109-04dec543e56e
SQL: select sum(v) from test group by lv2_id
LIMIT 50000
User: ADMIN
Success: true
Duration: 0.102
Project: TEST
Realization Names: [TEST1ROWS1DIM]
Cuboid Ids: [1]
Total scan count: 1
Result row count: 1
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Message: null
==========================[QUERY]===============================





cube definition:

{
  "uuid": "4a7fae8c-b518-4828-8f75-e6cd04689d75",
  "last_modified": 1494573765171,
  "version": "1.6.0",
  "name": "TEST1ROWS1DIM",
  "model_name": "TEST",
  "description": "",
  "null_string": null,
  "dimensions": [
    {
      "name": "LV2_ID",
      "table": "DEFAULT.TEST",
      "column": "LV2_ID",
      "derived": null
    },
    {
      "name": "MINUTE_START",
      "table": "DEFAULT.TEST",
      "column": "MINUTE_START",
      "derived": null
    }
  ],
  "measures": [
    {
      "name": "_COUNT_",
      "function": {
        "expression": "COUNT",
        "parameter": {
          "type": "constant",
          "value": "1",
          "next_parameter": null
        },
        "returntype": "bigint"
      },
      "dependent_measure_ref": null
    },
    {
      "name": "COUNT",
      "function": {
        "expression": "SUM",
        "parameter": {
          "type": "column",
          "value": "V",
          "next_parameter": null
        },
        "returntype": "bigint"
      },
      "dependent_measure_ref": null
    },
    {
      "name": "TOP10",
      "function": {
        "expression": "TOP_N",
        "parameter": {
          "type": "column",
          "value": "V",
          "next_parameter": {
            "type": "column",
            "value": "LV2_ID",
            "next_parameter": null
          }
        },
        "returntype": "topn(10)",
        "configuration": {
          "topn.encoding.LV2_ID": "dict"
        }
      },
      "dependent_measure_ref": null
    }
  ],
  "dictionaries": [],
  "rowkey": {
    "rowkey_columns": [
      {
        "column": "MINUTE_START",
        "encoding": "time",
        "isShardBy": false
      },
      {
        "column": "LV2_ID",
        "encoding": "dict",
        "isShardBy": false
      }
    ]
  },
  "hbase_mapping": {
    "column_family": [
      {
        "name": "F1",
        "columns": [
          {
            "qualifier": "M",
            "measure_refs": [
              "_COUNT_",
              "COUNT",
              "TOP10"
            ]
          }
        ]
      }
    ]
  },
  "aggregation_groups": [
    {
      "includes": [
        "MINUTE_START",
        "LV2_ID"
      ],
      "select_rule": {
        "hierarchy_dims": [],
        "mandatory_dims": [],
        "joint_dims": []
      }
    }
  ],
  "signature": "o75jqwjKPFF859EyU6j9eA==",
  "notify_list": [],
  "status_need_notify": [
    "ERROR",
    "DISCARDED",
    "SUCCEED"
  ],
  "partition_date_start": 0,
  "partition_date_end": 3153600000000,
  "auto_merge_time_ranges": [
    604800000,
    2419200000<tel:024%201920%200000>
  ],
  "retention_range": 0,
  "engine_type": 2,
  "storage_type": 2,
  "override_kylin_properties": {}
}



________________________________
From: ShaoFeng Shi <shaofengshi@apache.org<mailto:shaofengshi@apache.org>>
Sent: Monday, May 15, 2017 1:37 AM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

Hi Tingmao,

Your finding is correct; Kylin didn't check whether there is "order by limit" pattern or not.
The main reason is, the SQL parser didn't push down the "order by limit" info here (at least
when TopN was developed), so I couldn't leverage that.  Another reason is usually we won't
put the "group and order" column as a normal cube dimension (but Kylin didn't stop you from
doing that).

Please feel free to open a JIRA on the checking "order by limit" pattern, I can check that
later.



2017-05-15 2:37 GMT+08:00 Tingmao Lin <Z089@msn.com<mailto:Z089@msn.com>>:

Yeah, we looked into the TopNCounter source code and found that for low cardinality scenario,
in TopNCounter, m1=m2=0 so it just sum up the values. But the result still goes wrong. I will
collect more information for investigation ASAP :)


I have a question: as Billy Liu said in this thread kylin will check the ORDER BY clause to
determine whether to rewrite. But  I didn't find any access to SQLdigest.sortColumns  in 
TopNMeasureType.influenceCapabilityCheck().  Does kylin check it elsewhere ?  It seems that
 if   sum(measure) is the only measure in the query  and the group by  column matches, then
TopNMeasureType.isTopNCompatibleSum()    will pass and the query get rewritten. This confuses
the user since they may expect a accurate result for every distinct value of group by column(s).

________________________________
From: ShaoFeng Shi <shaofengshi@apache.org<mailto:shaofengshi@apache.org>>
Sent: Sunday, May 14, 2017 2:13 PM

To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

Agree with Yang's points; When cardinality is small than the TopN counter's capacity, the
result should be accurate. I checked the 1.6 source code and didn't find clue. Please share
more information (cube definition and query logs) for investigation, thanks!

For negtive number in TopN, actually that isn't recommended, as it goes against TopN's purpose,
which is to counting something happened. When merging two TopN counters, one counter will
use another's last element's number to accumulate (if another is full) on its elements (as
a supplement). If the last element is close to 0, its impaction will be minor. But if the
last element is a big negative value, you know it's impaction will be considerable!  It doesn't
make sense to reduce existing element's counting value if the last element's value is negative.
So please use it properly in your scenario. Ofcourse, I think Kylin should also add more checks
there.

2017-05-14 17:18 GMT+08:00 Tingmao Lin <Z089@msn.com<mailto:Z089@msn.com>>:

The SQL in the original email is exactly what we input in the "insight" tab in kylin admin
UI.

I do not have access to the host running kylin now ,and I will post the detailed log output
tomorrow.


We reproduced the inaccurate result behavior using a source table with <10 rows and  I
should be able to write a reproduce step tomorrow.

________________________________
From: Billy Liu <billyliu@apache.org<mailto:billyliu@apache.org>>
Sent: Sunday, May 14, 2017 12:21 AM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

Thanks Tingmao for the report.

Could you show us the complete SQL? In your SQL, there is no order by statement. If no ORDER
BY, the query should not be rewritten into TopN measure.

2017-05-12 23:52 GMT+08:00 Tingmao Lin <Z089@msn.com<mailto:Z089@msn.com>>:

Hi,

We found that SUM() query on a cardinality 1 dimension is not accurate (or "not correct")
when automatically  rewritten as TOPN.
Is that the expected behavior of kylin or there are any other issue?

We built a cube on a table ( measure1: bigint, dim1_id:varchar, dim2_id:varchar, ... ) using
kylin 1.6.0 (Kafka streaming source)

The cube has two measures: SUM(measure1) and TOPN(10,sum-orderby(measure1),group by dim2_id)
. (other measures omitted)
and two dimensions  dim1_id, dim2_id   (other dims omitted)

About the source table data:
The cardinality of dim1_id  is 1 (same dim1_id for all rows in the source table)
The cardinality of dim2_id  is 1 (same dim2_id for all rows in the source table)
The possible value of measure1 is [1,0,-1]

When we query
    "select SUM(measure1) FROM table GROUP BY dim2_id"                    =>     the result
has one row:"sum=7",
      from the kylin logs we found that the query has been automatically  rewritten as TOPN(measure1,sum-orderby(measure1),group
by dim2_id)

When we write another query to prevent TOPN rewrite, for example:

   "select SUM(measure1),count(*) FROM table GROUP BY dim2_id"     =>   one row -- "sum=-2,count=24576"

   "select SUM(measure1),count(*) FROM table"                                        =>
  one row -- "sum=-2,count=24576"


The result is different (7 and -2) when rewritting to TOPN or not.


My question is: are the following behavior "works as expected" ,or TOPN algorithm does not
support negative counter values very well , or any issue there?


1. SUM() query  automatically rewritten as TOPN and gives approximated result when no TOPN
present in the query.

2. When cardinality is 1, TOPN does not give accurate result.




Thanks.






--
Best regards,

Shaofeng Shi 史少锋




--
Best regards,

Shaofeng Shi 史少锋




--
Best regards,

Shaofeng Shi 史少锋

Mime
View raw message