kylin-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From hongbin ma <mahong...@apache.org>
Subject Re: Re: Scan row count exceeded threshold
Date Wed, 03 Aug 2016 04:42:42 GMT
Hi Tiansheng and Tianshuo

Kylin is basically limiting the footprint on storage visit, the size of
each row is inversely proportional to the number of rows to be read. The
cuboid row size grows when there's a distinct count measure, so you're
observing  threshold being 49121, this is normal.

The unnormal part is kylins behavior when there's a limit clause,
especially for cases like Tianshuo's case, where query being:

select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by
FCRASH_ANALYSIS_ID,FCRASHTIME limit N

The query does not have any filters, so we should be able to read the first
N rows from cuboid (FCRASH_ANALYSIS_ID,FCRASHTIME) and return the result to
users. Yang Li tried to fix the issue in
https://issues.apache.org/jira/browse/KYLIN-1787, however the approach was
still a little bit too conservative to me. The patch in KYLIN-1787 would
not enable the storage read limit as long as the cube has a partition time
column (and meanwhile the query is not grouping by the partition time
column), because we'll need to further aggregate rows from different
segments. This is why 1.5.3 does not behave as Tianshuo expect.

However there's still room for improvement even if further aggregation is
required across multiple segments. For tianshuo's case, we can ask for N
cuboid row from each segment, and merge them at query server side. Since
the cuboid rows are respectively sorted in each segment, it
is guaranteed that the result is correct

However it's a different story if the query contains filters, like in
Tiansheng's case. Filter on dimensions may prevent limit clause put down,
especially when the dimension is not the first dimension in row key. Below
is Tiansheng's case:

Error while executing SQL "select
"DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as
clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as
cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where
COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by
"DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10
offset 0": Scan row count exceeded threshold: 49121, please add filter
condition to narrow down backend scan range, like where clause.

I have opened a JIRA to fix this

On Wed, Aug 3, 2016 at 10:25 AM, 张天生 <zhtsh.lichao@gmail.com> wrote:

> Hi ShaoFeng:
>
> My sql is "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as
> imps,SUM(CLKS) as clks,SUM(CONS) as cons,
> (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as cost,COUNT(DISTINCT
> CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where COMPANYID='296' and
> "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by
> "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10
> offset 0". I analyzed the critical reason is "COUNT(DISTINCT
> CLK_DEVICEID) as clk_uv", when i delete this measure, it queried fine even
> for a long time span query.
>
> 赵天烁 <zhaotianshuo@meizu.com>于2016年8月3日周三 上午10:16写道:
>
>> ok,I'll try to make it configurable in kylin.properties
>>
>> ------------------------------
>>
>> 赵天烁
>>
>> Kevin Zhao
>>
>> *zhaotianshuo@meizu.com <zhaotianshuo@meizu.com>*
>>
>>
>>
>> 珠海市魅族科技有限公司
>>
>> MEIZU Technology Co., Ltd.
>>
>> 广东省珠海市科技创新海岸魅族科技楼
>>
>> MEIZU Tech Bldg., Technology & Innovation Coast
>>
>> Zhuhai, 519085, Guangdong, China
>>
>> meizu.com
>>
>>
>> *From:* ShaoFeng Shi <shaofengshi@apache.org>
>> *Date:* 2016-08-02 22:37
>> *To:* user <user@kylin.apache.org>
>> *Subject:* Re: Re: Scan row count exceeded threshold
>>
>> This is a new change introduced in 1.5.3, JIRA is:
>> https://issues.apache.org/jira/browse/KYLIN-1787
>>
>> In a short, it tries to avoid scanning too many rows (1 million) in one
>> query. But the threshold wasn't externalized as a parameter. @Tianshuo,
>> would you mind to contribute a patch?
>>
>> Besides, the query has "limit 1" but Kylin didn't smartly aware that.
>>
>> @Tian sheng, what's the SQL query looks like in your case?
>>
>> 2016-08-02 21:14 GMT+08:00 张天生 <zhtsh.lichao@gmail.com>:
>>
>>> I meet the same question. In 1.5.2.1 version it worked fine, but
>>> switched 1.5.3 it ofter report this error.
>>>
>>> 赵天烁 <zhaotianshuo@meizu.com>于2016年8月2日周二 上午11:57写道:
>>>
>>>> yes, it is 1.5.3,after limit the date range to a week,it still cost
>>>> about 7 second to response, is there any way to get the query execute
>>>> detail just like the diagnosis log?
>>>>
>>>> ------------------------------
>>>>
>>>> 赵天烁
>>>>
>>>> Kevin Zhao
>>>>
>>>> *zhaotianshuo@meizu.com <zhaotianshuo@meizu.com>*
>>>>
>>>>
>>>>
>>>> 珠海市魅族科技有限公司
>>>>
>>>> MEIZU Technology Co., Ltd.
>>>>
>>>> 广东省珠海市科技创新海岸魅族科技楼
>>>>
>>>> MEIZU Tech Bldg., Technology & Innovation Coast
>>>>
>>>> Zhuhai, 519085, Guangdong, China
>>>>
>>>> meizu.com
>>>>
>>>>
>>>> *From:* ShaoFeng Shi <shaofengshi@apache.org>
>>>> *Date:* 2016-08-01 20:20
>>>> *To:* user <user@kylin.apache.org>
>>>> *Subject:* Re: Scan row count exceeded threshold
>>>>
>>>> are you on 1.5.3? If not, suggest to upgrade and then retry it.
>>>> Besides, you can add some "where" contition to reduce the records to scan,
>>>> as the error message mentioned.
>>>>
>>>> 2016-08-01 15:37 GMT+08:00 赵天烁 <zhaotianshuo@meizu.com>:
>>>>
>>>>> recently I got the following error while execute query on a cube which
>>>>> is not that big( about 400mb, 20milion record)
>>>>> ==================
>>>>> Error while executing SQL "select FCRASHTIME,count(1) from
>>>>> UXIP.EDL_FDT_OUC_UPLOAD_FILES group by FCRASH_ANALYSIS_ID,FCRASHTIME
limit
>>>>> 1": Scan row count exceeded threshold: 10000000, please add filter
>>>>> condition to narrow down backend scan range, like where clause.
>>>>>
>>>>> I guess what  it scan were the intermediate result, but It doesn't any
>>>>> order by,also the result count is limit to just 1.so it could scan to
>>>>> find any record with those two dimension and wala.
>>>>> waiting kylin expert to give me some detail on that.
>>>>> btw how to configure that threshold, any properties in
>>>>> kylin.properties?
>>>>> ------------------------------
>>>>>
>>>>> 赵天烁
>>>>>
>>>>> Kevin Zhao
>>>>>
>>>>> *zhaotianshuo@meizu.com <zhaotianshuo@meizu.com>*
>>>>>
>>>>>
>>>>>
>>>>> 珠海市魅族科技有限公司
>>>>>
>>>>> MEIZU Technology Co., Ltd.
>>>>>
>>>>> 广东省珠海市科技创新海岸魅族科技楼
>>>>>
>>>>> MEIZU Tech Bldg., Technology & Innovation Coast
>>>>>
>>>>> Zhuhai, 519085, Guangdong, China
>>>>>
>>>>> meizu.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Shaofeng Shi
>>>>
>>>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi
>>
>>


-- 
Regards,

*Bin Mahone | 马洪宾*

Mime
View raw message