kylin-user mailing list archives

From ShaoFeng Shi <shaofeng...@apache.org>
Subject Re: Re: Scan row count exceeded threshold
Date Mon, 08 Aug 2016 10:25:03 GMT
Cool, you got it :-)

2016-08-08 11:30 GMT+08:00 赵天烁 <zhaotianshuo@meizu.com>:

> Already solved by changing those two UHC date dims into yyMMdd-formatted
> integers (trimming the HHmmss part), which dramatically reduced the cardinality,
> and then putting the mandatory filter dim at the beginning of the row key
> (sketch below). Thanks.
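>
> A minimal sketch of the conversion, assuming the source is a Hive table; the
> table and column names below are hypothetical:
>
> -- derive a low-cardinality yyMMdd integer day key from a full timestamp column
> SELECT CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(EVENT_TIME), 'yyMMdd') AS INT) AS EVENT_DAY
> FROM SOME_FACT_TABLE;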
>
> ------------------------------
>
> 赵天烁
>
> Kevin Zhao
>
> *zhaotianshuo@meizu.com <zhaotianshuo@meizu.com>*
>
>
>
> 珠海市魅族科技有限公司
>
> MEIZU Technology Co., Ltd.
>
> 广东省珠海市科技创新海岸魅族科技楼
>
> MEIZU Tech Bldg., Technology & Innovation Coast
>
> Zhuhai, 519085, Guangdong, China
>
> meizu.com
>
>
> *From:* ShaoFeng Shi <shaofengshi@apache.org>
> *Date:* 2016-08-08 11:23
> *To:* user <user@kylin.apache.org>
> *Subject:* Re: Re: Scan row count exceeded threshold
> "two of then cardinality are bigger than 100milion"
>
> ==> You must optimize your cube when it has UHC dimensions; otherwise its
> expansion rate may be very high.
>
> 2016-08-06 12:20 GMT+08:00 张天生 <zhtsh.lichao@gmail.com>:
>
>> I also need to adjust the threshold to a bigger value. We set the JVM -Xmx to
>> 32G, but it was far from fully used.
>>
>> 赵天烁 <zhaotianshuo@meizu.com> wrote on Thu, Aug 4, 2016 at 11:54 AM:
>>
>>> I dug into the code and found that this threshold is determined in the
>>> CubeStorageQuery.setThreshold method. There is already a property called
>>> kylin.query.mem.budget, which defines how much memory the current JVM may
>>> use for a query (default 3G); dividing that budget by the estimated size of a
>>> single row gives the threshold. So I think setting kylin.query.mem.budget to a
>>> higher value will do.
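>>>
>>> For example (a sketch only; I am assuming the value is given in bytes, in line
>>> with the 3G default), doubling the budget in kylin.properties should roughly
>>> double the row threshold:
>>>
>>> kylin.query.mem.budget=6442450944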
>>>
>>> BTW, my cube is growing like hell... The biggest one is almost 1.7TB (keeping
>>> only one month of data). The source is around 200 million records, with 9
>>> dimensions, two of them with cardinality over 100 million. We already have
>>> performance issues; maybe you Kylin experts could provide some optimization
>>> suggestions? Waiting...
>>>
>>> ------------------------------
>>>
>>> 赵天烁
>>>
>>> Kevin Zhao
>>>
>>> *zhaotianshuo@meizu.com <zhaotianshuo@meizu.com>*
>>>
>>>
>>>
>>> 珠海市魅族科技有限公司
>>>
>>> MEIZU Technology Co., Ltd.
>>>
>>> 广东省珠海市科技创新海岸魅族科技楼
>>>
>>> MEIZU Tech Bldg., Technology & Innovation Coast
>>>
>>> Zhuhai, 519085, Guangdong, China
>>>
>>> meizu.com
>>>
>>>
>>> *From:* hongbin ma <mahongbin@apache.org>
>>> *Date:* 2016-08-03 12:50
>>> *To:* user.kylin <user@kylin.apache.org>
>>>
>>> *Subject:* Re: Re: Scan row count exceeded threshold
>>>
>>> Hi Tiansheng and Tianshuo,
>>>
>>> Kylin is basically limiting the footprint of each storage visit, so the size
>>> of each row is inversely proportional to the number of rows allowed to be
>>> read. The cuboid row size grows when there is a distinct count measure, so the
>>> threshold of 49121 you are observing is normal.
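>>>
>>> As a rough worked example (the per-row size here is an assumption, not a
>>> number from your logs): with the default 3 GB memory budget and rows estimated
>>> at about 64 KB each because of the distinct count measure, the threshold works
>>> out to 3 * 1024^3 / 65536 = 49152 rows, which is in line with the 49121
>>> reported.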
>>>
>>> The abnormal part is Kylin's behavior when there is a limit clause,
>>> especially for cases like Tianshuo's, where the query is:
>>>
>>> select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by
>>> FCRASH_ANALYSIS_ID,FCRASHTIME limit N
>>>
>>> The query does not have any filters, so we should be able to read the
>>> first N rows from the cuboid (FCRASH_ANALYSIS_ID,FCRASHTIME) and return the
>>> result to users. Yang Li tried to fix the issue in
>>> https://issues.apache.org/jira/browse/KYLIN-1787; however, the approach is
>>> still a little too conservative in my opinion. The patch in KYLIN-1787 does
>>> not enable the storage read limit as long as the cube has a partition time
>>> column (and the query is not grouping by that partition time column), because
>>> we then need to further aggregate rows from different segments. This is why
>>> 1.5.3 does not behave as Tianshuo expects.
>>>
>>> However, there is still room for improvement even when further aggregation
>>> is required across multiple segments. For Tianshuo's case, we can ask for N
>>> cuboid rows from each segment and merge them on the query server side. Since
>>> the cuboid rows are sorted within each segment, the result is guaranteed to
>>> be correct.
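>>>
>>> A small worked example of that idea (the segment count and N are made up):
>>> with limit 10 and three segments, each segment returns its first 10 sorted
>>> cuboid rows; the query server merge-sorts the 30 rows by key, aggregates rows
>>> that share a key across segments, and returns the first 10 keys.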
>>>
>>> However, it is a different story if the query contains filters, as in
>>> Tiansheng's case. A filter on a dimension may prevent the limit clause from
>>> being pushed down, especially when that dimension is not the first dimension
>>> in the row key. Below is Tiansheng's case:
>>>
>>> Error while executing SQL "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS)
>>> as imps,SUM(CLKS) as clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE))
>>> as cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN
>>> where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05'
>>> group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc
>>> limit 10 offset 0": Scan row count exceeded threshold: 49121, please add
>>> filter condition to narrow down backend scan range, like where clause.
>>>
>>> I have opened a JIRA to fix this.
>>>
>>> On Wed, Aug 3, 2016 at 10:25 AM, 张天生 <zhtsh.lichao@gmail.com> wrote:
>>>
>>>> Hi ShaoFeng:
>>>>
>>>> My sql is "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS)
>>>> as imps,SUM(CLKS) as clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE))
>>>> as cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN
>>>> where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05'
>>>> group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps
>>>> desc limit 10 offset 0". My analysis is that the critical factor is
>>>> "COUNT(DISTINCT CLK_DEVICEID) as clk_uv"; when I remove this measure, the
>>>> query runs fine even over a long time span.
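>>>>
>>>> In other words, this variant of the same query (only the distinct-count
>>>> measure removed) runs fine:
>>>>
>>>> select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,
>>>>   SUM(CLKS) as clks,SUM(CONS) as cons,
>>>>   (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as cost
>>>> from EXT_MID_EVENT_JOIN
>>>> where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05'
>>>> group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID
>>>> order by imps desc limit 10 offset 0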
>>>>
>>>> 赵天烁 <zhaotianshuo@meizu.com> wrote on Wed, Aug 3, 2016 at 10:16 AM:
>>>>
>>>>> OK, I'll try to make it configurable in kylin.properties.
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> 赵天烁
>>>>>
>>>>> Kevin Zhao
>>>>>
>>>>> *zhaotianshuo@meizu.com <zhaotianshuo@meizu.com>*
>>>>>
>>>>>
>>>>>
>>>>> 珠海市魅族科技有限公司
>>>>>
>>>>> MEIZU Technology Co., Ltd.
>>>>>
>>>>> 广东省珠海市科技创新海岸魅族科技楼
>>>>>
>>>>> MEIZU Tech Bldg., Technology & Innovation Coast
>>>>>
>>>>> Zhuhai, 519085, Guangdong, China
>>>>>
>>>>> meizu.com
>>>>>
>>>>>
>>>>> *From:* ShaoFeng Shi <shaofengshi@apache.org>
>>>>> *Date:* 2016-08-02 22:37
>>>>> *To:* user <user@kylin.apache.org>
>>>>> *Subject:* Re: Re: Scan row count exceeded threshold
>>>>>
>>>>> This is a new change introduced in 1.5.3; the JIRA is:
>>>>> https://issues.apache.org/jira/browse/KYLIN-1787
>>>>>
>>>>> In short, it tries to avoid scanning too many rows (1 million) in one
>>>>> query. But the threshold wasn't externalized as a parameter. @Tianshuo,
>>>>> would you mind contributing a patch?
>>>>>
>>>>> Besides, the query has "limit 1", but Kylin wasn't smart enough to take
>>>>> advantage of that.
>>>>>
>>>>> @Tiansheng, what does the SQL query look like in your case?
>>>>>
>>>>> 2016-08-02 21:14 GMT+08:00 张天生 <zhtsh.lichao@gmail.com>:
>>>>>
>>>>>> I met the same problem. In version 1.5.2.1 it worked fine, but after
>>>>>> switching to 1.5.3 it often reports this error.
>>>>>>
>>>>>> 赵天烁 <zhaotianshuo@meizu.com> wrote on Tue, Aug 2, 2016 at 11:57 AM:
>>>>>>
>>>>>>> Yes, it is 1.5.3. After limiting the date range to a week, it still takes
>>>>>>> about 7 seconds to respond. Is there any way to get the query execution
>>>>>>> details, just like the diagnosis log?
>>>>>>>
>>>>>>> ------------------------------
>>>>>>>
>>>>>>> 赵天烁
>>>>>>>
>>>>>>> Kevin Zhao
>>>>>>>
>>>>>>> *zhaotianshuo@meizu.com <zhaotianshuo@meizu.com>*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 珠海市魅族科技有限公司
>>>>>>>
>>>>>>> MEIZU Technology Co., Ltd.
>>>>>>>
>>>>>>> 广东省珠海市科技创新海岸魅族科技楼
>>>>>>>
>>>>>>> MEIZU Tech Bldg., Technology & Innovation Coast
>>>>>>>
>>>>>>> Zhuhai, 519085, Guangdong, China
>>>>>>>
>>>>>>> meizu.com
>>>>>>>
>>>>>>>
>>>>>>> *From:* ShaoFeng Shi <shaofengshi@apache.org>
>>>>>>> *Date:* 2016-08-01 20:20
>>>>>>> *To:* user <user@kylin.apache.org>
>>>>>>> *Subject:* Re: Scan row count exceeded threshold
>>>>>>>
>>>>>>> Are you on 1.5.3? If not, I suggest upgrading and then retrying it.
>>>>>>> Besides, you can add some "where" conditions to reduce the records to
>>>>>>> scan, as the error message mentioned.
>>>>>>>
>>>>>>> 2016-08-01 15:37 GMT+08:00 赵天烁 <zhaotianshuo@meizu.com>:
>>>>>>>
>>>>>>>> Recently I got the following error while executing a query on a cube
>>>>>>>> which is not that big (about 400MB, 20 million records):
>>>>>>>> ==================
>>>>>>>> Error while executing SQL "select FCRASHTIME,count(1) from
>>>>>>>> UXIP.EDL_FDT_OUC_UPLOAD_FILES group by FCRASH_ANALYSIS_ID,FCRASHTIME
>>>>>>>> limit 1": Scan row count exceeded threshold: 10000000, please add
>>>>>>>> filter condition to narrow down backend scan range, like where clause.
>>>>>>>>
>>>>>>>> I guess what it scanned was the intermediate result, but the query does
>>>>>>>> not have any order by, and the result count is limited to just 1, so it
>>>>>>>> should only need to scan until it finds any record with those two
>>>>>>>> dimensions, and voila. Waiting for a Kylin expert to give me some
>>>>>>>> details on that.
>>>>>>>> BTW, how can I configure that threshold? Is there a property for it in
>>>>>>>> kylin.properties?
>>>>>>>> ------------------------------
>>>>>>>>
>>>>>>>> 赵天烁
>>>>>>>>
>>>>>>>> Kevin Zhao
>>>>>>>>
>>>>>>>> *zhaotianshuo@meizu.com <zhaotianshuo@meizu.com>*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 珠海市魅族科技有限公司
>>>>>>>>
>>>>>>>> MEIZU Technology Co., Ltd.
>>>>>>>>
>>>>>>>> 广东省珠海市科技创新海岸魅族科技楼
>>>>>>>>
>>>>>>>> MEIZU Tech Bldg., Technology & Innovation Coast
>>>>>>>>
>>>>>>>> Zhuhai, 519085, Guangdong, China
>>>>>>>>
>>>>>>>> meizu.com
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Shaofeng Shi
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>>
>>>>> Shaofeng Shi
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> *Bin Mahone | 马洪宾*
>>>
>>>
>
>
> --
> Best regards,
>
> Shaofeng Shi
>
>


-- 
Best regards,

Shaofeng Shi
