kylin-user mailing list archives

From 赵天烁 <zhaotians...@meizu.com>
Subject Re: Re: Scan row count exceeded threshold
Date Thu, 04 Aug 2016 03:54:32 GMT
I dug into the code and found that this threshold is determined in the CubeStorageQuery.setThreshold method. There is already a property called kylin.query.mem.budget, which defines how much memory the current JVM may use (default 3 GB); Kylin divides this budget by the estimated size of a single row, and the result is the threshold. So I think changing kylin.query.mem.budget to a higher value will do.
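For illustration, here is roughly what that derivation amounts to (a minimal sketch under my reading of the code, not the exact Kylin source; the 64 KB per-row estimate is an assumption):

    public class ThresholdSketch {
        public static void main(String[] args) {
            long memBudgetBytes = 3L * 1024 * 1024 * 1024; // kylin.query.mem.budget, default 3 GB
            long estimatedRowSizeBytes = 64 * 1024;        // assumed estimate; grows with distinct count measures
            long scanThreshold = memBudgetBytes / estimatedRowSizeBytes;
            System.out.println("scan threshold = " + scanThreshold); // prints 49152 with these numbers
        }
    }

So, assuming the property value is given in bytes, setting for example kylin.query.mem.budget=6442450944 (6 GB) in kylin.properties should double the threshold.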

BTW, my cube is growing like crazy... the biggest one is almost 1.7 TB (keeping only one month of data). The source is at the 200 million record level, with 9 dimensions, two of which have cardinality above 100 million. We already have performance issues; maybe you Kylin experts could provide some optimization suggestions? Waiting...

________________________________
赵天烁
Kevin Zhao
zhaotianshuo@meizu.com

珠海市魅族科技有限公司
MEIZU Technology Co., Ltd.
广东省珠海市科技创新海岸魅族科技楼
MEIZU Tech Bldg., Technology & Innovation Coast
Zhuhai, 519085, Guangdong, China
meizu.com

From: hongbin ma <mahongbin@apache.org>
Date: 2016-08-03 12:50
To: user.kylin <user@kylin.apache.org>
Subject: Re: Re: Scan row count exceeded threshold
Hi Tiansheng and Tianshuo

Kylin is essentially limiting the memory footprint of a storage visit, so the number of rows allowed to be read is inversely proportional to the size of each row. Cuboid rows grow larger when there is a distinct count measure, so the threshold of 49121 you are observing is normal.
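As a back-of-envelope check (my own assumption, using the budget-divided-by-row-size relationship above): with the default 3 GB budget, a threshold of 49121 implies an estimated row size of roughly 3 * 1024^3 / 49121 ≈ 64 KB per row, which is plausible when each cuboid row carries a precise count distinct bitmap.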

The abnormal part is Kylin's behavior when there's a limit clause, especially for cases like Tianshuo's, where the query is:

select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group by FCRASH_ANALYSIS_ID,FCRASHTIME
limit N

The query does not have any filters, so we should be able to read the first N rows from cuboid (FCRASH_ANALYSIS_ID, FCRASHTIME) and return the result to users. Yang Li tried to fix the issue in https://issues.apache.org/jira/browse/KYLIN-1787; however, the approach still looks a little too conservative to me. The patch in KYLIN-1787 will not enable the storage read limit as long as the cube has a partition time column (and the query is not grouping by that partition time column), because we'll need to further aggregate rows from different segments. This is why 1.5.3 does not behave as Tianshuo expects.

However, there's still room for improvement even when further aggregation is required across multiple segments. For Tianshuo's case, we can ask for N cuboid rows from each segment and merge them on the query server side. Since the cuboid rows are sorted within each segment, the result is guaranteed to be correct.
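For illustration, a minimal sketch of that merge (my own simplification with placeholder types, not Kylin's actual classes; rows with equal keys would additionally be aggregated together, which is omitted here):

    import java.util.*;

    public class TopNMerge {
        // Merge per-segment sorted row lists, keeping only the first N rows overall.
        static List<String> mergeTopN(List<List<String>> segments, int n) {
            // Min-heap of {segmentIndex, positionInSegment}, ordered by the row it points at.
            PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> segments.get(e[0]).get(e[1])));
            for (int s = 0; s < segments.size(); s++)
                if (!segments.get(s).isEmpty()) heap.add(new int[]{s, 0});
            List<String> result = new ArrayList<>();
            while (!heap.isEmpty() && result.size() < n) {
                int[] top = heap.poll();
                result.add(segments.get(top[0]).get(top[1]));
                if (top[1] + 1 < segments.get(top[0]).size())
                    heap.add(new int[]{top[0], top[1] + 1});
            }
            return result;
        }

        public static void main(String[] args) {
            List<List<String>> segments = Arrays.asList(
                Arrays.asList("a", "c", "e"),   // segment 1, already sorted
                Arrays.asList("b", "c", "d"));  // segment 2, already sorted
            System.out.println(mergeTopN(segments, 4)); // [a, b, c, c]
        }
    }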

However, it's a different story if the query contains filters, as in Tiansheng's case. A filter on a dimension may prevent the limit clause from being pushed down, especially when that dimension is not the first dimension in the row key. Below is Tiansheng's case:

Error while executing SQL "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as
imps,SUM(CLKS) as clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as
cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where COMPANYID='296'
and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID
order by imps desc limit 10 offset 0": Scan row count exceeded threshold: 49121, please add
filter condition to narrow down backend scan range, like where clause.

I have opened a JIRA to fix this.
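To illustrate why the row key position matters (a simplified sketch of the general idea with made-up key values, not Kylin's actual scan-range code): a filter on the leading row key dimension translates into a narrow start/stop scan range, while a filter on a later dimension still forces a scan over the whole cuboid:

    import java.util.*;

    public class ScanRangeSketch {
        public static void main(String[] args) {
            // Cuboid rows keyed by the leading dimension first, e.g. COMPANYID|DATE.
            TreeMap<String, Long> cuboid = new TreeMap<>();
            cuboid.put("296|2016-01-01", 10L);
            cuboid.put("296|2016-01-02", 20L);
            cuboid.put("300|2016-01-01", 30L);

            // Filter on the leading dimension: a cheap prefix range scan, only 2 rows touched.
            SortedMap<String, Long> prefixScan = cuboid.subMap("296|", "296|\uffff");
            System.out.println(prefixScan.keySet());

            // Filter only on the trailing dimension: every row in the cuboid must be
            // examined to find the matches, so a limit clause cannot cut the scan short.
            cuboid.keySet().stream()
                .filter(k -> k.endsWith("2016-01-01"))
                .forEach(System.out::println);
        }
    }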

On Wed, Aug 3, 2016 at 10:25 AM, 张天生 <zhtsh.lichao@gmail.com> wrote:
Hi ShaoFeng:

My SQL is "select "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID,SUM(IMPS) as imps,SUM(CLKS) as clks,SUM(CONS) as cons, (SUM(IMP_CLOSINGPRICE)+SUM(CLK_CLOSINGPRICE)) as cost,COUNT(DISTINCT CLK_DEVICEID) as clk_uv from EXT_MID_EVENT_JOIN where COMPANYID='296' and "DATE">='2016-01-01' and "DATE"<'2016-01-05' group by "DATE",ADGROUPID,CAMPAIGNID,COMPANYID,APPID order by imps desc limit 10 offset 0". I analyzed it, and the critical reason is "COUNT(DISTINCT CLK_DEVICEID) as clk_uv": when I delete this measure, the query runs fine even for a long time span.

赵天烁 (Kevin Zhao) <zhaotianshuo@meizu.com> wrote on Wed, Aug 3, 2016 at 10:16 AM:
OK, I'll try to make it configurable in kylin.properties.

________________________________
赵天烁
Kevin Zhao
zhaotianshuo@meizu.com

珠海市魅族科技有限公司
MEIZU Technology Co., Ltd.
广东省珠海市科技创新海岸魅族科技楼
MEIZU Tech Bldg., Technology & Innovation Coast
Zhuhai, 519085, Guangdong, China
meizu.com

From: ShaoFeng Shi <shaofengshi@apache.org>
Date: 2016-08-02 22:37
To: user <user@kylin.apache.org>
Subject: Re: Re: Scan row count exceeded threshold
This is a new change introduced in 1.5.3; the JIRA is https://issues.apache.org/jira/browse/KYLIN-1787

In short, it tries to avoid scanning too many rows (1 million) in one query. But the threshold wasn't externalized as a parameter. @Tianshuo, would you mind contributing a patch?

Besides, the query has "limit 1", but Kylin wasn't smart enough to take that into account.

@Tiansheng, what does the SQL query look like in your case?

2016-08-02 21:14 GMT+08:00 张天生 <zhtsh.lichao@gmail.com>:
I met the same problem. In version 1.5.2.1 it worked fine, but after switching to 1.5.3 it often reports this error.

赵天烁 (Kevin Zhao) <zhaotianshuo@meizu.com> wrote on Tue, Aug 2, 2016 at 11:57 AM:
Yes, it is 1.5.3. After limiting the date range to a week, it still takes about 7 seconds to respond. Is there any way to get the query execution details, like the diagnosis log?

________________________________
赵天烁
Kevin Zhao
zhaotianshuo@meizu.com

珠海市魅族科技有限公司
MEIZU Technology Co., Ltd.
广东省珠海市科技创新海岸魅族科技楼
MEIZU Tech Bldg., Technology & Innovation Coast
Zhuhai, 519085, Guangdong, China
meizu.com

From: ShaoFeng Shi <shaofengshi@apache.org>
Date: 2016-08-01 20:20
To: user <user@kylin.apache.org>
Subject: Re: Scan row count exceeded threshold
Are you on 1.5.3? If not, I suggest upgrading and then retrying. Besides, you can add some "where" conditions to reduce the records to scan, as the error message mentioned.

2016-08-01 15:37 GMT+08:00 赵天烁 <zhaotianshuo@meizu.com>:
Recently I got the following error while executing a query on a cube which is not that big (about 400 MB, 20 million records):
==================
Error while executing SQL "select FCRASHTIME,count(1) from UXIP.EDL_FDT_OUC_UPLOAD_FILES group
by FCRASH_ANALYSIS_ID,FCRASHTIME limit 1": Scan row count exceeded threshold: 10000000, please
add filter condition to narrow down backend scan range, like where clause.

I guess what it scanned was the intermediate result, but the query doesn't have any order by, and the result count is limited to just 1, so it should only need to scan until it finds any record with those two dimensions, and voilà.
Waiting for a Kylin expert to give me some details on that.
BTW, how can I configure that threshold? Is there a property in kylin.properties?
________________________________
赵天烁
Kevin Zhao
zhaotianshuo@meizu.com

珠海市魅族科技有限公司
MEIZU Technology Co., Ltd.
广东省珠海市科技创新海岸魅族科技楼
MEIZU Tech Bldg., Technology & Innovation Coast
Zhuhai, 519085, Guangdong, China
meizu.com



--
Best regards,

Shaofeng Shi




--
Best regards,

Shaofeng Shi




--
Regards,

Bin Mahone | 马洪宾