kylin-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joel Victor <>
Subject Re: KYLIN-1656
Date Wed, 03 Aug 2016 01:56:31 GMT
Hey ShaoFeng,

My table is not a view. It is partitioned on day and action of the event
and bucketed as well with bucket count four.
The property hive.compute.query.using.stats is also true.
Here is the query for one half hour interval. This step will not be
instantaneous as it needs to filter out data for a half hour interval.

SELECT count(*)
(action_type in ('a', 'o', 'd'))
(((EVENTS.EVENT_DATE = '2016-06-01' AND EVENTS.EVENT_TIME >= '00:00:00') OR
(EVENTS.EVENT_DATE > '2016-06-01')) AND ((EVENTS.EVENT_DATE = '2016-06-01'
AND EVENTS.EVENT_TIME < '00:30:00') OR (EVENTS.EVENT_DATE < '2016-06-01')))

Usually if the query had a where clause of where day='2016-06-01' AND
action_type IN ('a', 'o','d') or if the query was select count(*) from the count step would not even spawn a map reduce job because
it would fetch all this data from metastore.

Also please correct me if I am wrong but going through the JIRA and then
investigating the build job I think this is a potential bug since the count
should run on the intermediate table and not on the source table which is
not happening in the case of my build jobs.
The count step running on intermediate table would be instantaneous with
hive.compute.query.using.stats enabled because only
select count(*) from intermediate_table
would run with no complex filter clauses.

Also even for daily refreshes it has complex where clauses which contribute
to the count step.

I looked into the count step a bit more. Here are some relevant stats.
I already have days worth of data in the partition and I am refreshing for
one half hour interval.

No bytes read: 23172943556
No of input records: 486,062,421
No of output records: 9,000,465

The count job consisted of 12 mappers and 1 reducer. The average mapper
time was around 1 minutes 35 second.

@ShaoFeng For starters we should think about making this feature optional
(A per cube config until we find a better solution). This feature is useful
when data sets are really large, the size of the distinct count step
contributes much more to the build latency and also where the count step is
not complex.

To add to that the size of the table created and the MR URL for Create
Intermediate Flat Hive Table has disappeared. Screenshot for reference


On Wed, Aug 3, 2016 at 6:22 AM, ShaoFeng Shi <> wrote:

> Hi Joel,
> I'm also curious about how the new count step took 6 minutes, which is
> about 1/3 of the total time; if your fact table a Hive View? or it is not
> partitioned? Usually that step should be much faster than others;
> Thanks.
> 2016-08-02 22:58 GMT+08:00 ShaoFeng Shi <>:
>> Hi Joel,
>> I see your point, but I don't have a perfect idea so far; We can discuss
>> here, any comment is welcomed.
>> 2016-08-02 18:43 GMT+08:00 Joel Victor <>:
>>> Is there any way to disable this new step that has been added to the
>>> build process.
>>> This adds a new step which counts the number of records at the beginning
>>> of each build. For my cube builds it does not benefit me much since my
>>> build latencies have gone up from 11 minutes to 18 minutes where
>>> approximately 6 minutes is taken up by this new count step.
>>> My extract fact table distinct column used to take ~3 minutes and now
>>> takes ~2 but at the price of a 6 minute increased latency.
>>> Going through patch I don't see anyway to disable it. Please let me know
>>> if there is any.
>>> NOTE: My cube builds are for every 30 minutes of data which is a very
>>> small interval and also has less amount of data. It seems this count step
>>> isn't very beneficial when the data is small.
>>> Thanks,
>>> -Joel
>> --
>> Best regards,
>> Shaofeng Shi
> --
> Best regards,
> Shaofeng Shi

View raw message