kylin-issues mailing list archives

From "Vsevolod Ostapenko (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KYLIN-3122) Partition elimination algorithm seems to be inefficient and have serious issues with handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions
Date Wed, 20 Dec 2017 22:51:00 GMT

     [ https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vsevolod Ostapenko updated KYLIN-3122:
--------------------------------------
    Description: 
The current cube segment elimination algorithm seems to be rather inefficient.
We are using a model where cubes are partitioned by date and time:
"partition_desc": {
    "partition_date_column": "A_VL_HOURLY_V.THEDATE",
    "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
    "partition_date_start": 0,
    "partition_date_format": "yyyyMMdd",
    "partition_time_format": "HH",
    "partition_type": "APPEND",
    "partition_condition_builder": "org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
  },

Cubes contain partitions for multiple days, with 24 hourly partitions per day. Each cube
segment corresponds to just one hour.
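
For illustration (the snippet below is a hypothetical sketch, not Kylin code), under the
configured "yyyyMMdd"/"HH" formats a (thedate, thehour) pair identifies exactly one hourly
window, i.e. exactly one cube segment:

import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustration only (not Kylin API): under the configured formats
// "yyyyMMdd" / "HH", a (thedate, thehour) pair pins down exactly one
// hourly window, i.e. exactly one cube segment.
public class HourlyPartitionKey {
    private static final DateTimeFormatter DATE_FMT =
            DateTimeFormatter.ofPattern("yyyyMMdd");

    // Start of the hourly window the given date/hour pair falls into.
    static LocalDateTime windowStart(String thedate, String thehour) {
        return LocalDate.parse(thedate, DATE_FMT)
                        .atTime(Integer.parseInt(thehour), 0);
    }

    public static void main(String[] args) {
        // thedate = '20171011' and thehour = '00' -> 2017-10-11T00:00,
        // which identifies a single hourly segment.
        System.out.println(windowStart("20171011", "00"));
    }
}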

When a query is issued where both date and hour are specified with equality conditions (e.g.
thedate = '20171011' and thehour = '00'), Kylin sequentially iterates over all the cube
segments (hundreds of them) only to skip every one except the single segment that actually
needs to be scanned (which can be observed in the logs).
The expectation is that Kylin would use the existing information about the partitioning
columns (date and time) and the known hierarchical relation between date and hour to locate
the required partition much more efficiently than a linear scan through all the cube partitions.
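
As a minimal sketch of the expected behavior for such an equality filter (the class and data
structure below are hypothetical, not Kylin internals): with hourly segments keyed by their
start instant, the single matching segment can be found with one ordered-map lookup instead
of a pass over every segment.

import java.time.LocalDateTime;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch, not Kylin internals: hourly segments keyed by their
// inclusive start instant; each segment covers [start, start + 1 hour).
public class SegmentLookup {
    private final NavigableMap<LocalDateTime, String> segmentsByStart = new TreeMap<>();

    void addSegment(LocalDateTime start, String segmentName) {
        segmentsByStart.put(start, segmentName);
    }

    // thedate = '20171011' and thehour = '00' reduces to a single instant,
    // so one ordered-map lookup finds the only candidate segment in
    // O(log n) instead of iterating over hundreds of segments.
    String segmentFor(LocalDateTime point) {
        Map.Entry<LocalDateTime, String> e = segmentsByStart.floorEntry(point);
        if (e == null || !point.isBefore(e.getKey().plusHours(1))) {
            return null; // no segment covers this instant
        }
        return e.getValue();
    }

    public static void main(String[] args) {
        SegmentLookup lookup = new SegmentLookup();
        for (int hour = 0; hour < 24; hour++) {
            lookup.addSegment(LocalDateTime.of(2017, 10, 11, hour, 0),
                              String.format("segment_20171011_%02d", hour));
        }
        // Prints segment_20171011_00 -- the only segment worth scanning.
        System.out.println(lookup.segmentFor(LocalDateTime.of(2017, 10, 11, 0, 0)));
    }
}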

Moreover, if the filter condition is on a range of hours, the partition pruning and scanning
behavior becomes illogical, which suggests bugs in the logic.

If the condition is on a specific date and a closed-open range of hours (e.g. thedate = '20171011'
and thehour >= '10' and thehour < '11'), then in addition to sequentially scanning all the
cube partitions (as described above), Kylin scans the HBase regions for every hour from the
starting hour through the last hour of the day (e.g. from hour 10 to hour 24).
As a result, the query runs much longer than necessary and might run out of memory, causing a
JVM heap dump and a Kylin server crash.
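
The expected behavior for such a filter would be a standard half-open interval overlap test,
which keeps only the hour-10 segment. A minimal sketch under that assumption (hypothetical
code, not Kylin's implementation):

import java.time.LocalDateTime;

// Hypothetical sketch, not Kylin code. The filter
// thedate = '20171011' and thehour >= '10' and thehour < '11'
// is the half-open window [2017-10-11T10:00, 2017-10-11T11:00). A segment
// covering [segStart, segEnd) overlaps that window only if
// segStart < filterEnd && filterStart < segEnd, so exactly one hourly
// segment (hour 10) should survive pruning, not hours 10 through 24.
public class HalfOpenPruning {
    static boolean overlaps(LocalDateTime segStart, LocalDateTime segEnd,
                            LocalDateTime filterStart, LocalDateTime filterEnd) {
        return segStart.isBefore(filterEnd) && filterStart.isBefore(segEnd);
    }

    public static void main(String[] args) {
        LocalDateTime filterStart = LocalDateTime.of(2017, 10, 11, 10, 0);
        LocalDateTime filterEnd   = LocalDateTime.of(2017, 10, 11, 11, 0);
        for (int hour = 0; hour < 24; hour++) {
            LocalDateTime segStart = LocalDateTime.of(2017, 10, 11, hour, 0);
            LocalDateTime segEnd   = segStart.plusHours(1);
            if (overlaps(segStart, segEnd, filterStart, filterEnd)) {
                // Prints only "scan segment for hour 10".
                System.out.println("scan segment for hour " + hour);
            }
        }
    }
}

Running this keeps a single hourly segment, whereas the observed behavior scans the regions
for hours 10 through 24.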


If the condition is on a specific date but the hour interval is specified as open-closed (e.g.
thedate = '20171011' and thehour > '10' and thehour <= '11'), Kylin scans the HBase regions
for all later dates and hours (e.g. from hour 10 through the most recent hour of the most
recent day).
As a result, query execution time increases dramatically, and in most cases the Kylin server
is terminated with an OOM error and a JVM heap dump.
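
The expected behavior here would be to normalize the open/closed hour bounds into a half-open
scan window that covers only hour 11 of that date. A minimal sketch under that assumption
(hypothetical code, not Kylin's implementation):

import java.time.LocalDate;
import java.time.LocalDateTime;

// Hypothetical sketch, not Kylin code: converting open/closed bounds on the
// discrete hour column into the half-open time window a segment scan should use.
// thehour > '10' and thehour <= '11' admits only hour 11, so the scan window is
// [2017-10-11T11:00, 2017-10-11T12:00) -- one segment, not every segment from
// hour 10 up to the most recent hour of the most recent day.
public class HourBoundsToWindow {
    static LocalDateTime[] window(LocalDate day,
                                  int lowerHour, boolean lowerInclusive,
                                  int upperHour, boolean upperInclusive) {
        int firstHour = lowerInclusive ? lowerHour : lowerHour + 1;
        int endHourExclusive = upperInclusive ? upperHour + 1 : upperHour;
        LocalDateTime start = day.plusDays(firstHour / 24).atTime(firstHour % 24, 0);
        LocalDateTime end = day.plusDays(endHourExclusive / 24).atTime(endHourExclusive % 24, 0);
        return new LocalDateTime[] { start, end };
    }

    public static void main(String[] args) {
        LocalDateTime[] w = window(LocalDate.of(2017, 10, 11), 10, false, 11, true);
        // Prints: scan window: [2017-10-11T11:00, 2017-10-11T12:00)
        System.out.println("scan window: [" + w[0] + ", " + w[1] + ")");
    }
}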


> Partition elimination algorithm seems to be inefficient and have serious issues with
handling date/time ranges, can lead to very slow queries and OOM/Java heap dump conditions
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-3122
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3122
>             Project: Kylin
>          Issue Type: Bug
>          Components: Storage - HBase
>    Affects Versions: v2.2.0
>         Environment: HDP 2.5.6, Kylin 2.2.0
>            Reporter: Vsevolod Ostapenko
>            Assignee: hongbin ma
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
