drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4982) Hive Queries degrade when queries switch between different formats
Date Thu, 08 Dec 2016 05:41:59 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731187#comment-15731187 ]

ASF GitHub Bot commented on DRILL-4982:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/638#discussion_r91448234
  
    --- Diff: contrib/storage-hive/core/src/main/codegen/data/HiveFormats.tdd ---
    @@ -0,0 +1,50 @@
    +# Licensed to the Apache Software Foundation (ASF) under one
    +# or more contributor license agreements.  See the NOTICE file
    +# distributed with this work for additional information
    +# regarding copyright ownership.  The ASF licenses this file
    +# to you under the Apache License, Version 2.0 (the
    +# "License"); you may not use this file except in compliance
    +# with the License.  You may obtain a copy of the License at
    +#
    +# http:# www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +
    +{
    --- End diff --
    
    Can we explain this a bit? We have 6 reader types, but the only difference in the generated code is whether the reader handles a header/footer or not. Can we solve the Java optimization problem with a "classic" type hierarchy:
    
    ```
    HiveAbstractReader
    . HiveSimpleReader
    . . HiveAvroReader
    . . ...
    . HiveHeaderFooterReader
    . . HiveTextReader
    . . ...
    ```
    
    Names are just made up. The point is: can a much simpler Java hierarchy, with less duplicated code, solve the problem? If only one function is poorly optimized, can just that one function be generated in the subclass rather than generating duplicate code?
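
One way to read the hierarchy proposed above is the minimal Java sketch below. It is an illustration only, not code from the pull request: the class names follow the made-up names in the comment, while the `read` and `skip` methods, the constructor parameters, and the use of plain string rows are all assumptions introduced here. It shows the shared read loop written once in the base class, with header/footer handling confined to one branch of the hierarchy. Whether this layout avoids the Java optimization problem is exactly the open question the comment raises.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: holds the read loop shared by every format.
abstract class HiveAbstractReader {
    // The single format-specific hook (an assumption of this sketch):
    // subclasses decide which rows to drop.
    protected abstract boolean skip(int rowIndex, int totalRows);

    // Shared logic, written once instead of once per reader type.
    public final List<String> read(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (!skip(i, rows.size())) {
                out.add(rows.get(i));
            }
        }
        return out;
    }
}

// Formats without header/footer handling keep every row.
class HiveSimpleReader extends HiveAbstractReader {
    @Override
    protected boolean skip(int rowIndex, int totalRows) {
        return false;
    }
}

// A concrete simple format; it inherits everything.
class HiveAvroReader extends HiveSimpleReader {
}

// Formats that may carry header and footer rows.
class HiveHeaderFooterReader extends HiveAbstractReader {
    private final int headerRows;
    private final int footerRows;

    HiveHeaderFooterReader(int headerRows, int footerRows) {
        this.headerRows = headerRows;
        this.footerRows = footerRows;
    }

    @Override
    protected boolean skip(int rowIndex, int totalRows) {
        // Drop the leading header rows and the trailing footer rows.
        return rowIndex < headerRows || rowIndex >= totalRows - footerRows;
    }
}

// A concrete header/footer format; it only forwards its configuration.
class HiveTextReader extends HiveHeaderFooterReader {
    HiveTextReader(int headerRows, int footerRows) {
        super(headerRows, footerRows);
    }
}
```

In this sketch, `new HiveTextReader(1, 1).read(rows)` drops the first and last row, while `new HiveAvroReader().read(rows)` returns every row unchanged.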


> Hive Queries degrade when queries switch between different formats
> ------------------------------------------------------------------
>
>                 Key: DRILL-4982
>                 URL: https://issues.apache.org/jira/browse/DRILL-4982
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Chunhui Shi
>            Assignee: Karthikeyan Manivannan
>            Priority: Critical
>             Fix For: 1.10.0
>
>
> We have seen degraded performance by doing these steps:
> 1) generate the repro data:
> python script repro.py as below:
> import string
> import random
>  
> for i in range(30000000):
>     x1 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x2 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x3 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x4 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x5 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     x6 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19, 27)))
>     print "{0}".format(x1),"{0}".format(x2),"{0}".format(x3),"{0}".format(x4),"{0}".format(x5),"{0}".format(x6)
> python repro.py > repro.csv
> 2) put these files in a dfs directory e.g. '/tmp/hiveworkspace/plain'. Under hive prompt, use the following sql command to create an external table:
> CREATE EXTERNAL TABLE `hiveworkspace`.`plain` (`id1` string, `id2` string, `id3` string, `id4` string, `id5` string, `id6` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION '/tmp/hiveworkspace/plain'
> 3) create Hive's table of ORC|PARQUET format:
> CREATE TABLE `hiveworkspace`.`plainorc` STORED AS ORC AS SELECT id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
> CREATE TABLE `hiveworkspace`.`plainparquet` STORED AS PARQUET AS SELECT id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
> 4) Switch queries between these two tables; the query time on the same table then lengthens significantly. On my setup, for ORC, it went from 15 sec to 26 sec. Queries on tables of other formats, after injecting a query against a different format, all show a significant slowdown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
