drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chunhui Shi (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-4982) Hive Queries degrade when queries switch between different formats
Date Sun, 30 Oct 2016 08:04:58 GMT
Chunhui Shi created DRILL-4982:
----------------------------------

             Summary: Hive Queries degrade when queries switch between different formats
                 Key: DRILL-4982
                 URL: https://issues.apache.org/jira/browse/DRILL-4982
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Chunhui Shi
            Assignee: Chunhui Shi
            Priority: Critical


We have seen degraded performance by doing these steps:
1) generate the repro data:

python script repro.py as below:

import string
import random
 
for i in range(30000000):
    x1 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19,
27)))
    x2 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19,
27)))
    x3 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19,
27)))
    x4 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19,
27)))
    x5 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19,
27)))
    x6 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randrange(19,
27)))

    print "{0}".format(x1),"{0}".format(x2),"{0}".format(x3),"{0}".format(x4),"{0}".format(x5),"{0}".format(x6)


python repro.py > repro.csv

2) put these files in a dfs directory e.g. '/tmp/hiveworkspace/plain'. Under hive prompt,
use the following sql command to create an external table:
CREATE EXTERNAL TABLE `hiveworkspace`.`plain` (`id1` string, `id2` string, `id3` string, `id4`
string, `id5` string, `id6` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE LOCATION '/tmp/hiveworkspace/plain'

3) create Hive's table of ORC|PARQUET format:
CREATE TABLE `hiveworkspace`.`plainorc` STORED AS ORC AS SELECT id1,id2,id3,id4,id5,id6 from
`hiveworkspace`.`plain`;
CREATE TABLE `hiveworkspace`.`plainparquet` STORED AS PARQUET AS SELECT id1,id2,id3,id4,id5,id6
from `hiveworkspace`.`plain`;

4) Query switch between these two tables, then the query time on the same table significantly
lengthened. On my setup, for ORC, it was 15sec -> 26secs. Queries on table of other formats,
after injecting a query to other formats, all have significant slow down.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message