hive-issues mailing list archives

From "Colin Ma (JIRA)" <>
Subject [jira] [Updated] (HIVE-16969) Improvement performance of MapOperator for Parquet
Date Tue, 27 Jun 2017 03:11:00 GMT


Colin Ma updated HIVE-16969:
    Attachment: HIVE-16969.001.patch

With the patch applied, I tested query13 of TPC-DS on my local cluster. The cluster has 6 nodes
with 128G of memory per node, Intel(R) Xeon(R) E5-2680 CPUs, and a 1G network. The test used the
10G data scale with Spark as the execution engine. The tables are stored as Parquet files, and
the largest table has 1825 partitions.
The result shows the execution time dropping from {color:red}85s{color} to {color:#14892c}71s{color},
and the initialization time of MapOperator dropping from {color:red}15s{color} to {color:#14892c}less than

> Improvement performance of MapOperator for Parquet
> --------------------------------------------------
>                 Key: HIVE-16969
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Colin Ma
>            Assignee: Colin Ma
>             Fix For: 3.0.0
>         Attachments: HIVE-16969.001.patch
> For a table with many partition files, MapOperator.cloneConfsForNestedColPruning() will
update the many times. The larger value of
causes poor performance in ParquetHiveSerDe.processRawPrunedPaths(). 
> So, the unnecessary paths should not be appended to
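
The growth problem described above can be sketched roughly as follows. This is only an illustration, not the attached patch: the class PrunedPathsSketch and its methods are hypothetical, and it assumes the pruned paths are held as a single comma-separated string that is updated once per partition.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not Hive code: contrasts a naive append with a
// deduplicating append for a comma-separated list of column paths.
final class PrunedPathsSketch {
    private PrunedPathsSketch() {}

    /** Naive append: the value grows on every call, even for repeated paths. */
    static String appendAlways(String existing, String toAdd) {
        String left = existing == null ? "" : existing;
        String right = toAdd == null ? "" : toAdd;
        if (left.isEmpty()) return right;
        if (right.isEmpty()) return left;
        return left + "," + right;
    }

    /** Deduplicating append: keeps first-seen order and skips known paths. */
    static String appendUnique(String existing, String toAdd) {
        Set<String> paths = new LinkedHashSet<>();
        addAll(paths, existing);
        addAll(paths, toAdd);
        return String.join(",", paths);
    }

    // Splits a comma-separated string and collects the non-empty entries.
    private static void addAll(Set<String> out, String csv) {
        if (csv == null) return;
        for (String p : csv.split(",")) {
            String trimmed = p.trim();
            if (!trimmed.isEmpty()) out.add(trimmed);
        }
    }
}
```

If every one of the 1825 partitions contributes the same paths, appendAlways yields a value that grows linearly with the partition count, while appendUnique leaves it at its original size. Any later step that has to parse the value does correspondingly less work.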

This message was sent by Atlassian JIRA
