hive-issues mailing list archives

From "Chaozhong Yang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-16972) FetchOperator: filter out inputSplits which length is zero
Date Tue, 27 Jun 2017 08:05:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-16972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chaozhong Yang updated HIVE-16972:
----------------------------------
    Attachment: screenshot-1.png

> FetchOperator: filter out inputSplits which length is zero
> ----------------------------------------------------------
>
>                 Key: HIVE-16972
>                 URL: https://issues.apache.org/jira/browse/HIVE-16972
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2, Physical Optimizer, Query Planning
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: Chaozhong Yang
>            Assignee: Chaozhong Yang
>             Fix For: 2.1.2
>
>         Attachments: HIVE-16972.patch, screenshot-1.png
>
>
> * Background
>    The basic workflow of a common HQL query can be described as follows:
>   1. compile and execute
>   2. fetch results
>   In many cases, we don't need to worry about fetching results from HDFS (iff
> MapReduce jobs are generated in the planning step). However, the number of result files
> on HDFS and the way data is distributed across them affect the final status of an HQL query,
> especially for HiveServer2. We have some map-only queries, e.g.:
> {code:sql}
> select * from myTable where date > '20170201' and date <= '20170301' and id = 88;
> {code}
>     This query generates more than 10,000 files on HDFS, and most of those files are
> empty, so the results are very sparse. If a HiveServer2 client sends a TFetchResultsRequest
> with parameters such as timeout: 90s and maxRows: 1024, FetchOperator cannot fetch 1024 rows
> within 90 seconds, and the client marks the TFetchResultsRequest as a timeout
> failure. Why? Fetching results from an empty file is expensive. In our HDFS cluster
> (5000+ DataNodes), reading from an empty file costs almost 100 ms, so 1,000 empty files
> already take about 100 s (100 ms * 1000 = 100 s > 90 s timeout). We can therefore filter out
> those empty files or splits to speed up the FetchResults process.
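
The idea can be illustrated with a minimal sketch. This is not the attached patch and not Hive's FetchOperator code (which is Java); it is a language-neutral illustration in Python. The `Split` class, the `filter_non_empty` and `estimated_open_cost_s` helpers, and the `part-…` file names are all hypothetical. Only the figure of roughly 100 ms per empty file comes from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for an input split: a file path and its length in bytes."""
    path: str
    length: int


def filter_non_empty(splits: Iterable[Split]) -> List[Split]:
    """Keep only splits that hold at least one byte, preserving their order."""
    return [s for s in splits if s.length > 0]


def estimated_open_cost_s(num_splits: int, per_split_ms: float = 100.0) -> float:
    """Rough cost, in seconds, of opening num_splits splits one after another.

    The 100 ms default is the per-file cost reported in this issue; it is an
    observation from one cluster, not a general constant.
    """
    return num_splits * per_split_ms / 1000.0


if __name__ == "__main__":
    # 1,000 empty result files plus 2 files that actually contain rows
    # (file names are made up for the example).
    splits = [Split(f"part-{i:05d}", 0) for i in range(1000)]
    splits += [Split("part-01000", 4096), Split("part-01001", 512)]

    kept = filter_non_empty(splits)
    print("before:", len(splits), "splits,", estimated_open_cost_s(len(splits)), "s")
    print("after: ", len(kept), "splits,", estimated_open_cost_s(len(kept)), "s")
```

Under these assumptions, dropping zero-length splits before reading means the fetch time depends on the number of files that contain rows rather than on the total number of result files.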



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
