Date: Tue, 27 Jun 2017 07:20:00 +0000 (UTC)
From: "Chaozhong Yang (JIRA)"
To: dev@hive.apache.org
Reply-To: dev@hive.apache.org
Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Subject: [jira] [Created] (HIVE-16972) FetchOperator: filter out inputSplits which length is zero

Chaozhong Yang created HIVE-16972:
-------------------------------------

             Summary: FetchOperator: filter out inputSplits which length is zero
                 Key: HIVE-16972
                 URL: https://issues.apache.org/jira/browse/HIVE-16972
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2, Physical Optimizer, Query Planning
    Affects Versions: 2.1.1, 2.1.0
            Reporter: Chaozhong Yang
            Assignee: Chaozhong Yang
             Fix For: 2.1.2

* Background

The basic workflow of a typical HQL query is:

1. compile and execute
2. fetch results

In many cases we don't need to worry about fetching results from HDFS (this applies only if MapReduce jobs were generated in the planning step). However, the number of result files on HDFS and the distribution of data across them can affect the final status of an HQL query, especially on HiveServer2. We have some map-only queries, e.g.:

{code:sql}
select * from myTable where date > '20170201' and date <= '20170301' and id = 88;
{code}

This query generates more than 10,000 files on HDFS, and most of those files are empty, so the results are very sparse.

If a HiveServer2 client sends a TFetchResultsRequest with the parameters timeout: 90s and maxRows: 1024, FetchOperator cannot fetch 1024 rows within 90 seconds, and the client marks the TFetchResultsRequest as failed due to timeout.

Why? Fetching results from an empty file is expensive. In our HDFS cluster (5000+ DataNodes), reading from an empty file costs almost 100 ms, so scanning just 1,000 empty files takes about 100 s (100 ms * 1000 = 100 s), which already exceeds the 90 s timeout.

We can filter out those empty files or splits to speed up FetchResults.
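
The proposed filtering can be sketched as follows. This is only an illustration, not the actual patch: SizedSplit, EmptySplitFilter, dropEmpty and estimatedOpenCostMillis are hypothetical names made up for this example, and SizedSplit stands in for Hadoop's org.apache.hadoop.mapred.InputSplit, whose real getLength() can throw IOException. It assumes a reported length of zero really means the split holds no rows, which should be checked for each input format before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Hadoop input split; only the length matters here. */
interface SizedSplit {
    long getLength();
}

/** Hypothetical helper, not part of Hive: drops splits that report zero length. */
final class EmptySplitFilter {
    private EmptySplitFilter() {}

    /** Returns the splits whose reported length is greater than zero, preserving order. */
    static <T extends SizedSplit> List<T> dropEmpty(List<T> splits) {
        List<T> kept = new ArrayList<>();
        for (T split : splits) {
            if (split.getLength() > 0) {
                kept.add(split);
            }
        }
        return kept;
    }

    /** Rough cost model from the report: each split that is opened costs perSplitMillis. */
    static long estimatedOpenCostMillis(int splitCount, long perSplitMillis) {
        return splitCount * perSplitMillis;
    }
}
```

With the numbers from the report, estimatedOpenCostMillis(1000, 100) is 100,000 ms, already past the 90 s timeout, so under this cost model the time spent on empty splits drops to zero once dropEmpty has been applied before any reader is opened.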