Date: Mon, 29 Jan 2018 20:59:00 +0000 (UTC)
From: "gurmukh singh (JIRA)"
To: dev@hive.apache.org
Subject: [jira] [Created] (HIVE-18572) The record readers; InputFormat needs to be fixed for Tez as it generates 1 split

gurmukh singh created HIVE-18572:
------------------------------------

             Summary: The record readers; InputFormat needs to be fixed for Tez as it generates 1 split
                 Key: HIVE-18572
                 URL: https://issues.apache.org/jira/browse/HIVE-18572
             Project: Hive
          Issue Type: Bug
    Affects Versions: 2.1.0
            Reporter: gurmukh singh

The record reader needs to be fixed for Tez, as it generates only 1 split because "MRv2 CombineInputFormat broke that rule" (see the discussion quoted below). This has been fixed in MR but not in Tez.

I am seeing strange behaviour in Tez: under Hive it sees all the data as a single split, whereas MR sees all 79 files.
This is causing all the data to go to a single map task.

Tez processing:

INFO  : Partition trusted.usage{ds=20180126, periode=1200} stats: [numFiles=1, numRows=79575067, totalSize=3.164.605.993, rawDataSize=112439569671]
ELAPSED TIME: 1958.99 s

MR processing:

Partition trusted.usage{ds=20180126, periode=1200} stats: [numFiles=79, numRows=79575067, totalSize=3172280778, rawDataSize=112418416260]
ELAPSED TIME: 65 s

Tez log:

2018-01-29 16:50:04,825 [INFO] [InputInitializer {Map 1} #0] |split.TezMapredSplitsGrouper|: Desired splits: 381 too large.  Desired splitLength: 8311476 Min splitLength: 50331648 New desired splits: 381 Final desired splits: 381 All splits have localhost: false Total length: 19166265870 Original splits: 1
2018-01-29 16:50:04,825 [INFO] [InputInitializer {Map 1} #0] |split.TezMapredSplitsGrouper|: Using original number of splits: 1 desired splits: 381
2018-01-29 16:50:04,826 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Original split size is 1 grouped split size is 1, for bucket: 1
2018-01-29 16:50:04,827 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of grouped splits: 1
2018-01-29 16:50:04,846 [INFO] [InputInitializer {Map 1} #0] |dag.RootInputInitializerManager|: Succeeded InputInitializer for Input: usage on vertex vertex_1517207496169_0085_1_00 [Map 1]
2018-01-29 16:50:04,848 [INFO] [App Shared Pool - #0] |impl.VertexImpl|: Cannot init vertex: vertex_1517207496169_0085_1_00 [Map 1] numTasks: -1 numUnitializedEdges: 0 numInitializedInputs: 1 initWaitsForRootInitializers: true
2018-01-29 16:50:04,848 [INFO] [App Shared Pool - #0] |impl.VertexImpl|: Got updated RootInputsSpecs: {usage=forAllWorkUnits=true, update=[1]}
2018-01-29 16:50:04,859 [INFO] [App Shared Pool - #0] |impl.VertexImpl|: Vertex vertex_1517207496169_0085_1_00 [Map 1] parallelism set to 1

As per discussion with Gopal Vijayaraghavan:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L494

That line, right there. MRv2 CombineInputFormat broke that rule, so the record readers had to be fixed to handle it:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java#L312
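
For readers following the numbers in the Tez log above, here is a minimal sketch of the arithmetic they appear to reflect. It is not the Tez implementation: the two helper functions are hypothetical, and the floor-plus-one rounding is an assumption chosen because it reproduces the logged value of 381. The second helper mirrors the log line "Using original number of splits: 1 desired splits: 381", which suggests that grouping does not produce more splits than the input format returned.

```python
# Hypothetical helpers illustrating the split counts seen in the Tez log.
# They are NOT Tez or Hive APIs; constants are copied from the log lines.

def desired_split_count(total_length: int, min_split_length: int) -> int:
    """Splits wanted when the per-split length is raised to the minimum.

    Assumption: floor division plus one, which matches the logged 381.
    """
    return total_length // min_split_length + 1


def final_split_count(original_splits: int, desired_splits: int) -> int:
    """Grouping may merge splits but does not create new ones."""
    return min(original_splits, desired_splits)


TOTAL_LENGTH = 19166265870    # "Total length" from the log
MIN_SPLIT_LENGTH = 50331648   # "Min splitLength" from the log (48 MiB)

desired = desired_split_count(TOTAL_LENGTH, MIN_SPLIT_LENGTH)

# The input format returned a single split, so one map task does all the work.
tez_tasks = final_split_count(original_splits=1, desired_splits=desired)

# Hypothetical case: one split per file (79 files, as MR sees them).
per_file_tasks = final_split_count(original_splits=79, desired_splits=desired)

print(desired, tez_tasks, per_file_tasks)  # prints: 381 1 79
```

Under these assumptions the grouper wants 381 splits but cannot exceed the single split it was handed, so the vertex parallelism ends up at 1. The fix therefore has to happen where the splits are generated, not in the grouping step.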