Date: Thu, 1 Feb 2018 02:19:00 +0000 (UTC)
From: "Deepak Jaiswal (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Updated] (HIVE-18350) load data should rename files consistent with insert statements

     [ https://issues.apache.org/jira/browse/HIVE-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deepak Jaiswal updated HIVE-18350:
----------------------------------
    Attachment: HIVE-18350.7.patch

> load data should rename files consistent with insert statements
> ---------------------------------------------------------------
>
>                 Key: HIVE-18350
>                 URL: https://issues.apache.org/jira/browse/HIVE-18350
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Deepak Jaiswal
>            Assignee: Deepak Jaiswal
>            Priority: Major
>         Attachments: HIVE-18350.1.patch, HIVE-18350.2.patch, HIVE-18350.3.patch, HIVE-18350.4.patch, HIVE-18350.5.patch, HIVE-18350.6.patch, HIVE-18350.7.patch
>
>
> Insert statements create files whose names end with 0000_0, 0001_0, etc. Load data, however, keeps the input file name. The result is an inconsistent naming convention, which makes SMB joins difficult in some scenarios and may cause trouble for other types of queries in the future.
> We need a consistent naming convention.
> For a non-bucketed table, Hive renames all files regardless of how the user named them.
> For a bucketed table, in non-strict mode Hive relies on the user to name the files so that they match the bucket, and it assumes that all data in a file belongs to the same bucket. In strict mode, loading into a bucketed table is disabled.
> This will likely affect most of the tests that load data. Because that is a significant change, the work is divided into two subtasks for a smoother merge.
> For existing tables in a customer database, reloading bucketed tables is recommended. Otherwise, if the customer runs an SMB join and there is a bucket for which there is no split, incorrect results are possible. This is not a regression, however, since it would happen even without the patch.
> With this patch and reloaded data, the results should be correct.
> For non-bucketed tables and external tables, there is no difference in behavior and reloading data is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
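
The naming rules described in the issue can be sketched roughly as follows. This is a minimal Python illustration, not Hive's implementation and not the contents of the attached patch. The function names, the six-digit zero padding, the sorted assignment order for non-bucketed tables, and the rule of reading the bucket id from the leading digits of the user's file name are all assumptions made for the example.

```python
import re


def insert_style_name(task_id: int, attempt: int = 0) -> str:
    """Format an insert-style file name such as 000001_0.

    The six-digit zero padding is an assumption of this sketch.
    """
    return f"{task_id:06d}_{attempt}"


def plan_load_renames(file_names, bucketed=False, strict=True):
    """Return a {source name: target name} mapping for a hypothetical load.

    Hypothetical helper for illustration only.
    """
    if not bucketed:
        # Non-bucketed table: every file is renamed, whatever the user
        # called it. Assigning ids in sorted order is an assumption.
        return {
            name: insert_style_name(i)
            for i, name in enumerate(sorted(file_names))
        }

    if strict:
        # Strict mode: loading into a bucketed table is disabled.
        raise ValueError("load into a bucketed table is disabled in strict mode")

    # Non-strict mode: trust the user's file name to identify the bucket.
    # Reading the bucket id from leading digits is an assumed rule.
    plan = {}
    for name in file_names:
        match = re.match(r"(\d+)", name)
        if match is None:
            raise ValueError(f"cannot infer a bucket id from {name!r}")
        plan[name] = insert_style_name(int(match.group(1)))
    return plan
```

For example, under these assumptions `plan_load_renames(["b.csv", "a.csv"])` maps `a.csv` to `000000_0` and `b.csv` to `000001_0`, while a bucketed load in non-strict mode leaves a file already named `000001_0` unchanged.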