Date: Mon, 10 Sep 2018 02:25:00 +0000 (UTC)
From: "Dongjoon Hyun (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Assigned] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader

     [ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-25175:
-------------------------------------

    Assignee: Chenxiao Mao

> Field resolution should fail if there's ambiguity for ORC native reader
> -----------------------------------------------------------------------
>
>                 Key: SPARK-25175
>                 URL: https://issues.apache.org/jira/browse/SPARK-25175
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Chenxiao Mao
>            Assignee: Chenxiao Mao
>            Priority: Major
>             Fix For: 2.4.0, 3.0.0
>
>
> SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found that ORC files have similar, but not identical, issues. Spark has two OrcFileFormat implementations.
> * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive dependency. This hive OrcFileFormat always does case-insensitive field resolution, regardless of the case sensitivity mode. When there is ambiguity, the hive OrcFileFormat always returns the first matched field rather than failing the read operation.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution; however, it cannot handle duplicate fields.
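The two behaviors described in the bullets above (return the first match versus fail on ambiguity) can be illustrated with a small standalone sketch. This is not Spark or Hive code: `resolve_field`, `AmbiguousFieldError`, and the `fail_on_ambiguity` flag are hypothetical names chosen for illustration, and the real readers work on ORC schemas rather than plain lists of names.

```python
from typing import List, Optional


class AmbiguousFieldError(Exception):
    """Raised when a requested name matches more than one field in the file."""


def resolve_field(requested: str,
                  file_fields: List[str],
                  case_sensitive: bool,
                  fail_on_ambiguity: bool = True) -> Optional[int]:
    """Return the index of the field matching `requested`, or None if absent.

    Hypothetical helper: with fail_on_ambiguity=False it mimics the
    "first matched field wins" behavior; with True it fails instead.
    """
    if case_sensitive:
        matches = [i for i, name in enumerate(file_fields) if name == requested]
    else:
        wanted = requested.lower()
        matches = [i for i, name in enumerate(file_fields)
                   if name.lower() == wanted]

    if not matches:
        return None  # field is missing from the file
    if len(matches) > 1 and fail_on_ambiguity:
        found = [file_fields[i] for i in matches]
        raise AmbiguousFieldError(
            f"Found duplicate field(s) for {requested!r}: {found}")
    return matches[0]


# A file whose field names differ only by case.
fields = ["a", "A", "b"]
print(resolve_field("A", fields, case_sensitive=True))    # exact match only
print(resolve_field("a", fields, case_sensitive=False,
                    fail_on_ambiguity=False))             # first match wins
```

Calling `resolve_field("a", fields, case_sensitive=False)` with the default flag raises `AmbiguousFieldError`, which corresponds to the failure the ticket title asks for.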
> Besides data source tables, hive serde tables also have issues. If an ORC data file has more fields than the table schema, we just can't read hive serde tables. If the ORC data file does not have more fields, hive serde tables always do field resolution by ordinal rather than by name.
> Both the ORC data source hive impl and hive serde tables rely on the hive orc InputFormat/SerDe to read tables. I'm not sure whether we can change the underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make the read behavior of the ORC data source native impl consistent with the Parquet data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org
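The difference between resolution by ordinal and resolution by name, mentioned in the description for hive serde tables, can be shown with a toy example. This is only a sketch under simplifying assumptions: rows are plain Python lists, and `read_by_ordinal` and `read_by_name` are hypothetical helpers, not functions from Spark or Hive.

```python
from typing import Any, Dict, List


def read_by_ordinal(table_columns: List[str],
                    file_row: List[Any]) -> Dict[str, Any]:
    """Map the i-th table column to the i-th value in the file row."""
    return {col: (file_row[i] if i < len(file_row) else None)
            for i, col in enumerate(table_columns)}


def read_by_name(table_columns: List[str],
                 file_fields: List[str],
                 file_row: List[Any]) -> Dict[str, Any]:
    """Map each table column to the file value stored under the same name."""
    index = {name: i for i, name in enumerate(file_fields)}
    return {col: (file_row[index[col]] if col in index else None)
            for col in table_columns}


# The file stores its fields in a different order than the table schema.
table_columns = ["a", "b"]
file_fields = ["b", "a"]
file_row = [1, 2]  # b = 1, a = 2

print(read_by_ordinal(table_columns, file_row))            # {'a': 1, 'b': 2}
print(read_by_name(table_columns, file_fields, file_row))  # {'a': 2, 'b': 1}
```

When the field order in the file differs from the table schema, the by-ordinal reader silently assigns values to the wrong columns, while the by-name reader returns them under the names they were written with.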