From: "He Yongqiang (JIRA)"
To: hive-dev@hadoop.apache.org
Date: Thu, 10 Jan 2013 18:28:13 +0000 (UTC)
Subject: [jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549870#comment-13549870 ]

He Yongqiang commented on HIVE-3874:
------------------------------------

bq. It would be possible to extend the RCFile reader to recognize an ORC file and to have it delegate to the ORC File reader.

It would be great to have this support. In that case, what is the file format recorded for the partition/table: rcfile or orcfile?
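To make the "recognize and delegate" idea concrete, here is a rough sketch of sniffing a file's leading bytes and comparing the result with the format declared for a table or partition. It is only an illustration, not Hive code: the function names `sniff_format` and `find_mismatches` are hypothetical, and it assumes that ORC files start with the bytes "ORC", that current RCFiles start with "RCF", and that older RCFiles share the SequenceFile magic "SEQ".

```python
import io

# Assumed leading magic bytes (see the note above); not taken from this thread.
ORC_MAGIC = b"ORC"
RCFILE_MAGIC = b"RCF"
SEQ_MAGIC = b"SEQ"  # SequenceFile, also used by older RCFiles


def sniff_format(stream):
    """Guess a file's format from its first three bytes."""
    head = stream.read(3)
    if head == ORC_MAGIC:
        return "orc"
    if head == RCFILE_MAGIC:
        return "rcfile"
    if head == SEQ_MAGIC:
        return "seq-or-legacy-rcfile"
    return "unknown"


def find_mismatches(declared, files):
    """Return (name, sniffed_format) for each file that differs from `declared`.

    `files` maps a file name to a binary stream positioned at offset 0.
    """
    mismatches = []
    for name, stream in files.items():
        actual = sniff_format(stream)
        if actual != declared:
            mismatches.append((name, actual))
    return mismatches


# Hypothetical partition declared as rcfile but holding one ORC file.
partition_files = {
    "000000_0": io.BytesIO(b"RCF\x01rest-of-file"),
    "000001_0": io.BytesIO(b"ORCrest-of-file"),
}
print(find_mismatches("rcfile", partition_files))
```

A reader that delegates could branch on the sniffed value in the same way, which is why the question of what the metadata should record still matters: the check above can only report a mismatch, not decide which side is right.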
When we converted old data from sequencefile to rcfile a long time ago, handling errors like "unrecognized fileformat or corruption" was a big headache, because there is no interoperability between the two formats. Most of the errors we saw occurred because the table/partition format did not match the actual data format. Two examples:
1. An old partition's data is rcfile, and a new partition's data is in orc format.
2. Within one partition, some files are rcfile and some are in orc format.

> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
>                 Key: HIVE-3874
>                 URL: https://issues.apache.org/jira/browse/HIVE-3874
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file