Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Thu, 2 Nov 2017 22:27:00 +0000 (UTC)
From: "Sergey Shelukhin (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.13100002.1504660935000.137262.1509661620149@Atlassian.JIRA>
In-Reply-To: <JIRA.13100002.1504660935000@Atlassian.JIRA>
References: <JIRA.13100002.1504660935000@Atlassian.JIRA> <JIRA.13100002.1504660935915@jira-lw-us.apache.org>
Subject: [jira] [Commented] (HIVE-17458) VectorizedOrcAcidRowBatchReader
 doesn't handle 'original' files
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Thu, 02 Nov 2017 22:27:04 -0000


    [ https://issues.apache.org/jira/browse/HIVE-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236704#comment-16236704 ] 

Sergey Shelukhin commented on HIVE-17458:
-----------------------------------------

Left some comments. My two main questions are:
1) The patch mentions that non-split-update ACID data cannot be read in Hive 3. Wouldn't that mean that none of the legacy ACID data can be read? Reader compatibility should still be possible.
2) If there are only originals and no deltas, does it still activate the row ID machinery? It looks like that should be unnecessary.

> VectorizedOrcAcidRowBatchReader doesn't handle 'original' files
> ---------------------------------------------------------------
>
>                 Key: HIVE-17458
>                 URL: https://issues.apache.org/jira/browse/HIVE-17458
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 2.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Critical
>         Attachments: HIVE-17458.01.patch, HIVE-17458.02.patch, HIVE-17458.03.patch, HIVE-17458.04.patch, HIVE-17458.05.patch, HIVE-17458.06.patch, HIVE-17458.07.patch, HIVE-17458.07.patch, HIVE-17458.08.patch, HIVE-17458.09.patch, HIVE-17458.10.patch, HIVE-17458.11.patch, HIVE-17458.12.patch, HIVE-17458.12.patch, HIVE-17458.13.patch, HIVE-17458.14.patch, HIVE-17458.15.patch
>
>
> VectorizedOrcAcidRowBatchReader will not be used for original files.  This will likely look like a perf regression when converting a table from non-acid to acid until it runs through a major compaction.
> With Load Data support, if large files are added via Load Data, the read ops will not vectorize until major compaction.  
> There is no reason why this should be the case.  Just like OrcRawRecordMerger, VectorizedOrcAcidRowBatchReader can look at the other files in the logical tranche/bucket and calculate the offset for the RowBatch of the split.  (Presumably getRecordReader().getRowNumber() works the same in vector mode).
> In this case we don't even need OrcSplit.isOriginal(): the reader can infer it from the file path, which in particular simplifies OrcInputFormat.determineSplitStrategies().
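
The two ideas in the description (computing a row offset from the other files in the bucket, and inferring "original" from the file path) can be sketched roughly as follows. This is only an illustration, not code from the patch: the class and method names are hypothetical, the files of a bucket are assumed to be supplied in the order their rows should be numbered, and the directory prefixes base_, delta_ and delete_delta_ are assumed to be the ones that mark ACID directories.

```java
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Hive. */
final class OriginalFileSketch {

  /**
   * Number of rows that precede targetFile in its logical bucket.
   * Assumes rowCountsByFile iterates in the order in which the files
   * should be numbered (for example, an insertion-ordered map).
   */
  static long rowsBefore(String targetFile, Map<String, Long> rowCountsByFile) {
    long offset = 0;
    for (Map.Entry<String, Long> entry : rowCountsByFile.entrySet()) {
      if (entry.getKey().equals(targetFile)) {
        return offset;
      }
      offset += entry.getValue();
    }
    throw new IllegalArgumentException("File not in bucket: " + targetFile);
  }

  /**
   * Synthetic row id for a row of an original file: the rows in the
   * preceding files plus the row's position inside its own file.
   * rowNumberInFile stands in for whatever the underlying reader reports.
   */
  static long syntheticRowId(long rowsBefore, long rowNumberInFile) {
    return rowsBefore + rowNumberInFile;
  }

  /**
   * Naive path check: a file is treated as "original" when no path
   * component looks like an ACID base or delta directory. The prefixes
   * are an assumption about the directory layout, and a table or
   * partition directory starting with one of them would be misclassified.
   */
  static boolean looksOriginal(String path) {
    for (String part : path.split("/")) {
      if (part.startsWith("base_")
          || part.startsWith("delta_")
          || part.startsWith("delete_delta_")) {
        return false;
      }
    }
    return true;
  }
}
```

With per-file row counts of 100, 150 and 75, the third file starts at offset 250, so a split whose first row is row 10 of that file would begin at synthetic row id 260. How the real reader obtains the row counts, and in what order it visits the files, is left open here.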

