hive-issues mailing list archives

From "Eugene Koifman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-17458) VectorizedOrcAcidRowBatchReader doesn't handle 'original' files
Date Wed, 06 Sep 2017 01:38:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-17458:
----------------------------------
    Description: 
VectorizedOrcAcidRowBatchReader will not be used for original files.  This will likely look
like a perf regression when converting a table from non-acid to acid until it runs through
a major compaction.

With Load Data support, if large files are added via Load Data, the read ops will not vectorize
until major compaction.  

There is no reason why this should be the case.  Just like OrcRawRecordMerger, VectorizedOrcAcidRowBatchReader
can look at the other files in the logical tranche/bucket and calculate the offset for the
RowBatch of the split.  (Presumably getRecordReader().getRowNumber() works the same in vector
mode).
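
As a rough illustration of the offset calculation, here is a minimal, self-contained sketch.  It does not use Hive or ORC classes: OriginalFile, firstRowIdOfFile and firstRowIdOfSplit are hypothetical names, and ordering the files of a bucket by name is an assumption made for the example.  The rowNumberInFile argument stands in for whatever getRecordReader().getRowNumber() reports at the start of the split.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch only: derives a bucket-wide row offset for a split of an 'original' file. */
final class OriginalFileRowOffset {

    /** Hypothetical stand-in for one original file of a bucket: its name and row count. */
    static final class OriginalFile {
        final String name;
        final long rowCount;

        OriginalFile(String name, long rowCount) {
            this.name = name;
            this.rowCount = rowCount;
        }
    }

    /**
     * Sums the row counts of all files that sort before the target file.
     * Assumption: files of a bucket are ordered by name.
     */
    static long firstRowIdOfFile(List<OriginalFile> bucketFiles, String targetName) {
        List<OriginalFile> sorted = new ArrayList<>(bucketFiles);
        sorted.sort(Comparator.comparing((OriginalFile f) -> f.name));
        long offset = 0;
        for (OriginalFile f : sorted) {
            if (f.name.equals(targetName)) {
                return offset;
            }
            offset += f.rowCount;
        }
        throw new IllegalArgumentException("file not in bucket: " + targetName);
    }

    /** Adds the row number at which the split starts inside its own file. */
    static long firstRowIdOfSplit(List<OriginalFile> bucketFiles, String targetName,
                                  long rowNumberInFile) {
        return firstRowIdOfFile(bucketFiles, targetName) + rowNumberInFile;
    }
}
```

In this sketch each row of the batch would then get the split's first row id plus its position within the split.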

In this case we don't even need OrcSplit.isOriginal() - the reader can infer it from file
path... which in particular simplifies OrcInputFormat.determineSplitStrategies()
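
A minimal sketch of inferring this from the path, assuming the usual acid layout where acid files sit directly under a base_, delta_ or delete_delta_ directory and original files do not.  AcidPathSketch and looksOriginal are hypothetical names, not Hive API, and only the immediate parent directory is checked, so files nested in other subdirectories would need extra handling.

```java
/** Sketch only: guesses from the path whether a file is an 'original' (pre-acid) file. */
final class AcidPathSketch {

    /** Returns true when the file's parent directory is not an acid base/delta directory. */
    static boolean looksOriginal(String filePath) {
        int lastSlash = filePath.lastIndexOf('/');
        if (lastSlash <= 0) {
            // No parent directory component to inspect.
            return true;
        }
        String parentPath = filePath.substring(0, lastSlash);
        // Keep only the last component of the parent path, i.e. the directory name.
        String parentName = parentPath.substring(parentPath.lastIndexOf('/') + 1);
        return !(parentName.startsWith("base_")
            || parentName.startsWith("delta_")
            || parentName.startsWith("delete_delta_"));
    }
}
```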

  was:
VectorizedOrcAcidRowBatchReader will not be used for original files.  This will likely look
like a perf regression when converting a table from non-acid to acid until it runs through
a major compaction.

With Load Data support, if large files are added via Load Data, the read ops will not vectorize
until major compaction.  

There is no reason why this should be the case.  Just like OrcRawRecordMerger, VectorizedOrcAcidRowBatchReader
can look at the other files in the logical tranche/bucket and calculate the offset for the
RowBatch of the split.  (Presumably getRecordReader().getRowNumber() works the same in vector
mode).


> VectorizedOrcAcidRowBatchReader doesn't handle 'original' files
> ---------------------------------------------------------------
>
>                 Key: HIVE-17458
>                 URL: https://issues.apache.org/jira/browse/HIVE-17458
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 2.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>
> VectorizedOrcAcidRowBatchReader will not be used for original files.  This will likely
> look like a perf regression when converting a table from non-acid to acid until it runs through
> a major compaction.
> With Load Data support, if large files are added via Load Data, the read ops will not
> vectorize until major compaction.
> There is no reason why this should be the case.  Just like OrcRawRecordMerger, VectorizedOrcAcidRowBatchReader
> can look at the other files in the logical tranche/bucket and calculate the offset for the
> RowBatch of the split.  (Presumably getRecordReader().getRowNumber() works the same in vector
> mode).
> In this case we don't even need OrcSplit.isOriginal() - the reader can infer it from
> file path... which in particular simplifies OrcInputFormat.determineSplitStrategies()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
