hive-dev mailing list archives

From "Eugene Koifman (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
Date Fri, 02 Jun 2017 01:46:04 GMT
Eugene Koifman created HIVE-16812:
-------------------------------------

             Summary: VectorizedOrcAcidRowBatchReader doesn't filter delete events
                 Key: HIVE-16812
                 URL: https://issues.apache.org/jira/browse/HIVE-16812
             Project: Hive
          Issue Type: Improvement
          Components: Transactions
    Affects Versions: 2.3.0
            Reporter: Eugene Koifman
            Assignee: Eugene Koifman


The constructor of VectorizedOrcAcidRowBatchReader has
{noformat}
    // Clone readerOptions for deleteEvents.
    Reader.Options deleteEventReaderOptions = readerOptions.clone();
    // Set the range on the deleteEventReaderOptions to 0 to INTEGER_MAX because
    // we always want to read all the delete delta files.
    deleteEventReaderOptions.range(0, Long.MAX_VALUE);
{noformat}

This is suboptimal, since base and delta files are sorted by ROW__ID.  For each split of the base,
we can find the min/max ROW__ID and load only those delete events from the delete deltas that fall
in the [min, max] range.  This will reduce the number of delete events we load into memory (to no
more than there are rows in the split).
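
As a rough illustration of the idea, not the actual Hive code, the sketch below filters a sorted list of delete events down to a split's [min, max] key range. The names DeleteEventRangeFilter, RowKey and filterToSplitRange are hypothetical, and RowKey is a simplified stand-in for ROW__ID, assumed here to compare as (transactionId, bucketId, rowId).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Hive.
final class DeleteEventRangeFilter {

    // Simplified stand-in for ROW__ID (assumption: ordered by
    // transactionId, then bucketId, then rowId).
    static final class RowKey implements Comparable<RowKey> {
        final long transactionId;
        final int bucketId;
        final long rowId;

        RowKey(long transactionId, int bucketId, long rowId) {
            this.transactionId = transactionId;
            this.bucketId = bucketId;
            this.rowId = rowId;
        }

        @Override
        public int compareTo(RowKey other) {
            int c = Long.compare(transactionId, other.transactionId);
            if (c != 0) {
                return c;
            }
            c = Integer.compare(bucketId, other.bucketId);
            if (c != 0) {
                return c;
            }
            return Long.compare(rowId, other.rowId);
        }
    }

    // Keeps only the delete events whose key lies in [min, max].
    // The input must already be sorted by key, which lets the scan
    // stop at the first event past max.
    static List<RowKey> filterToSplitRange(List<RowKey> sortedDeleteEvents,
                                           RowKey min, RowKey max) {
        List<RowKey> kept = new ArrayList<>();
        for (RowKey key : sortedDeleteEvents) {
            if (key.compareTo(max) > 0) {
                break; // sorted input: no later event can be in range
            }
            if (key.compareTo(min) >= 0) {
                kept.add(key);
            }
        }
        return kept;
    }
}
```

A real implementation would presumably push the range down to the ORC reader rather than filter after reading, so that out-of-range events are never loaded at all.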

When we support sorting on a primary key (PK), the same should apply, but we'd need to make sure
PKs are stored in the ORC index.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
