Date: Fri, 24 Jan 2014 02:39:38 +0000 (UTC)
From: "Prasanth J (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Resolved] (HIVE-6272) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

     [ https://issues.apache.org/jira/browse/HIVE-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth J resolved HIVE-6272.
------------------------------
    Resolution: Duplicate

Duplicate of HIVE-6287. This issue was created in duplicate because JIRA was flaky yesterday.

> batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-6272
>                 URL: https://issues.apache.org/jira/browse/HIVE-6272
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>
> The nextBatch() method, which computes batchSize, is only aware of stripe boundaries. This does not work when predicate pushdown (PPD) is enabled in ORC, because PPD works at the row group level (a stripe contains multiple row groups). By default, the row group stride is 10000 rows. When PPD is enabled, some row groups may be eliminated, and the disk ranges are then computed from the selected row groups only. If the batchSize computation is not aware of this, it leads to a BufferUnderFlowException (reading beyond the disk range). The following scenario illustrates the problem:
> {code}
> |--------------------------------- STRIPE 1 ------------------------------------|
> |-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
> |--------- diskrange 1 ---------|               |- diskrange 2 -|
>                                 ^
>                             (marker)
> {code}
> diskrange 1 has 20000 rows and diskrange 2 has 10000 rows. Because nextBatch() was not aware of row groups, and hence of the disk ranges, it tries to read 1024 values at the end of diskrange 1, where it should read only 20000 % 1024 = 544 values. This results in a BufferUnderFlowException.
> To fix this, a marker is placed at the end of each range and batchSize is computed accordingly:
> {code}batchSize = Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));{code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
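
The marker-based computation quoted above can be sketched in isolation as follows. This is a minimal illustration, not the actual ORC reader code: the class BatchSizeSketch, the helper batchSizesForRange, and the local DEFAULT_SIZE constant (standing in for VectorizedRowBatch.DEFAULT_SIZE, taken to be 1024 as in the description) are hypothetical names, and rows are assumed to be numbered from 0 within the stripe, with markerPosition being the row index just past the end of the current disk range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Hive or ORC.
final class BatchSizeSketch {
    // Assumed stand-in for VectorizedRowBatch.DEFAULT_SIZE.
    static final int DEFAULT_SIZE = 1024;

    // Formula from the issue: never read past the marker that ends the range.
    static long computeBatchSize(long rowInStripe, long markerPosition) {
        return Math.min(DEFAULT_SIZE, markerPosition - rowInStripe);
    }

    // Lists the batch sizes produced while reading one disk range
    // from rangeStart (inclusive) up to markerPosition (exclusive).
    static List<Long> batchSizesForRange(long rangeStart, long markerPosition) {
        List<Long> sizes = new ArrayList<>();
        long rowInStripe = rangeStart;
        while (rowInStripe < markerPosition) {
            long batchSize = computeBatchSize(rowInStripe, markerPosition);
            sizes.add(batchSize);
            rowInStripe += batchSize;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // diskrange 1 from the scenario: rows 0..19999, marker at 20000.
        List<Long> sizes = batchSizesForRange(0, 20000);
        // prints "20 batches, last = 544"
        System.out.println(sizes.size() + " batches, last = " + sizes.get(sizes.size() - 1));
    }
}
```

With the marker at row 20000, the first 19 batches are full (19 * 1024 = 19456 rows) and the final batch is capped at the remaining 544 rows, so the reader stops exactly at the end of the disk range instead of requesting 1024 values.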