hive-dev mailing list archives

From "Hive QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
Date Thu, 30 Jan 2014 09:40:09 GMT

    [ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886446#comment-13886446 ]

Hive QA commented on HIVE-6287:
-------------------------------



{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12625925/HIVE-6287.3.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 4973 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_vectorization_ppd
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
{noformat}

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1109/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1109/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12625925

> batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-6287
>                 URL: https://issues.apache.org/jira/browse/HIVE-6287
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile, vectorization
>         Attachments: HIVE-6287.1.patch, HIVE-6287.2.patch, HIVE-6287.3.patch, HIVE-6287.3.patch, HIVE-6287.WIP.patch
>
>
> The nextBatch() method that computes batchSize is aware only of stripe boundaries. This
> does not work when predicate pushdown (PPD) in ORC is enabled, because PPD works at the
> row group level (a stripe contains multiple row groups). By default, the row group stride
> is 10000 rows. When PPD is enabled, some row groups may be eliminated, and the disk ranges
> are then computed from the selected row groups only. If the batchSize computation is not
> aware of this, it reads beyond the end of a disk range and throws a
> BufferUnderFlowException. The following scenario illustrates the problem:
> {code}
> |--------------------------------- STRIPE 1 ------------------------------------|
> |-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
>                 |--------- diskrange 1 ---------|               |- diskrange 2 -|
>                                                 ^
>                                              (marker)   
> {code}
> Disk range 1 holds 20000 rows and disk range 2 holds 10000 rows. Because nextBatch() was
> not aware of row groups, and hence of the disk ranges, it tries to read 1024 values at the
> end of disk range 1, where only 20000 % 1024 = 544 values remain. This results in a
> BufferUnderFlowException.
> To fix this, a marker is placed at the end of each disk range and batchSize is computed
> relative to that marker:
> {code}batchSize = Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));{code}
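
The arithmetic in the description can be checked with a small standalone sketch. This is not the code from the patch: the class BatchSizeSketch and the helper batchSizes() are hypothetical names made up for illustration, DEFAULT_SIZE is assumed to be 1024 as the description implies, and the row positions 10000 and 30000 assume that disk range 1 covers row groups 2 and 3 with the default stride of 10000.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the HIVE-6287 patch.
final class BatchSizeSketch {
    // Assumed to match VectorizedRowBatch.DEFAULT_SIZE (1024 per the description).
    static final int DEFAULT_SIZE = 1024;

    // Marker-aware batch size: never read past the end of the current disk range.
    static long batchSize(long rowInStripe, long markerPosition) {
        return Math.min(DEFAULT_SIZE, markerPosition - rowInStripe);
    }

    // Sizes of the successive batches needed to read rows [start, markerPosition).
    static List<Long> batchSizes(long start, long markerPosition) {
        List<Long> sizes = new ArrayList<>();
        long row = start;
        while (row < markerPosition) {
            long n = batchSize(row, markerPosition);
            sizes.add(n);
            row += n;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Disk range 1: row groups 2 and 3, i.e. rows 10000 (inclusive) to 30000 (exclusive).
        List<Long> sizes = batchSizes(10000, 30000);
        // Prints "20 batches, last = 544": 19 full batches of 1024, then the remainder.
        System.out.println(sizes.size() + " batches, last = " + sizes.get(sizes.size() - 1));
    }
}
```

Without the marker, the last batch would ask for 1024 rows when only 544 remain in the disk range, which is the over-read described above.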



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
