hadoop-common-issues mailing list archives

From "Julien Serdaru (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-9530) DBInputSplit creates invalid ranges on Oracle
Date Wed, 01 May 2013 02:37:13 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-9530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Serdaru updated HADOOP-9530:
-----------------------------------

    Description: 
The DBInputFormat on Oracle does not create valid ranges.

The getSplits method of DBInputFormat, at line 263, is as follows:

          split = new DBInputSplit(i * chunkSize, (i * chunkSize) + chunkSize);

So the first split will have a start value of 0 (0*chunkSize).
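
To illustrate, here is a tiny self-contained sketch of that arithmetic (the row count, the number of chunks and the class name are made up; this is not the Hadoop source):

  // Toy sketch of the split arithmetic described above (all values hypothetical).
  public class SplitSketch {
    public static void main(String[] args) {
      long count = 100;                 // assumed total number of rows
      int chunks = 4;                   // assumed number of splits / map tasks
      long chunkSize = count / chunks;
      for (int i = 0; i < chunks; i++) {
        long start = i * chunkSize;     // i == 0  ->  start == 0
        long end = (i * chunkSize) + chunkSize;
        System.out.println("split " + i + ": start=" + start + ", end=" + end);
      }
    }
  }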

However, line 84 of the OracleDBRecordReader is as follows:

      if (split.getLength() > 0 && split.getStart() > 0){

Since the start value of the first range is equal to 0, we skip the block that partitions
the input set. As a result, one of the map tasks will process the entire data set rather than
just its partition.

I'm assuming the fix is trivial and would involve removing the second check in the if block.
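
For reference, here is a rough paraphrase of what the paging logic would look like with that condition dropped (the class, method and SQL text below are approximations, not the actual Hadoop source):

  // Sketch of the paging-query construction with the proposed fix: only the
  // length is checked, so the first split (start == 0) is paged as well.
  public class PagingQuerySketch {
    static String wrap(String inputQuery, long start, long length) {
      StringBuilder query = new StringBuilder();
      if (length > 0) {   // previously: length > 0 && start > 0
        query.append("SELECT * FROM (SELECT a.*, ROWNUM dbif_rno FROM ( ");
        query.append(inputQuery);
        query.append(" ) a WHERE rownum <= ").append(start + length);
        // the comparison operator here is the subject of the second issue below
        query.append(" ) WHERE dbif_rno >= ").append(start);
      } else {
        query.append(inputQuery);       // unpartitioned fallback
      }
      return query.toString();
    }

    public static void main(String[] args) {
      // first split: start 0, length 25 -- now paged instead of reading everything
      System.out.println(wrap("SELECT id, name FROM employees", 0, 25));
    }
  }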

Also, I believe the OracleDBRecordReader paging query is incorrect.

Line 92 should read:

  query.append(" ) WHERE dbif_rno > ").append(split.getStart());

instead of the current line (note the use of > rather than >=):

  query.append(" ) WHERE dbif_rno >= ").append(split.getStart());

Otherwise some rows will be ignored and some counted more than once.
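
To make the off-by-one concrete, here is a small self-contained check (it assumes a chunkSize of 5, splits [0,5) and [5,10), and Oracle's 1-based ROWNUM numbering) that lists which row numbers each split selects under both operators:

  // Toy check of the boundary behaviour; split boundaries and row count assumed.
  public class BoundarySketch {
    static String rowsSelected(String op, long start, long end) {
      StringBuilder rows = new StringBuilder();
      for (long rno = 1; rno <= 10; rno++) {            // dbif_rno is ROWNUM, 1-based
        boolean aboveStart = op.equals(">") ? rno > start : rno >= start;
        if (aboveStart && rno <= end) {
          rows.append(rno).append(' ');
        }
      }
      return rows.toString().trim();
    }

    public static void main(String[] args) {
      // current operator: row 5 is returned by both splits
      System.out.println("dbif_rno >= : split 0 -> rows " + rowsSelected(">=", 0, 5));
      System.out.println("dbif_rno >= : split 1 -> rows " + rowsSelected(">=", 5, 10));
      // proposed operator: the splits are disjoint and cover each row exactly once
      System.out.println("dbif_rno >  : split 0 -> rows " + rowsSelected(">", 0, 5));
      System.out.println("dbif_rno >  : split 1 -> rows " + rowsSelected(">", 5, 10));
    }
  }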

A map/reduce job that counts the number of rows based on a predicate will highlight the incorrect
behavior.
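
A rough sketch of such a job follows (the table, column and class names are hypothetical, and the JDBC URL and the number-of-maps property are only examples; only the overall DBInputFormat wiring is intended to be accurate). The idea is to run it with more than one map task and compare the job's count against a plain SELECT COUNT(*) with the same predicate:

  import java.io.IOException;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.SQLException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
  import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
  import org.apache.hadoop.mapreduce.lib.db.DBWritable;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class RowCountRepro {

    // Minimal record class: we only need to know that a row exists.
    // (Implement Writable as well if your Hadoop version requires it.)
    public static class EmpRecord implements DBWritable {
      public void readFields(ResultSet rs) throws SQLException { /* nothing kept */ }
      public void write(PreparedStatement ps) throws SQLException { /* read-only */ }
    }

    public static class CountMapper
        extends Mapper<LongWritable, EmpRecord, Text, LongWritable> {
      private static final Text KEY = new Text("rows");
      private static final LongWritable ONE = new LongWritable(1);
      protected void map(LongWritable key, EmpRecord value, Context ctx)
          throws IOException, InterruptedException {
        ctx.write(KEY, ONE);                 // one count per input row
      }
    }

    public static class SumReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {
      protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
          throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
          sum += v.get();
        }
        // should equal SELECT COUNT(*) FROM EMPLOYEES WHERE SALARY > 0
        ctx.write(key, new LongWritable(sum));
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      DBConfiguration.configureDB(conf, "oracle.jdbc.driver.OracleDriver",
          "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
      conf.setInt("mapred.map.tasks", 4);    // ask for several splits (1.x property name)
      Job job = new Job(conf, "db-row-count");
      job.setJarByClass(RowCountRepro.class);
      job.setMapperClass(CountMapper.class);
      job.setReducerClass(SumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(LongWritable.class);
      job.setInputFormatClass(DBInputFormat.class);
      // count rows matching a predicate; with the bugs above the result is wrong
      DBInputFormat.setInput(job, EmpRecord.class,
          "EMPLOYEES", "SALARY > 0", "ID", "ID");
      FileOutputFormat.setOutputPath(job, new Path(args[0]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }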



> DBInputSplit creates invalid ranges on Oracle
> ---------------------------------------------
>
>                 Key: HADOOP-9530
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9530
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 1.1.2
>            Reporter: Julien Serdaru
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
