cassandra-commits mailing list archives

From Bartłomiej Romański (JIRA) <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-4259) Bug in SSTableReader.getSampleIndexesForRanges(...) causes uneven InputSplits generation for Hadoop mappers
Date Fri, 18 May 2012 17:09:10 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bartłomiej Romański updated CASSANDRA-4259:
-------------------------------------------

    Description: 
Running a simple mapreduce job on a Cassandra column family creates multiple small
mappers for one half of the ring and one big mapper for the other half. The upper part (85...
- 0) is cut into smaller slices. The lower part (0 - 85...) generates one big input slice. One
mapper processing half of the ring is hugely inefficient. Also, the progress meter for this
mapper is incorrect: it goes to 100% in a couple of seconds, then stays at 100% for an hour
or two.

I've investigated the problem a bit. I think it is related to incorrect output of 'nodetool
rangekeysample'. On the node responsible for part (0 - 85...) the output is empty! On the other
node it works fine.

I think the bug is in SSTableReader.getSampleIndexesForRanges(...). These two lines:

   RowPosition leftPosition = range.left.maxKeyBound();
   RowPosition rightPosition = range.left.maxKeyBound();

should be changed to:

   RowPosition leftPosition = range.left.maxKeyBound();
   RowPosition rightPosition = range.right.maxKeyBound();

After that fix, the output of nodetool is correct and the whole ring is split into small mappers.
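
To illustrate why using range.left for both bounds yields an empty sample, here is a minimal
sketch. It is a simplified model, not the Cassandra source: tokens are plain longs, and the class
SampleIndexSketch, the method sampleIndexes and the 'buggy' flag are hypothetical names
introduced only for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SampleIndexSketch {
    /**
     * Returns {first, last} indexes of the sorted samples that fall inside (left, right],
     * or null when no sample falls in the range.
     * buggy = true derives both bounds from 'left', mimicking the reported typo.
     */
    public static int[] sampleIndexes(List<Long> samples, long left, long right, boolean buggy) {
        Long leftBound = left;
        Long rightBound = buggy ? left : right;

        // First sample strictly greater than the left bound (range start is exclusive).
        int lo = Collections.binarySearch(samples, leftBound);
        lo = lo < 0 ? -(lo + 1) : lo + 1;
        if (lo == samples.size()) {
            return null; // left bound is past the last sample
        }

        // Last sample less than or equal to the right bound (range end is inclusive).
        int hi = Collections.binarySearch(samples, rightBound);
        if (hi < 0) {
            hi = -(hi + 1) - 1;
        }
        if (hi < 0 || lo > hi) {
            return null; // empty range
        }
        return new int[] {lo, hi};
    }

    public static void main(String[] args) {
        List<Long> samples = Arrays.asList(10L, 20L, 30L, 40L, 50L);
        // Both bounds taken from 'left': the range collapses and nothing is returned.
        System.out.println(Arrays.toString(sampleIndexes(samples, 0L, 85L, true)));  // null
        // Right bound taken from 'right': every sample in (0, 85] is covered.
        System.out.println(Arrays.toString(sampleIndexes(samples, 0L, 85L, false))); // [0, 4]
    }
}
```

In this model the buggy variant can never return a non-empty result for a non-wrapping range,
because the last index at or below the left bound is always smaller than the first index above it.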

The other half of the ring works fine because of an extra 'if' in the code:

   int right = Range.isWrapAround(range.left, range.right)...

Because of this, the bug does not show up in a one-node cluster or in the "last" ring partition
in multi-node clusters.
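
The masking effect can be sketched as follows. This is again a simplified model with long tokens,
not the actual Cassandra code: WrapAroundSketch, wraps and buggySampleIndexes are hypothetical
names, and treating a range as wrapping when left >= right is an assumption made for the example.
The point is only that a wrapping range never consults the (wrong) right bound.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class WrapAroundSketch {
    /** Assumption for this sketch: a range wraps around the ring when left >= right. */
    public static boolean wraps(long left, long right) {
        return left >= right;
    }

    /** Models the reported typo: the right bound is derived from 'left'. */
    public static int[] buggySampleIndexes(List<Long> samples, long left, long right) {
        Long leftBound = left;
        Long rightBound = left; // the typo: should be derived from 'right'

        // First sample strictly greater than the left bound.
        int lo = Collections.binarySearch(samples, leftBound);
        lo = lo < 0 ? -(lo + 1) : lo + 1;
        if (lo == samples.size()) {
            return null;
        }

        // A wrapping range runs to the last sample, so rightBound is ignored here.
        int hi = wraps(left, right)
               ? samples.size() - 1
               : Collections.binarySearch(samples, rightBound);
        if (hi < 0) {
            hi = -(hi + 1) - 1;
        }
        if (hi < 0 || lo > hi) {
            return null;
        }
        return new int[] {lo, hi};
    }

    public static void main(String[] args) {
        List<Long> samples = Arrays.asList(10L, 20L, 30L, 40L, 50L);
        // Wrapping range (25, 0]: the typo is masked and samples are still returned.
        System.out.println(Arrays.toString(buggySampleIndexes(samples, 25L, 0L))); // [2, 4]
        // Non-wrapping range (0, 25]: the typo makes the result empty.
        System.out.println(Arrays.toString(buggySampleIndexes(samples, 0L, 25L))); // null
    }
}
```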

Can anyone look at it and verify my thoughts? I'm rather new to Cassandra.


  was:
Running a simple mapreduce job on cassandra column family results in creating multiple small
mappers for one half of the ring and one big mapper for the other half. Upper part (85...
- 0) is cut into smaller slices. Lower part (0 - 85...) generates one big input slice. One
mapper processing half of the ring causes huge inefficiency. Also the progress meter for this
mapper is incorrect - it goes to 100% in a couple of second that stays at 100% for an hour
or two.

I've investigated the problem a bit. I think it is related to incorrect output of 'nodetool
rangekeysample'. On the node resposible for part (0 - 85...) the output is empty! On the other
node it works fine.

I think the bug is in SSTableReader.getSampleIndexesForRanges(...). This to lines:

   RowPosition leftPosition = range.left.maxKeyBound();
   RowPosition rightPosition = range.left.maxKeyBound();

should be changed to:

   RowPosition leftPosition = range.left.maxKeyBound();
   RowPosition rightPosition = range.right.maxKeyBound();

After that fix the output of nodetool is correct and the whole ring is split into small mappers.

The other half of the ring works fine because of extra 'if' in the code:

   int right = Range.isWrapAround(range.left, range.right)...

This causes that the bug does not show up in one-node cluster or in the "last" ring partition
in muli-node clusters.

Can anyone look at it and verify my thoughts? I'm rather new to Cassandra.


    
> Bug in SSTableReader.getSampleIndexesForRanges(...) causes uneven InputSplits generation for Hadoop mappers
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4259
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4259
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 1.1.0
>         Environment: Small cassandra cluster with 2 nodes. Version 1.1.0. 
> Tokens: 0, 85070591730234615865843651857942052864
> Hadoop 1.0.1 and Pig 0.10.0.
>            Reporter: Bartłomiej Romański
