From: Bartłomiej Romański (JIRA)
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Date: Fri, 18 May 2012 17:09:10 +0000 (UTC)
Message-ID: <1206312509.14894.1337360950849.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <197316129.14891.1337360950721.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (CASSANDRA-4259) Bug in SSTableReader.getSampleIndexesForRanges(...)
 causes uneven InputSplits generation for Hadoop mappers

     [ https://issues.apache.org/jira/browse/CASSANDRA-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bartłomiej Romański updated CASSANDRA-4259:
-------------------------------------------

    Description:
Running a simple MapReduce job on a Cassandra column family creates multiple small mappers for one half of the ring and one big mapper for the other half. The upper part (85... - 0) is cut into smaller slices. The lower part (0 - 85...) generates one big input slice. Having one mapper process half of the ring is hugely inefficient. Also, the progress meter for this mapper is incorrect: it goes to 100% in a couple of seconds, then stays at 100% for an hour or two.

I've investigated the problem a bit. I think it is related to incorrect output of 'nodetool rangekeysample'. On the node responsible for part (0 - 85...) the output is empty! On the other node it works fine.

I think the bug is in SSTableReader.getSampleIndexesForRanges(...). These two lines:

    RowPosition leftPosition = range.left.maxKeyBound();
    RowPosition rightPosition = range.left.maxKeyBound();

should be changed to:

    RowPosition leftPosition = range.left.maxKeyBound();
    RowPosition rightPosition = range.right.maxKeyBound();

After that fix, the output of nodetool is correct and the whole ring is split into small mappers.

The other half of the ring works fine because of an extra 'if' in the code:

    int right = Range.isWrapAround(range.left, range.right)...

As a result, the bug does not show up in a one-node cluster or in the "last" ring partition in multi-node clusters.

Can anyone look at it and verify my thoughts? I'm rather new to Cassandra.

  was:
Running a simple mapreduce job on cassandra column family results in creating multiple small mappers for one half of the ring and one big mapper for the other half. Upper part (85... - 0) is cut into smaller slices. Lower part (0 - 85...) generates one big input slice. One mapper processing half of the ring causes huge inefficiency. Also the progress meter for this mapper is incorrect - it goes to 100% in a couple of second that stays at 100% for an hour or two.

I've investigated the problem a bit. I think it is related to incorrect output of 'nodetool rangekeysample'. On the node resposible for part (0 - 85...) the output is empty! On the other node it works fine.

I think the bug is in SSTableReader.getSampleIndexesForRanges(...). This to lines:

    RowPosition leftPosition = range.left.maxKeyBound();
    RowPosition rightPosition = range.left.maxKeyBound();

should be changed to:

    RowPosition leftPosition = range.left.maxKeyBound();
    RowPosition rightPosition = range.right.maxKeyBound();

After that fix the output of nodetool is correct and the whole ring is split into small mappers.

The other half of the ring works fine because of extra 'if' in the code:

    int right = Range.isWrapAround(range.left, range.right)...

This causes that the bug does not show up in one-node cluster or in the "last" ring partition in muli-node clusters.

Can anyone look at it and verify my thoughts? I'm rather new to Cassandra.

> Bug in SSTableReader.getSampleIndexesForRanges(...) causes uneven InputSplits generation for Hadoop mappers
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4259
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4259
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 1.1.0
>         Environment: Small cassandra cluster with 2 nodes.
>                      Version 1.1.0.
>                      Tokens: 0, 85070591730234615865843651857942052864
>                      Hadoop 1.0.1 and Pig 0.10.0.
>            Reporter: Bartłomiej Romański
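
The diagnosis in the description can be reproduced outside Cassandra with a small model. The sketch below is not Cassandra's code: the class `SampleRangeSketch`, its methods, the `buggy` flag, and the toy tokens (0 and 50 instead of 0 and 85...) are all assumptions made for illustration. It only mimics the behaviour the report describes: a range is treated as (left, right], both bounds are located in a sorted key sample by binary search, and a wrap-around range takes the last sample as its right end without consulting the right bound.

```java
import java.util.Arrays;

// Toy model only: NOT Cassandra's implementation. All names here are hypothetical.
final class SampleRangeSketch {

    // Number of samples <= key, i.e. the index of the first sample strictly greater than key.
    static int upperBound(long[] sortedSamples, long key) {
        int lo = 0, hi = sortedSamples.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedSamples[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Returns {first, last} sample indexes covered by the range (left, right], or null if none.
    // With buggy == true the right bound is derived from `left`, as in the two lines quoted in the report.
    static int[] sampleIndexes(long[] sortedSamples, long left, long right, boolean buggy) {
        long leftBound = left;
        long rightBound = buggy ? left : right;

        // Ranges are start-exclusive: begin at the first sample strictly greater than left.
        int first = upperBound(sortedSamples, leftBound);
        if (first == sortedSamples.length) return null;

        // Assumed wrap-around rule for this model: left >= right means the range wraps.
        boolean wraps = left >= right;
        // A wrapping range ignores rightBound entirely, which is why the bug stays hidden there.
        int last = wraps ? sortedSamples.length - 1 : upperBound(sortedSamples, rightBound) - 1;
        return first > last ? null : new int[] {first, last};
    }

    public static void main(String[] args) {
        long[] samples = {5, 20, 35, 60, 75, 90};
        System.out.println(Arrays.toString(sampleIndexes(samples, 0, 50, true)));   // null
        System.out.println(Arrays.toString(sampleIndexes(samples, 0, 50, false)));  // [0, 2]
        System.out.println(Arrays.toString(sampleIndexes(samples, 50, 0, true)));   // [3, 5]
    }
}
```

In this model, deriving both bounds from the left token makes the non-wrapping range (0, 50] select no samples at all, while the wrap-around range (50, 0] returns the same indexes with or without the fix. That matches the report: 'nodetool rangekeysample' is empty only on the node that owns the lower half of the ring.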