From: Joseph Stein
To: user@cassandra.apache.org
Date: Fri, 7 May 2010 08:49:01 -0400
Subject: Re: timeout while running simple hadoop job

The problem could be that you are crunching more data than can be
processed within the expiry interval setting.

In Hadoop you need to tell the task tracker that you are still doing
work, which is done by setting a status or incrementing a counter on
the Reporter object.

http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/

"In your Java code there is a little trick to help the job be “aware”
within the cluster of tasks that are not dead but just working hard.
During execution of a task there is no built-in reporting that the job
is running as expected if it is not writing output. This means that if
your tasks are taking up a lot of time doing work, it is possible the
cluster will see that task as failed (based on the
mapred.task.tracker.expiry.interval setting).

Have no fear, there is a way to tell the cluster that your task is
doing just fine. You have two ways to do this: you can either report
a status or increment a counter. Both of these will let the task
tracker know the task is OK, and this in turn gets seen by the job
tracker. Both of these options are explained in the JavaDoc:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html"
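Something like this, as a rough untested sketch (SlowMapper and the
counter group/name are made-up examples, using the old
org.apache.hadoop.mapred API that Reporter belongs to):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper that does expensive work per record and may
    // emit nothing for long stretches, so it reports progress itself.
    public class SlowMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output,
                        Reporter reporter) throws IOException {
            // ... long-running work that produces no output for a while ...

            // Option 1: report a status string; this counts as progress,
            // so the task tracker will not mark the task as expired.
            reporter.setStatus("still working on " + key);

            // Option 2: increment a counter; this also counts as progress.
            reporter.incrCounter("MyJob", "rows-processed", 1L);
        }
    }

If you are on the newer org.apache.hadoop.mapreduce API (which the
stack trace below suggests), the equivalents would be
context.setStatus(...), context.getCounter("MyJob",
"rows-processed").increment(1), or simply context.progress().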
Hope this helps

On Fri, May 7, 2010 at 4:47 AM, gabriele renzi wrote:
> Hi everyone,
>
> I am trying to develop a mapreduce job that does a simple
> selection+filter on the rows in our store.
> Of course it is mostly based on the WordCount example :)
>
> Sadly, while the app seems to run fine on a test keyspace with little
> data, when run on a larger test index (but still on a single node) I
> reliably see this error in the logs:
>
> 10/05/06 16:37:58 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.RuntimeException: TimedOutException()
>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:165)
>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:215)
>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:97)
>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:91)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>         at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
> Caused by: TimedOutException()
>         at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
>         at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
>         at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:142)
>         ... 11 more
>
> and after that the job seems to finish "normally", but no results are produced.
>
> FWIW this is on 0.6.0 (we didn't move to 0.6.1 yet because, well, if
> it ain't broke don't fix it).
>
> The single node has a data directory of about 127GB in two column
> families, of which the one used in the mapred job is about 100GB.
> The Cassandra server is run with 6GB of heap on a box with 8GB
> available and no swap enabled. Read/write latencies from cfstats are:
>
>         Read Latency: 0.8535837762577986 ms.
>         Write Latency: 0.028849603764075547 ms.
>
> The row cache is not enabled and the key cache percentage is the
> default. Load on the machine is basically zero when the job is not
> running.
>
> As my code is 99% that of the wordcount contrib, I should note that
> in 0.6.1's contrib (and trunk) there is a RING_DELAY constant that we
> can supposedly change, but it's apparently not used anywhere; as I
> said, though, running on a single node this should not be an issue
> anyway.
>
> Does anyone have suggestions, or has anyone seen this error before?
> On the other hand, have people run this kind of job in similar
> conditions flawlessly, so that I can consider it just my problem?
>
> Thanks in advance for any help.

--
/* Joe Stein
http://www.linkedin.com/in/charmalloc
*/