Delivered-To: mailing list user@cassandra.apache.org
Subject: Re: timeout while running simple hadoop job
From: Matt Revelle <mrevelle@gmail.com>
Date: Fri, 7 May 2010 08:53:25 -0400
To: user@cassandra.apache.org

There's also the mapred.task.timeout property that can be tweaked. But reporting is the correct way to fix timeouts during execution.

On May 7, 2010, at 8:49 AM, Joseph Stein wrote:

> The problem could be that you are crunching more data than will be
> completed within the interval expire setting.
>
> In Hadoop you need to kind of tell the task tracker that you are still
> doing stuff, which is done by setting status or incrementing a counter on
> the Reporter object.
>
> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
>
> "In your Java code there is a little trick to help the job be “aware”
> within the cluster of tasks that are not dead but just working hard.
> During execution of a task there is no built-in reporting that the job
> is running as expected if it is not writing out.
> So this means that if your tasks are taking up a lot of time doing work,
> it is possible the cluster will see that task as failed (based on the
> mapred.task.tracker.expiry.interval setting).
>
> Have no fear, there is a way to tell the cluster that your task is doing
> just fine. You have two ways to do this: you can either report the status
> or increment a counter. Both of these will cause the task tracker to
> properly know the task is ok, and this will get seen by the jobtracker
> in turn. Both of these options are explained in the JavaDoc:
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html"
>
> Hope this helps
>
> On Fri, May 7, 2010 at 4:47 AM, gabriele renzi wrote:
>> Hi everyone,
>>
>> I am trying to develop a mapreduce job that does a simple
>> selection+filter on the rows in our store.
>> Of course it is mostly based on the WordCount example :)
>>
>> Sadly, while it seems the app runs fine on a test keyspace with little
>> data, when run on a larger test index (but still on a single node) I
>> reliably see this error in the logs:
>>
>> 10/05/06 16:37:58 WARN mapred.LocalJobRunner: job_local_0001
>> java.lang.RuntimeException: TimedOutException()
>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:165)
>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:215)
>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:97)
>>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:91)
>>     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>>     at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
>> Caused by: TimedOutException()
>>     at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
>>     at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
>>     at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:142)
>>     ... 11 more
>>
>> and after that the job seems to finish "normally", but no results are produced.
>>
>> FWIW this is on 0.6.0 (we didn't move to 0.6.1 yet because, well, if
>> it ain't broke don't fix it).
>>
>> The single node has a data directory of about 127GB in two column
>> families, of which the one used in the mapred job is about 100GB.
>> The cassandra server is run with 6GB of heap on a box with 8GB
>> available and no swap enabled. Read/write latencies from cfstats are:
>>
>> Read Latency: 0.8535837762577986 ms.
>> Write Latency: 0.028849603764075547 ms.
>>
>> Row cache is not enabled; key cache percentage is the default. Load on the
>> machine is basically zero when the job is not running.
>>
>> As my code is 99% that from the wordcount contrib, I should note that
>> in 0.6.1's contrib (and trunk) there is a RING_DELAY constant that we
>> can supposedly change, but it's apparently not used anywhere; as I
>> said, running on a single node this should not be an issue anyway.
>>
>> Does anyone have suggestions, or has anyone seen this error before?
>> On the other hand, have people run this kind of job in similar conditions
>> flawlessly, so that I can consider it just my problem?
>>
>> Thanks in advance for any help.
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
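[Editor's note] The heartbeat pattern both replies describe can be sketched roughly as below. The Reporter interface here is a stand-in for Hadoop's org.apache.hadoop.mapred.Reporter (with the new mapreduce API you would call context.setStatus() or context.progress() on the Mapper.Context instead); the row type and the 1000-row reporting interval are illustrative, not taken from the thread.

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatSketch {

    // Stand-in for Hadoop's Reporter: only the two calls discussed above.
    interface Reporter {
        void setStatus(String status);                       // tells the task tracker the task is alive
        void incrCounter(String group, String name, long amount);
    }

    // Process a long batch of rows, pinging the reporter every 1000 rows
    // so a busy-but-silent task is not declared dead by the task tracker.
    static long processRows(List<String> rows, Reporter reporter) {
        long processed = 0;
        for (String row : rows) {
            // ... expensive per-row work would go here ...
            processed++;
            if (processed % 1000 == 0) {
                reporter.setStatus("processed " + processed + " rows");
                reporter.incrCounter("job", "rows", 1000);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 5000; i++) rows.add("row-" + i);

        final long[] pings = {0};
        Reporter reporter = new Reporter() {
            public void setStatus(String status) { pings[0]++; }
            public void incrCounter(String g, String n, long a) { }
        };

        long n = processRows(rows, reporter);
        System.out.println(n + " rows, " + pings[0] + " status pings");
        // prints: 5000 rows, 5 status pings
    }
}
```

In a real ColumnFamilyRecordReader-fed mapper the same call would go inside the map() method, so long slices of Cassandra rows keep refreshing the task's liveness instead of silently running past the expiry interval.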
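[Editor's note] Matt's mapred.task.timeout suggestion, sketched as a config fragment for Hadoop 0.20-era clusters: the property defaults to 600000 ms (10 minutes), and the 1800000 value below is only an example, not a recommendation from the thread. It can also be set per job on the JobConf.

```xml
<!-- mapred-site.xml: raise the per-task liveness timeout from the
     default 600000 ms (10 min). A value of 0 disables the timeout
     entirely, which also hides genuinely hung tasks, so prefer a
     finite value over 0. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```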