Delivered-To: mailing list user@cassandra.apache.org
Subject: Re: timeout while running simple hadoop job
From: Matt Revelle <mrevelle@gmail.com>
Date: Fri, 7 May 2010 08:53:25 -0400
To: user@cassandra.apache.org

There's also the mapred.task.timeout property that can be tweaked. But reporting is the correct way to fix timeouts during execution.

On May 7, 2010, at 8:49 AM, Joseph Stein wrote:

> The problem could be that you are crunching more data than will be
> completed within the interval expire setting.
>
> In Hadoop you need to kind of tell the task tracker that you are still
> doing stuff, which is done by setting status or incrementing a counter on
> the Reporter object.
>
> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
>
> "In your Java code there is a little trick to help the job be “aware”
> within the cluster of tasks that are not dead but just working hard.
> During execution of a task there is no built-in reporting that the job
> is running as expected if it is not writing out.
> So this means that if your tasks are taking up a lot of time doing work,
> it is possible the cluster will see that task as failed (based on the
> mapred.task.tracker.expiry.interval setting).
>
> Have no fear, there is a way to tell the cluster that your task is doing
> just fine. You have two ways to do this: you can either report the status
> or increment a counter. Both of these will cause the task tracker to
> properly know the task is ok, and this will get seen by the jobtracker
> in turn. Both of these options are explained in the JavaDoc:
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html"
>
> Hope this helps
>
> On Fri, May 7, 2010 at 4:47 AM, gabriele renzi wrote:
>> Hi everyone,
>>
>> I am trying to develop a mapreduce job that does a simple
>> selection+filter on the rows in our store.
>> Of course it is mostly based on the WordCount example :)
>>
>> Sadly, while it seems the app runs fine on a test keyspace with little
>> data, when run on a larger test index (but still on a single node) I
>> reliably see this error in the logs:
>>
>> 10/05/06 16:37:58 WARN mapred.LocalJobRunner: job_local_0001
>> java.lang.RuntimeException: TimedOutException()
>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:165)
>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:215)
>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:97)
>>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:91)
>>     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>>     at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
>> Caused by: TimedOutException()
>>     at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
>>     at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
>>     at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
>>     at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:142)
>>     ... 11 more
>>
>> and after that the job seems to finish "normally", but no results are produced.
>>
>> FWIW this is on 0.6.0 (we didn't move to 0.6.1 yet because, well, if
>> it ain't broke don't fix it).
>>
>> The single node has a data directory of about 127GB in two column
>> families, of which the one used in the mapred job is about 100GB.
>> The cassandra server is run with 6GB of heap on a box with 8GB
>> available and no swap enabled. Read/write latencies from cfstats are:
>>
>> Read Latency: 0.8535837762577986 ms.
>> Write Latency: 0.028849603764075547 ms.
>>
>> Row cache is not enabled; key cache percentage is the default. Load on the
>> machine is basically zero when the job is not running.
>>
>> As my code is 99% that from the wordcount contrib, I should note that
>> in 0.6.1's contrib (and trunk) there is a RING_DELAY constant that we
>> can supposedly change, but it's apparently not used anywhere; as I
>> said, running on a single node this should not be an issue anyway.
>>
>> Does anyone have suggestions, or has anyone seen this error before?
>> On the other hand, have people run this kind of job in similar conditions
>> flawlessly, so that I can consider it just my problem?
>>
>> Thanks in advance for any help.
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
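[Editor's note] The heartbeat pattern both replies describe can be sketched roughly as below. The Reporter interface here is a stand-in for Hadoop's org.apache.hadoop.mapred.Reporter (with the new mapreduce API you would call context.setStatus() or context.progress() on the Mapper.Context instead); the row type and the 1000-row reporting interval are illustrative, not taken from the thread.

```java
import java.util.ArrayList;
import java.util.List;

public class HeartbeatSketch {

    // Stand-in for Hadoop's Reporter: only the two calls discussed above.
    interface Reporter {
        void setStatus(String status);                       // tells the task tracker the task is alive
        void incrCounter(String group, String name, long amount);
    }

    // Process a long batch of rows, pinging the reporter every 1000 rows
    // so a busy-but-silent task is not declared dead by the task tracker.
    static long processRows(List<String> rows, Reporter reporter) {
        long processed = 0;
        for (String row : rows) {
            // ... expensive per-row work would go here ...
            processed++;
            if (processed % 1000 == 0) {
                reporter.setStatus("processed " + processed + " rows");
                reporter.incrCounter("job", "rows", 1000);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 5000; i++) rows.add("row-" + i);

        final long[] pings = {0};
        Reporter reporter = new Reporter() {
            public void setStatus(String status) { pings[0]++; }
            public void incrCounter(String g, String n, long a) { }
        };

        long n = processRows(rows, reporter);
        System.out.println(n + " rows, " + pings[0] + " status pings");
        // prints: 5000 rows, 5 status pings
    }
}
```

In a real ColumnFamilyRecordReader-fed mapper the same call would go inside the map() method, so long slices of Cassandra rows keep refreshing the task's liveness instead of silently running past the expiry interval.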
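[Editor's note] Matt's mapred.task.timeout suggestion, sketched as a config fragment for Hadoop 0.20-era clusters: the property defaults to 600000 ms (10 minutes), and the 1800000 value below is only an example, not a recommendation from the thread. It can also be set per job on the JobConf.

```xml
<!-- mapred-site.xml: raise the per-task liveness timeout from the
     default 600000 ms (10 min). A value of 0 disables the timeout
     entirely, which also hides genuinely hung tasks, so prefer a
     finite value over 0. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```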