cassandra-user mailing list archives

From Joost Ouwerkerk <>
Subject Re: timeout while running simple hadoop job
Date Fri, 07 May 2010 13:02:20 GMT
Joseph, the stacktrace suggests that it's Thrift that's timing out,
not the Task.

Gabriele, I believe that your problem is caused by too much load on
Cassandra.  get_range_slices is currently an expensive operation.  I
had some success in reducing (although, it turns out, not eliminating)
this problem by requesting smaller batches from get_range_slices.  See


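For the record, the batching idea looks roughly like this. This is only a sketch: `RangePager` and `fetchBatch` are made-up names standing in for Thrift's `Cassandra.Client.get_range_slices` called with a `KeyRange` whose `count` is kept small, not the actual Hadoop input-format code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of paging a key range in small batches instead of one huge
// get_range_slices call. fetchBatch is a stand-in for the real Thrift
// call; here it just reads from an in-memory sorted key list.
public class RangePager {

    // Stand-in for the keyspace: assumed sorted, distinct keys.
    private final List<String> allKeys;

    public RangePager(List<String> allKeys) {
        this.allKeys = allKeys;
    }

    // Mimics get_range_slices(startKey, count): up to `count` keys
    // starting at startKey (inclusive), in key order.
    List<String> fetchBatch(String startKey, int count) {
        List<String> out = new ArrayList<String>();
        for (String k : allKeys) {
            if (k.compareTo(startKey) >= 0 && out.size() < count) {
                out.add(k);
            }
        }
        return out;
    }

    // Pages through the whole range in batches of `batchSize`, keeping
    // each call cheap enough to finish before the server-side timeout.
    public List<String> scanAll(int batchSize) {
        List<String> result = new ArrayList<String>();
        String start = "";
        while (true) {
            List<String> batch = fetchBatch(start, batchSize);
            // The first key of each subsequent batch repeats the last
            // key of the previous one (start is inclusive), so skip it.
            int from = result.isEmpty() ? 0 : 1;
            for (int i = from; i < batch.size(); i++) {
                result.add(batch.get(i));
            }
            if (batch.size() < batchSize) {
                break; // reached the end of the range
            }
            start = batch.get(batch.size() - 1);
        }
        return result;
    }
}
```

The trade-off is more round trips for less work per call, which is usually what you want when each call is racing a timeout.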
On Fri, May 7, 2010 at 8:49 AM, Joseph Stein <> wrote:
> The problem could be that you are crunching more data than can be
> completed within the task expiry interval.
> In Hadoop you need to tell the task tracker that you are still
> doing work, which is done by setting a status or incrementing a
> counter on the Reporter object.
> "In your Java code there is a little trick to help the job be “aware”
> within the cluster of tasks that are not dead but just working hard.
> During execution of a task there is no built-in reporting that the job
> is running as expected if it is not writing output.  So this means that
> if your tasks spend a lot of time doing work, it is possible
> the cluster will see those tasks as failed (based on the
> mapred.task.tracker.expiry.interval setting).
> Have no fear: there is a way to tell the cluster that your task is doing
> just fine.  You have two ways to do this: you can either report status
> or increment a counter.  Both of these will cause the task tracker to
> know the task is OK, and that in turn gets seen by the jobtracker.
> Both of these options are explained in the JavaDoc.
> Hope this helps.
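The keepalive trick Joe describes can be sketched outside Hadoop as a plain background-heartbeat pattern. In a real task you would call `Reporter.progress()`, `Reporter.setStatus(...)`, or `Reporter.incrCounter(...)` from the map loop; here, as an assumption-laden stand-in, a daemon thread plays the role of the periodic "still alive" signal so the idea runs without a cluster.

```java
import java.util.concurrent.atomic.AtomicLong;

// Stand-alone sketch of the keepalive pattern: while a long-running
// task does its work, a background thread periodically emits a signal.
// The beats counter stands in for reporter.progress() / a Counter
// increment, which is what keeps a real Hadoop task from being expired.
public class Heartbeat {

    public static long runWithHeartbeat(Runnable longTask, long periodMillis) {
        final AtomicLong beats = new AtomicLong();
        Thread ticker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                beats.incrementAndGet(); // stands in for reporter.progress()
                try {
                    Thread.sleep(periodMillis);
                } catch (InterruptedException e) {
                    return; // task finished, stop ticking
                }
            }
        });
        ticker.setDaemon(true);
        ticker.start();
        try {
            longTask.run(); // the "working hard" part of the task
        } finally {
            ticker.interrupt();
            try {
                ticker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return beats.get();
    }
}
```

The simpler in-mapper alternative is just calling the reporting method every N records processed, which avoids the extra thread entirely.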
> On Fri, May 7, 2010 at 4:47 AM, gabriele renzi <> wrote:
>> Hi everyone,
>> I am trying to develop a mapreduce job that does a simple
>> selection+filter on the rows in our store.
>> Of course it is mostly based on the WordCount example :)
>> Sadly, while it seems the app runs fine on a test keyspace with little
>> data, when run on a larger test index (but still on a single node) I
>> reliably see this error in the logs
>> 10/05/06 16:37:58 WARN mapred.LocalJobRunner: job_local_0001
>> java.lang.RuntimeException: TimedOutException()
>>        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(
>>        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(
>>        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(
>>        at
>>        at
>>        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(
>>        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(
>>        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(
>>        at
>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(
>>        at
>>        at org.apache.hadoop.mapred.LocalJobRunner$
>> Caused by: TimedOutException()
>>        at org.apache.cassandra.thrift.Cassandra$
>>        at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(
>>        at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(
>>        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(
>>        ... 11 more
>> and after that the job seems to finish "normally" but no results are produced.
>> FWIW this is on 0.6.0 (we didn't move to 0.6.1 yet because, well, if
>> it ain't broke don't fix it).
>> The single node has a data directory of about 127GB in two column
>> families, of which the one used in the mapred job is about 100GB.
>> The Cassandra server runs with 6GB of heap on a box with 8GB
>> available and no swap enabled. Read/write latencies from cfstats are
>>        Read Latency: 0.8535837762577986 ms.
>>        Write Latency: 0.028849603764075547 ms.
>> row cache is not enabled, key cache percentage is default. Load on the
>> machine is basically zero when the job is not running.
>> As my code is 99% that of the wordcount contrib, I should note that
>> in 0.6.1's contrib (and trunk) there is a RING_DELAY constant that we
>> can supposedly change, but it's apparently not used anywhere; as I
>> said, running on a single node this should not be an issue anyway.
>> Does anyone have suggestions, or has anyone seen this error before? On
>> the other hand, have people run this kind of job in similar conditions
>> flawlessly, so that I can consider it just my problem?
>> Thanks in advance for any help.
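One more knob worth knowing about: the server-side timeout behind this TimedOutException is configurable in 0.6's storage-conf.xml. Raising it treats the symptom rather than the cause (smaller batches treat the cause), but it can confirm the diagnosis. The value below is an illustrative choice, not a recommendation:

```xml
<!-- storage-conf.xml (Cassandra 0.6): how long the node waits on an
     operation before throwing TimedOutException. Default is 10000 ms;
     raising it buys slow get_range_slices calls more time. -->
<RpcTimeoutInMillis>30000</RpcTimeoutInMillis>
```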
> --
> /*
> Joe Stein
> */
