cassandra-user mailing list archives

From Matt Revelle <mreve...@gmail.com>
Subject Re: timeout while running simple hadoop job
Date Fri, 07 May 2010 12:53:25 GMT
There's also the mapred.task.timeout property that can be tweaked.  But reporting is the correct
way to fix timeouts during execution.
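
A minimal sketch of the reporting pattern, for illustration only. To compile without Hadoop on the classpath it uses a hypothetical `Heartbeat` interface standing in for the `progress()` and `setStatus(String)` calls on Hadoop's `Reporter`; `RecordingHeartbeat`, `processRows` and the 1000-row interval are likewise assumptions made for this example, not code from the thread or from the wordcount contrib. In a real task the same calls would go to the `Reporter` (or to the task context in the newer `org.apache.hadoop.mapreduce` API).

```java
import java.util.ArrayList;
import java.util.List;

public class ReportingSketch {

    /** Hypothetical stand-in for the subset of Hadoop's Reporter used here. */
    interface Heartbeat {
        void progress();
        void setStatus(String status);
    }

    /** Records calls so the sketch can be run and checked without a cluster. */
    static class RecordingHeartbeat implements Heartbeat {
        int progressCalls = 0;
        final List<String> statuses = new ArrayList<>();

        @Override
        public void progress() {
            progressCalls++;
        }

        @Override
        public void setStatus(String status) {
            statuses.add(status);
        }
    }

    /** Loops over rows and signals liveness every reportEvery rows. */
    static int processRows(int totalRows, int reportEvery, Heartbeat heartbeat) {
        int processed = 0;
        for (int row = 0; row < totalRows; row++) {
            // ... expensive per-row work would go here ...
            processed++;
            if (processed % reportEvery == 0) {
                // In a real task: reporter.progress() and reporter.setStatus(...)
                heartbeat.progress();
                heartbeat.setStatus("processed " + processed + " rows");
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        RecordingHeartbeat heartbeat = new RecordingHeartbeat();
        int done = processRows(2500, 1000, heartbeat);
        System.out.println(done + " rows, " + heartbeat.progressCalls + " progress calls");
    }
}
```

Reporting every N rows rather than on every row keeps the overhead low while still signalling well inside the timeout. The right interval depends on how long each row takes, so the value above is only a placeholder.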

On May 7, 2010, at 8:49 AM, Joseph Stein wrote:

> The problem could be that you are crunching more data than can be
> processed within the expiry interval setting.
> 
> In Hadoop you need to tell the task tracker that your task is still
> working, which is done by setting the status or incrementing a counter
> on the Reporter object.
> 
> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
> 
> "In your Java code there is a little trick to help the job be “aware”
> within the cluster of tasks that are not dead but just working hard.
> During execution of a task there is no built in reporting that the job
> is running as expected if it is not writing out.  So this means that
> if your tasks are taking up a lot of time doing work it is possible
> the cluster will see that task as failed (based on the
> mapred.task.tracker.expiry.interval setting).
> 
> Have no fear, there is a way to tell the cluster that your task is doing
> just fine.  You have 2 ways to do this: you can either report the status
> or increment a counter.  Both of these will cause the task tracker to
> properly know the task is ok and this will get seen by the jobtracker
> in turn.  Both of these options are explained in the JavaDoc
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html"
> 
> Hope this helps
> 
> On Fri, May 7, 2010 at 4:47 AM, gabriele renzi <rff.rff@gmail.com> wrote:
>> Hi everyone,
>> 
>> I am trying to develop a mapreduce job that does a simple
>> selection+filter on the rows in our store.
>> Of course it is mostly based on the WordCount example :)
>> 
>> 
>> Sadly, while it seems the app runs fine on a test keyspace with little
>> data, when run on a larger test index (but still on a single node) I
>> reliably see this error in the logs
>> 
>> 10/05/06 16:37:58 WARN mapred.LocalJobRunner: job_local_0001
>> java.lang.RuntimeException: TimedOutException()
>>        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:165)
>>        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:215)
>>        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:97)
>>        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>>        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>>        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:91)
>>        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>>        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
>> Caused by: TimedOutException()
>>        at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
>>        at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
>>        at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
>>        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:142)
>>        ... 11 more
>> 
>> and after that the job seems to finish "normally" but no results are produced.
>> 
>> FWIW this is on 0.6.0 (we didn't move to 0.6.1 yet because, well, if
>> it ain't broke don't fix it).
>> 
>> The single node has a data directory of about 127GB in two column
>> families, of which the one used in the mapred job is about 100GB.
>> The cassandra server is run with 6GB of heap on a box with 8GB
>> available and no swap enabled. Read/write latencies from cfstats are
>> 
>>        Read Latency: 0.8535837762577986 ms.
>>        Write Latency: 0.028849603764075547 ms.
>> 
>> row cache is not enabled, key cache percentage is default. Load on the
>> machine is basically zero when the job is not running.
>> 
>> As my code is 99% that from the wordcount contrib, I should note that
>> in 0.6.1's contrib (and trunk) there is a RING_DELAY constant that we
>> can supposedly change, but it's apparently not used anywhere. In any
>> case, as I said, running on a single node this should not be an issue.
>> 
>> Does anyone have suggestions, or has anyone seen this error before? On
>> the other hand, have people run this kind of job in similar conditions
>> flawlessly, so I can consider it just my problem?
>> 
>> 
>> Thanks in advance for any help.
>> 
> 
> 
> 
> -- 
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */

