hadoop-mapreduce-user mailing list archives

From Mat Kelcey <matthew.kel...@gmail.com>
Subject Re: is there a way to just abandon a map task?
Date Mon, 21 Nov 2011 01:18:24 GMT
Thanks for the suggestion Arun, I hadn't seen these params before.

No way to do it for a job in flight though I guess?


On 20 November 2011 16:43, Arun C Murthy <acm@hortonworks.com> wrote:
> Mat,
>  Take a look at mapred.max.(map|reduce).failures.percent.
>  See:
>  http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int)
>  http://hadoop.apache.org/common/docs/r0.20.205.0/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)
> hth,
> Arun
> On Nov 20, 2011, at 1:31 PM, Mat Kelcey wrote:
>> Hi,
>> I have a largish job running that, due to the quirks of the third
>> party input format I'm using, has 280,000 map tasks. ( I know this is
>> far from ideal but it'll do for me )
>> I'm passing this data (the common crawl web crawl dataset) through a
>> visible-text-from-html extraction library (boilerpipe) which is
>> struggling with _1_ particular task. It hits a sequence of records
>> that are _insanely_ slow to parse for some reason. Rather than a few
>> minutes per split it took 7+ hrs before I started explicitly trying
>> to fail the task (hadoop job -fail-task). Since I'm running with bad
>> record skipping I was hoping I could issue -fail-task a few times and
>> ride over the bad records but it looks like there's quite a few there.
>> Since it's only 1 of the 280,000 I'm actually happy to just give up on
>> the entire split.
>> Now if I was running a map only job I'd just kill the job since I'd
>> have the output of the other 279,999. This job has a no-op reduce step
>> though since I wanted to take the chance to compact the output into a
>> much smaller number of sequence files ( I regret that decision now). As
>> such I can't just kill the job since I'd lose the rest of the
>> processed data (if I understand correctly?)
>> So does anyone know a way to just abandon the entire split?
>> Cheers,
>> Mat
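
A note on the numbers for anyone trying Arun's suggestion: the setters linked above take a whole-number percentage, so with 280,000 map tasks the smallest non-zero setting already allows far more than one abandoned split. Below is a minimal Java sketch of that arithmetic. It assumes a job keeps running while failed * 100 <= percent * total, which is an assumption about how the limit is applied and is not confirmed anywhere in this thread. FailureBudget and its method names are hypothetical and are not part of Hadoop.

```java
// Sketch only: models the failure budget implied by a whole-number percentage.
// The comparison used here is an assumption about how the limit is applied,
// not code taken from Hadoop. All names in this class are hypothetical.
final class FailureBudget {

    // Largest number of failed tasks tolerated, assuming the job survives
    // while failed * 100 <= percent * totalTasks.
    static long maxToleratedFailures(long totalTasks, int percent) {
        if (totalTasks < 0 || percent < 0 || percent > 100) {
            throw new IllegalArgumentException("invalid task count or percent");
        }
        return (totalTasks * percent) / 100;
    }

    // Smallest whole percentage that tolerates the given number of failures
    // under the same assumption (ceiling of failed * 100 / total).
    static int minPercentFor(long totalTasks, long failedTasks) {
        if (totalTasks <= 0 || failedTasks < 0 || failedTasks > totalTasks) {
            throw new IllegalArgumentException("invalid task counts");
        }
        return (int) ((failedTasks * 100 + totalTasks - 1) / totalTasks);
    }

    public static void main(String[] args) {
        long totalMaps = 280_000L; // map task count from the original post
        int percent = minPercentFor(totalMaps, 1);
        System.out.println("minimum percent for 1 failed map: " + percent);
        System.out.println("failed maps tolerated at that setting: "
                + maxToleratedFailures(totalMaps, percent));
        // In a real job driver this value would be handed to
        // JobConf.setMaxMapTaskFailuresPercent(percent) before submission.
    }
}
```

Under that assumption, the smallest setting that tolerates one failed map here is 1, and it would also tolerate up to 2,800 failed maps, so it is worth checking the failed task count after the job finishes.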
