hadoop-mapreduce-user mailing list archives

From "Raviteja Chirala" <rte...@gmail.com>
Subject Re: Why is Hadoop always running just 4 tasks?
Date Thu, 12 Dec 2013 01:40:48 GMT
Adam is right. Hadoop runs one map task per gz file, even if the file is a concatenation of several gz files, and even if isSplitable returned true: Hadoop will not parse the gz stream to find offsets it can safely seek to.
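To see why concatenation alone doesn't help, here is a small Python sketch (illustrative only, not Hadoop code): `cat a.gz b.gz` produces a valid multi-member gzip stream that decompresses fine from byte 0, but decompression from any other offset fails, so there is no safe place to start a second map task.

```python
import gzip

# Two independently gzipped "files", concatenated byte-for-byte,
# mimicking `cat a.gz b.gz > ab.gz`.
part1 = gzip.compress(b"line one\n")
part2 = gzip.compress(b"line two\n")
combined = part1 + part2

# A multi-member gzip stream decompresses as one continuous stream...
assert gzip.decompress(combined) == b"line one\nline two\n"

# ...but only when read from byte 0. Starting mid-stream fails,
# which is why a gzip file cannot be split across map tasks.
try:
    gzip.decompress(combined[10:])
    ok = False
except gzip.BadGzipFile:
    ok = True
```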

—
Sent from Mailbox for iPad

On Wed, Dec 11, 2013 at 11:46 AM, Adam Kawa <kawa.adam@gmail.com> wrote:

> I am not sure if Hadoop detects that. I guess it will run one map
> task for them. Please let me know if I am wrong.
> 2013/12/11 Dror, Ittay <idror@akamai.com>
>> OK, thank you for the solution.
>>
>> BTW, I just concatenated several .gz files together with cat (without
>> uncompressing first), so each of them should uncompress individually.
>>
>>
>>
>> From: Adam Kawa <kawa.adam@gmail.com>
>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Date: Wednesday, December 11, 2013 9:33 PM
>>
>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Subject: Re: Why is Hadoop always running just 4 tasks?
>>
>> mapred.map.tasks is rather a hint to the InputFormat (
>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces) and it is ignored in
>> your case.
>>
>> You are processing gz files, and the InputFormat has an isSplitable method
>> that returns false for gz files, so each map task processes a whole file
>> (this is inherent to gz files - you cannot uncompress part of a gzipped
>> file; to uncompress it, you must read it from the beginning to the end).
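A simplified sketch of the split decision described above (hypothetical Python, loosely mimicking FileInputFormat; `compute_splits` is an illustrative helper, not a Hadoop API):

```python
def compute_splits(file_size, block_size, splittable):
    """Loosely mimic how an InputFormat turns one file into input splits."""
    if not splittable:              # e.g. isSplitable() == False for .gz
        return [(0, file_size)]     # whole file -> a single map task
    splits, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# An unsplittable 500 GB gzip file yields one split (one map task);
# a splittable file of the same size yields one split per 128 MB block.
gb = 1024 ** 3
assert len(compute_splits(500 * gb, 128 * 1024 * 1024, False)) == 1
assert len(compute_splits(500 * gb, 128 * 1024 * 1024, True)) == 4000
```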
>>
>>
>>
>>
>> 2013/12/11 Dror, Ittay <idror@akamai.com>
>>
>>> Thank you.
>>>
>>> The command is:
>>> hadoop jar /tmp/Algo-0.0.1.jar com.twitter.scalding.Tool com.akamai.Algo
>>> --hdfs --header --input /algo/input{0..3}.gz --output /algo/output
>>>
>>> Btw, the Hadoop version is 1.2.1
>>>
>>> Not sure what driver you are referring to.
>>> Regards,
>>> Ittay
>>>
>>> From: Mirko Kämpf <mirko.kaempf@gmail.com>
>>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>>> Date: Wednesday, December 11, 2013 6:21 PM
>>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>>> Subject: Re: Why is Hadoop always running just 4 tasks?
>>>
>>> Hi,
>>>
>>> what is the command you execute to submit the job?
>>> Please share also the driver code ....
>>>
>>> So we can troubleshoot better.
>>>
>>> Best wishes
>>> Mirko
>>>
>>>
>>>
>>>
>>> 2013/12/11 Dror, Ittay <idror@akamai.com>
>>>
>>>> I have a cluster of 4 machines with 24 cores and 7 disks each.
>>>>
>>>> On each node I copied from local a file of 500G. So I have 4 files in
>>>> hdfs with many blocks. My replication factor is 1.
>>>>
>>>> I run a job (a scalding flow) and while there are 96 reducers pending,
>>>> there are only 4 active map tasks.
>>>>
>>>> What am I doing wrong? Below is the configuration
>>>>
>>>> Thanks,
>>>> Ittay
>>>>
>>>> <configuration>
>>>> <property>
>>>> <name>mapred.job.tracker</name>
>>>>  <value>master:54311</value>
>>>> </property>
>>>>
>>>> <property>
>>>>  <name>mapred.map.tasks</name>
>>>>  <value>96</value>
>>>> </property>
>>>>
>>>> <property>
>>>>  <name>mapred.reduce.tasks</name>
>>>>  <value>96</value>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>mapred.local.dir</name>
>>>>
>>>> <value>/hdfs/0/mapred/local,/hdfs/1/mapred/local,/hdfs/2/mapred/local,/hdfs/3/mapred/local,/hdfs/4/mapred/local,/hdfs/5/mapred/local,/hdfs/6/mapred/local,/hdfs/7/mapred/local</value>
>>>> </property>
>>>>
>>>> <property>
>>>> <name>mapred.tasktracker.map.tasks.maximum</name>
>>>> <value>24</value>
>>>> </property>
>>>>
>>>> <property>
>>>>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>     <value>24</value>
>>>> </property>
>>>> </configuration>
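Putting the numbers from this thread together (illustrative arithmetic only, assuming one map per unsplittable gzip input): the cluster has 96 map slots, but the job can only ever occupy 4 of them.

```python
# Cluster map-slot capacity vs. maps the job can actually run,
# using the figures from this thread.
nodes = 4
map_slots_per_node = 24                        # mapred.tasktracker.map.tasks.maximum
total_map_slots = nodes * map_slots_per_node   # 96 concurrent map slots

gz_input_files = 4        # input{0..3}.gz, each unsplittable
map_tasks = gz_input_files  # one map task per unsplittable gz file

assert total_map_slots == 96
assert map_tasks == 4     # only 4 of 96 slots can be used
```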
>>>>
>>>
>>>
>>