hadoop-hdfs-user mailing list archives

From Sindhu Hosamane <sindh...@gmail.com>
Subject Re: Ideal number of mappers and reducers to increase performance
Date Mon, 04 Aug 2014 19:56:46 GMT
Thanks a lot for your explanation, Felix.
My query is not using a global sort/count. But I still cannot understand it: even though I set mapred.reduce.tasks=4,
when the Hadoop job runs I still see
14/08/03 15:01:48 INFO mapred.MapTask: numReduceTasks: 1
14/08/03 15:01:48 INFO mapred.MapTask: io.sort.mb = 100

Does that look OK? numReduceTasks should be 4, right?
I am also pasting my Cascalog query below. Please point out where I am going wrong, and why the performance
has not improved.

Cascalog code
(def info
      (hfs-delimited  "/users/si/File.txt"
                       :delimiter ";"
                       :outfields ["?timestamp" "?AIT" "?CET" "?BTT367"]
                       :classes [String String String String  ]
                       :skip-header? true))
;; assumes clj-time: (require '[clj-time.coerce :as ct] '[clj-time.format :as f]);
;; custom-formatter must be defined elsewhere, e.g. with (f/formatter ...)
(defn convert-to-long [a]
  (ct/to-long (f/parse custom-formatter a)))

(def info-tap
  (<- [?timestamp  ?BTT367 ]
      ((select-fields info ["?timestamp"  "?BTT367"]) ?timestamp  ?BTT367)))

;; the original had a stray (catch ...) outside any try; wrap the parse in try/catch
(defn convert-to-float [a]
  (try
    (when (not= a " ")
      (read-string a))
    (catch Exception e nil)))

(?<- (stdout) [?timestamp-out ?highest-value]
     (info-tap ?timestamp ?BTT367)
     (convert-to-float ?BTT367 :> ?converted-BTT367)
     (convert-to-long ?timestamp :> ?converted-timestamp)
     (>= ?converted-timestamp start-value)
     (<= ?converted-timestamp end-value)
     (:sort ?converted-BTT367) (:reverse true)
     (c/limit [1] ?timestamp ?converted-BTT367 :> ?timestamp-out ?highest-value))
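
For reference, the query above amounts to a filter-then-top-1 aggregation. A minimal Python sketch of the same logic (the sample rows and the start/end bounds below are made up for illustration) might look like:

```python
# Hypothetical stand-in for the Cascalog query above: keep rows whose
# timestamp falls in [start_value, end_value], then emit the single row
# with the highest BTT367 value.
rows = [
    (100, 3.5),   # (timestamp-as-long, BTT367-as-float)
    (150, 9.1),
    (200, 7.2),
    (300, 4.0),   # outside the window below
]
start_value, end_value = 100, 250

in_window = [(ts, v) for ts, v in rows if start_value <= ts <= end_value]
timestamp_out, highest_value = max(in_window, key=lambda r: r[1])
print(timestamp_out, highest_value)  # -> 150 9.1
```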


On 04 Aug 2014, at 19:10, Felix Chern <idryman@gmail.com> wrote:

> The mapper and reducer numbers really depend on what your program is trying to do. Without
your actual query it's really difficult to tell why you are having this problem.
> For example, if you try to perform a global sum or count, Cascalog will only use one
reducer, since that is the only way to do a global sum/count. To avoid this behavior, you can
set an output key that naturally splits the work across reducers; e.g. the word count example
uses the word as the output key. With the word count output you can then sum it up serially,
or run a global map reduce job over this much smaller input.
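
Felix's point about output keys can be sketched outside Cascalog. In this toy Python shuffle (the partition count and data are hypothetical), keying by word lets several reducers share the work, while a global count would have only one key:

```python
from collections import defaultdict

words = ["a", "b", "a", "c", "b", "a"]
num_reducers = 2  # hypothetical

# Map phase: emit (word, 1); the shuffle assigns each key to a reducer by hash.
partitions = defaultdict(list)
for w in words:
    partitions[hash(w) % num_reducers].append((w, 1))

# Reduce phase: each partition sums its own keys independently of the others.
counts = {}
for part in partitions.values():
    per_key = defaultdict(int)
    for w, n in part:
        per_key[w] += n
    counts.update(per_key)

print(counts)  # totals: a=3, b=2, c=1 -- keys spread across reducers
# A global count, by contrast, has a single key, so every record must
# reach the same (single) reducer.
```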
> The mapper number is usually not a performance bottleneck. If you're curious: when the file
is splittable (i.e. unzipped text or a sequence file), the number of mappers is controlled
by the split size in the configuration. The smaller the split size, the more mappers are launched.
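
As a rough sketch (the exact split logic varies by InputFormat and Hadoop version, so treat this as an approximation rather than the real formula):

```python
import math

def approx_num_mappers(file_size_bytes, split_size_bytes):
    """Roughly one mapper per input split for a splittable file."""
    return math.ceil(file_size_bytes / split_size_bytes)

# A 280 MB input (as in this thread) with a 64 MB split size:
print(approx_num_mappers(280 * 1024**2, 64 * 1024**2))  # -> 5
```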
> In short, your problem is not likely to be a configuration problem, but a misunderstanding
of the MapReduce logic. To solve your problem, can you paste your Cascalog query and let people
take a look?
> Felix
> On Aug 3, 2014, at 1:51 PM, Sindhu Hosamane <sindhuht@gmail.com> wrote:
>> I am not coding in MapReduce directly. I am running my Cascalog queries on a Hadoop cluster (1
node) on data of size 280 MB, so all the config settings have to be made on the Hadoop cluster.
>> As you said, I set mapred.tasktracker.map.tasks.maximum = 4
>> and mapred.tasktracker.reduce.tasks.maximum = 4
>> and then kept tuning them up and down, like below:
>> (4+4) (5+3) (6+2) (2+6) (3+5) (3+3) (10+10)
>> But the performance remains the same every time.
>> Every time, regardless of which combination of mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum I use, I get the same execution time.
>> When that failed, I also tried mapred.reduce.tasks = 4;
>> still the results are the same. No reduction in execution time.
>> What else should I set? Also, I made sure Hadoop is restarted every time after
changing the config.
>> I have attached my conf folder. Please indicate what should be added where.
>> I am really stuck. Your help would be much appreciated. Thank you.
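
Note that mapred.reduce.tasks can also be passed per job on the command line instead of in the conf folder, which avoids restarting Hadoop. This is a config fragment, not a runnable example; the jar and driver class names are placeholders, and the -D flag is only honored if the driver parses generic options (e.g. via ToolRunner):

```shell
# Per-job override of the reducer count; placeholders for jar/class/paths.
hadoop jar myjob.jar com.example.MyDriver \
  -D mapred.reduce.tasks=4 \
  input_path output_path
```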
>> <(singlenodecuda)conf.zip>
>> Regards,
>> Sindhu
