hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Ideal number of mappers and reducers to increase performance
Date Thu, 07 Aug 2014 10:36:32 GMT
Felix has already explained most of the characteristics that define
the parallelism of MR jobs.

How many mappers does your program run? Your parallel performance
depends on how much parallelism your job actually runs with, aside
from what the platform provides as capacity. Perhaps for your input
it only uses two map tasks (due to only 2 input splits), so it
wouldn't go any faster by default. Or perhaps your input is a single
non-splittable file, such as a gzip-compressed text file, which would
yield only one map task.
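One quick way to check the second case: gzip streams begin with the
magic bytes 0x1f 0x8b, and Hadoop cannot split such a file. A small
sketch in plain Python (not Hadoop code; the file names are just for
the demo) that tests whether an input file is gzip-compressed and
therefore bound to a single map task:

```python
import gzip

# gzip streams start with the two magic bytes 0x1f 0x8b; Hadoop cannot
# split such a file, so it is read by a single map task.
GZIP_MAGIC = b"\x1f\x8b"

def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

# Demo: write a tiny gzip file locally and check it.
with gzip.open("sample.txt.gz", "wb") as f:
    f.write(b"hello")
print(is_gzip("sample.txt.gz"))  # True -> would yield one map task
```

If this returns True for your input, recompressing to a splittable
format (or plain text) is the first thing to try.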

As to the reducers question, if you are using an expressive wrapper
such as Cascalog, then it also depends on what you are doing in it. If
you are computing an operation such as a total count, or a global max
for example, then the wrapper may by itself set the # of reducers to
1. I'm not aware of Cascalog's internals, but that may be worth
looking into.
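To illustrate why that happens, here is a minimal sketch in plain
Python (not Cascalog or Hadoop code, just the shape of the reduce
phase): a global max must funnel every value through one "reducer",
while a keyed max can be partitioned across as many reducers as you
configure:

```python
from collections import defaultdict

def global_max(pairs):
    """Single reducer: every value funnels into one aggregation."""
    return max(value for _, value in pairs)

def keyed_max(pairs, n_reducers):
    """Keyed aggregation: values are partitioned by hash(key) % n_reducers,
    so each reducer independently computes the max for its own keys."""
    partitions = [defaultdict(lambda: float("-inf")) for _ in range(n_reducers)]
    for key, value in pairs:
        part = partitions[hash(key) % n_reducers]
        part[key] = max(part[key], value)
    merged = {}  # a given key always lands in one partition, so no collisions
    for part in partitions:
        merged.update(part)
    return merged

data = [("a", 3), ("b", 7), ("a", 5), ("b", 2)]
print(global_max(data))    # 7
print(keyed_max(data, 2))  # {'a': 5, 'b': 7} in some key order
```

A query that ends in a global sort plus a limit of 1, like the one
below, has the first shape, so extra reducers cannot help it.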

P.S. Just in case it was a typo: the property is
mapred.reduce.tasks, not mapped.reduce.tasks.
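For reference, a cluster-wide default can be set in mapred-site.xml
(using the MR1-era property name from this thread); per-job, any job
built on ToolRunner/GenericOptionsParser also accepts
-D mapred.reduce.tasks=4 on the command line. A sketch of the XML
fragment:

```xml
<!-- mapred-site.xml: default number of reduce tasks per job (MR1 name) -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
```

Note that a wrapper can still override this at runtime (as Cascalog
does for global aggregations), so the numReduceTasks line in the job
log is the ground truth.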

On Tue, Aug 5, 2014 at 1:26 AM, Sindhu Hosamane <sindhuht@gmail.com> wrote:
> Thanks a lot for your explanation, Felix.
> My query is not using a global sort/count. But I am still unable to
> understand:
> even though I set mapped.reduce.tasks=4,
> when the hadoop job runs I still see
> 14/08/03 15:01:48 INFO mapred.MapTask: numReduceTasks: 1
> 14/08/03 15:01:48 INFO mapred.MapTask: io.sort.mb = 100
>
> Does that look ok? numReduceTasks should be 4, right?
> Also, I am pasting my cascalog query below. Please point out where I am
> wrong. Why has the performance not increased?
>
> Cascalog code
> (def info
>       (hfs-delimited  "/users/si/File.txt"
>                        :delimiter ";"
>                        :outfields ["?timestamp" "?AIT" "?CET" "?BTT367"]
>                        :classes [String String String String]
>                        :skip-header? true))
>
>
>
> (defn convert-to-long [a]
>     (ct/to-long (f/parse custom-formatter a)))
>
> (def info-tap
>   (<- [?timestamp  ?BTT367 ]
>       ((select-fields info ["?timestamp"  "?BTT367"]) ?timestamp  ?BTT367)))
>
> (defn convert-to-float [a]
>   (try
>     (if (not= a " ")
>       (read-string a))
>     (catch Exception e nil)))
>
> (?<- (stdout) [?timestamp-out ?highest-value] (info-tap ?timestamp ?BTT367)
>      (convert-to-float ?BTT367 :> ?converted-BTT367)
>      (convert-to-long ?timestamp :> ?converted-timestamp)
>      (>= ?converted-timestamp start-value)
>      (<= ?converted-timestamp end-value)
>      (:sort ?converted-BTT367) (:reverse true)
>      (c/limit [1] ?timestamp ?converted-BTT367 :> ?timestamp-out
>               ?highest-value))
>
>
> Regards,
> Sindhu
>
>
>
>
>
> On 04 Aug 2014, at 19:10, Felix Chern <idryman@gmail.com> wrote:
>
> The mapper and reducer numbers really depend on what your program is trying
> to do. Without your actual query it’s really difficult to tell why you are
> having this problem.
>
> For example, if you try to perform a global sum or count, cascalog will
> only use one reducer, since that is the only way to do a global sum/count.
> To avoid this behavior you can set an output key that naturally splits the
> work across reducers; e.g. the word count example uses the word as the
> output key. With that word count output you can sum it up in a serial
> manner, or run the global map reduce job on this much smaller input.
>
> The mapper number is usually not a performance bottleneck. If you're
> curious, when the file is splittable (i.e., unzipped text or a sequence
> file), the number of mappers is controlled by the split size in the
> configuration. The smaller the split size, the more mappers are queued.
>
> In short, your problem is likely not a configuration problem, but a
> misunderstanding of the map reduce logic. To solve your problem, can you
> paste your cascalog query and let people take a look?
>
> Felix
>
> On Aug 3, 2014, at 1:51 PM, Sindhu Hosamane <sindhuht@gmail.com> wrote:
>
>
> I am not coding in mapreduce. I am running my cascalog queries on a hadoop
> cluster (1 node) on data of size 280MB. So all the config settings have to
> be made on the hadoop cluster itself.
> As you said, I set the values of mapred.tasktracker.map.tasks.maximum = 4
> and mapred.tasktracker.reduce.tasks.maximum = 4
> and then kept tuning them up and down, like below:
> (4+4) (5+3) (6+2) (2+6) (3+5) (3+3) (10+10)
>
> But every time the performance remains the same.
> Whatever combination of mapred.tasktracker.map.tasks.maximum and
> mapred.tasktracker.reduce.tasks.maximum I use, it produces the same
> execution time.
>
> Then, when the above attempts failed, I also tried mapred.reduce.tasks = 4;
> still the results are the same. No reduction in execution time.
>
> What other things should I set? Also, I made sure hadoop was restarted
> every time after changing the config.
> I have attached my conf folder; please indicate what should be added
> where.
> I am really stuck. Your help would be much appreciated. Thank you.
> <(singlenodecuda)conf.zip>
>
> Regards,
> Sindhu
>
>
>



-- 
Harsh J
