hive-user mailing list archives

From Mohammad Tariq <>
Subject Re: Increasing map reduce tasks will increase the time of the cpu does this seem to be correct
Date Thu, 13 Dec 2012 13:59:32 GMT
You are welcome.

First things first. We cannot directly compare Hadoop with traditional
warehouse systems or DBMSs; they are meant for different purposes.

One small example: if you have 1 GB of data, nothing can match an RDBMS.
You'll get the results almost instantly, as you have described above. Now,
suppose your company is doing very well, has grown very big, and you have
500 TB of data. If you try to process this much data using any traditional
system you will face a lot of difficulty, as these systems have poor
horizontal scalability. The only thing you can do is increase your H/W
capacity, which can be done only up to a certain limit. This is where
Hadoop comes into the picture.

You can combine 'N' small machines and use their power collectively to
process your huge data: the basic principle of distributed computing. Long
story short, you cannot evaluate the power of Hadoop on a small dataset. If
you are going to do some OLTP kind of thing, I would not suggest Hadoop.
The same holds for Hive and Pig. Hadoop is basically a batch processing
system and is not meant for realtime stuff.
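
To make the small-vs-large point concrete, here is a toy cost model in
Python. It is only a sketch: the startup cost, the scan rate and the node
count are assumed, illustrative values, not measurements from any real
system.

```python
# Toy cost model. Every constant below is an assumed, illustrative value.
JOB_STARTUP_S = 30.0     # assumed fixed cost of launching a MapReduce job
SCAN_MB_PER_S = 100.0    # assumed scan rate of a single machine
NODES = 100              # assumed cluster size


def single_machine_seconds(data_mb):
    """Time for one machine to scan the data, with no job startup cost."""
    return data_mb / SCAN_MB_PER_S


def cluster_seconds(data_mb, nodes):
    """Fixed startup cost plus a scan spread evenly over all nodes."""
    return JOB_STARTUP_S + data_mb / (SCAN_MB_PER_S * nodes)


for label, data_mb in [("40 MB", 40), ("500 TB", 500 * 1024 * 1024)]:
    one = single_machine_seconds(data_mb)
    many = cluster_seconds(data_mb, NODES)
    print(f"{label}: single machine {one:.1f} s, {NODES} nodes {many:.1f} s")
```

Under these assumptions the fixed startup cost decides the small case, and
the division of work across nodes decides the large one.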

Now, coming back to your actual question: the no. of mappers depends mainly
on the no. of InputSplits created by the InputFormat you are using to
process your data, and the no. of reducers equals the no. of partitions the
map output is divided into, which is set in the job configuration.
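
A rough sketch of both counts in Python follows. The split rule is modeled
on the one in Hadoop's new-API FileInputFormat (split size =
max(minSize, min(maxSize, blockSize)), with a 10% slop on the last split);
the old API and Hive's combining input formats behave differently, so
treat it as an approximation. The function names are mine, and crc32 is
only a stand-in for the key hash that a hash partitioner would use.

```python
import zlib

MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the last split may run up to 10% over the split size


def compute_split_size(block_size, min_size, max_size):
    """Clamp the block size between the configured min and max split size."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size):
    """Number of splits (hence map tasks) for one splittable file."""
    splits = 0
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # the tail becomes the final split
    return splits


def partition_for(key, num_reducers):
    """Hash-partition a key; each partition feeds exactly one reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


split_size = compute_split_size(64 * MB, 1, 2**63 - 1)
print(count_splits(40 * MB, split_size))    # a 40 MB file, 64 MB blocks
print(count_splits(200 * MB, split_size))   # a 200 MB file, 64 MB blocks
print(partition_for("col1", 4))
```

With a 64 MB block size, a 40 MB file fits in a single split, so asking for
more map tasks on such a file mostly adds scheduling overhead.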


    Mohammad Tariq

On Thu, Dec 13, 2012 at 6:25 PM, imen Megdiche <> wrote:

> Thank you for your explanations. I work in pseudo-distributed mode and
> not on a cluster. Do your recommendations also apply in this mode, and
> how can I obtain an execution time that scales with the number of map
> and reduce tasks, if that is possible?
> I don't understand in general how MapReduce performs better in analysis
> than other systems such as data warehouses. For example, I tested a simple
> query with Hive, "select sum(col1) from table1", and the result took on
> the order of 10 min with Hive and on the order of 0.20 min with Oracle,
> for a data size on the order of 40 MB.
> Thank you.
> 2012/12/13 Mohammad Tariq <>
>> Hello Imen,
>>       If you have a huge no. of tasks, then the overhead of managing map
>> and reduce task creation begins to dominate the total job execution time.
>> Also, more tasks means you need more free CPU slots. If no slots are free
>> on the node holding the data block of interest, the block will be moved
>> to some other node where free slots are available. That consumes time and
>> goes against the most basic principle of Hadoop, i.e. data locality. So
>> the no. of maps and reduces should be raised keeping all these factors in
>> mind, otherwise you may face performance issues.
>> HTH
>> Regards,
>>     Mohammad Tariq
>> On Thu, Dec 13, 2012 at 4:11 PM, Nitin Pawar <> wrote:
>>> If the number of maps or reducers your job launches is more than the
>>> job queue/cluster capacity, CPU time will increase
>>> On Dec 13, 2012 4:02 PM, "imen Megdiche" <>
>>> wrote:
>>>> Hello,
>>>> I am trying to increase the number of map and reduce tasks for a job,
>>>> and even for the same data size I noticed that the total CPU time
>>>> increases, but I thought it would decrease. MapReduce is known for
>>>> computational performance, but I do not see this when I do these small
>>>> tests.
>>>> What do you think about this issue?
