hadoop-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: how to enhance job start up speed?
Date Mon, 13 Aug 2012 15:07:40 GMT
It was almost what I was getting at, but I was not sure about your problem.
Basically, Hadoop is only adding overhead because of the way your job is
constructed.
Now the question is: why do you need a single mapper? Is your need truly
not 'parallelisable'?
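
For reference, a single mapper per file usually comes from an input format
that refuses to split its input, roughly like the sketch below (just an
illustration against the new mapreduce API in 0.20/CDH3; the class name
NonSplittableTextInputFormat is made up, I don't know what your actual job
uses):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Forces one map task for the whole file: Hadoop will not split the input,
// so every HDFS block of the file is read by that single mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

With job.setInputFormatClass(NonSplittableTextInputFormat.class) on the Job,
the whole file goes to one map task; if the input could be split instead,
each HDFS block would get its own (mostly data-local) mapper and the startup
overhead would at least be amortised over parallel work.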

Bertrand

On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <bejoy.hadoop@gmail.com> wrote:

> Hi Matthias
>
> When a MapReduce program is used there are some extra steps, like
> checking the input and output dirs, calculating the input splits, the
> JobTracker assigning a TaskTracker to execute the task, etc.
>
> If your file is non-splittable, then one map task per file will be
> generated irrespective of the number of HDFS blocks. Some blocks will
> be on a different node than the node where the map task is executed, so
> time will be spent on the network transfer.
>
> In your case MR would be an overhead: as your file is non-splittable there
> is no parallelism, and there is also the overhead of copying blocks to the
> map task node.
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Matthias Kricke <matthias.mk.kricke@gmail.com>
> *Sender: * matthias.zengler@gmail.com
> *Date: *Mon, 13 Aug 2012 16:33:06 +0200
> *To: *<user@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: how to enhance job start up speed?
>
> OK, I'll try to clarify:
>
> 1) The worker is the logic inside my mapper, and it is the same for both
> cases.
> 2) I have two cases. In the first one I use Hadoop to execute my worker,
> and in the second one I execute my worker without Hadoop (a simple read of
> the file).
>    Now I measured, for both cases, the time the worker and the
> surroundings need, so I have two values for each case (see the timing
> sketch after this list). The worker took the same time in both cases for
> the same input (this is expected). But the surroundings took 17% more time
> when using Hadoop.
> 3) ~3 GB.
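>
> To be concrete about what I call "worker" and "surroundings", the
> measurement is roughly like this sketch of the sequential case (Worker and
> process() are placeholder names, not my real classes):
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
>
> public class SequentialTiming {
>     public static void main(String[] args) throws IOException {
>         Worker worker = new Worker();                     // placeholder for my logic
>         long start = System.nanoTime();
>         long workerNanos = 0;
>         BufferedReader in = new BufferedReader(new FileReader(args[0]));
>         for (String line = in.readLine(); line != null; line = in.readLine()) {
>             long t0 = System.nanoTime();
>             worker.process(line);                         // "worker" time
>             workerNanos += System.nanoTime() - t0;
>         }
>         in.close();
>         // everything that is not the worker = "surroundings"
>         long surroundings = (System.nanoTime() - start) - workerNanos;
>         System.out.println("worker ns: " + workerNanos
>                 + ", surroundings ns: " + surroundings);
>     }
> }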
>
> I want to know where this difference comes from and how to reduce it.
> I hope that helped? If not, feel free to ask again :)
>
> Greetings,
> MK
>
> P.S. Just for your information, I did the same test with Hypertable as
> well.
> I got:
>  * worker without anything: 15% overhead
>  * worker with Hadoop: 32% overhead
>  * worker with Hypertable: 53% overhead
> Remark: the overhead was measured relative to the worker, e.g. Hypertable
> uses 53% of the whole process time, while the worker uses 47%.
>
> 2012/8/13 Bertrand Dechoux <dechouxb@gmail.com>
>
>> I am not sure I understand, and I guess I am not the only one.
>>
>> 1) What's a worker in your context? Only the logic inside your Mapper or
>> something else?
>> 2) You should clarify your cases. You seem to have two cases, but both
>> have overhead, so I am assuming there is a baseline? Hadoop vs. sequential,
>> so sequential is not Hadoop?
>> 3) What is the size of the file?
>>
>> Bertrand
>>
>>
>> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
>> matthias.mk.kricke@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> I'm using CDH3u3.
>>> If I want to process one file, set to non-splittable, Hadoop starts one
>>> Mapper and no Reducer (that's OK for this test scenario). The Mapper
>>> goes through a configuration step where some variables for the worker
>>> inside the mapper are initialized.
>>> Now the Mapper gives me K,V pairs, where the values are lines of the
>>> input file. I process the V with the worker.
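>>>
>>> Roughly, the Mapper is shaped like this sketch (Worker and process() are
>>> placeholder names for my logic, not the real classes, and the output
>>> types are just for illustration):
>>>
>>> import java.io.IOException;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapreduce.Mapper;
>>>
>>> public class WorkerMapper extends Mapper<LongWritable, Text, Text, Text> {
>>>     private Worker worker;   // placeholder for the worker logic
>>>
>>>     @Override
>>>     protected void setup(Context context) {
>>>         // the configuration step: initialize the worker's variables once per task
>>>         worker = new Worker(context.getConfiguration());
>>>     }
>>>
>>>     @Override
>>>     protected void map(LongWritable key, Text value, Context context)
>>>             throws IOException, InterruptedException {
>>>         // key = byte offset, value = one line of the input file
>>>         String result = worker.process(value.toString());
>>>         context.write(new Text(key.toString()), new Text(result));
>>>     }
>>> }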
>>>
>>> When I compare the run time of Hadoop to the run time of the same
>>> process run sequentially, I get:
>>>
>>> worker time --> same in both cases
>>>
>>> case: mapper --> overhead of ~32% relative to the worker process (same
>>> for a bigger chunk size)
>>> case: sequential --> overhead of ~15% relative to the worker process
>>>
>>> It shouldn't be that much slower; since the file is non-splittable, the
>>> mapper will be executed where the data is stored by HDFS, won't it?
>>> Where did those 17% go? How can I reduce this? Does Hadoop need the whole
>>> time for reading or streaming the data out of HDFS?
>>>
>>> I would appreciate your help,
>>>
>>> Greetings
>>> mk
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>


-- 
Bertrand Dechoux
