hadoop-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: how to enhance job start up speed?
Date Mon, 13 Aug 2012 16:07:58 GMT
It seems like you want to misuse Hadoop, but maybe I still don't understand
your context.

The standard way would be to split your files across multiple map tasks. Each
map task could profit from data locality. Do part of the worker logic in the
mapper and then use a reducer to aggregate all the results (which could be
another part of your worker). That way you would be able to parallelise
your worker logic on a file. You seem to avoid using a reducer in order to
lessen the network traffic. That's a valid concern, but reducers do have
their uses too.
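
To make the shape of that concrete, here is a minimal plain-Java sketch. It is
not the Hadoop API: the class SplitAndAggregate, its methods, and the word
count used as the "worker" are all hypothetical stand-ins. It only shows the
idea of running part of the worker per split and aggregating the partial
results afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the map-then-aggregate shape. NOT the Hadoop API;
// all names here are hypothetical.
class SplitAndAggregate {

    // "Map" side: a stand-in worker that counts the words in one split.
    static long countWords(List<String> splitLines) {
        long count = 0;
        for (String line : splitLines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                count += trimmed.split("\\s+").length;
            }
        }
        return count;
    }

    // Cut the input into consecutive splits of at most linesPerSplit lines.
    static List<List<String>> split(List<String> lines, int linesPerSplit) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSplit) {
            splits.add(lines.subList(i, Math.min(i + linesPerSplit, lines.size())));
        }
        return splits;
    }

    // "Reduce" side: aggregate the partial results of all splits.
    static long aggregate(List<Long> partials) {
        long total = 0;
        for (long partial : partials) {
            total += partial;
        }
        return total;
    }

    // Sequential stand-in for the job: in Hadoop, each split would be handled
    // by its own map task, ideally on a node that holds the data.
    static long run(List<String> lines, int linesPerSplit) {
        List<Long> partials = new ArrayList<>();
        for (List<String> s : split(lines, linesPerSplit)) {
            partials.add(countWords(s));
        }
        return aggregate(partials);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b c", "d e", "", "f");
        System.out.println(run(lines, 2));
    }
}
```

Only the partial counts (one number per split) would have to travel to the
reducer, which is why the extra network traffic can stay small for workers
of this kind.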


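A note on the numbers quoted further down in this thread: by the remark in
that message, "overhead" is the share of the whole process time not spent in
the worker (15% sequential, 32% with Hadoop). Assuming that definition and
the same worker time W in both cases, the small sketch below converts a share
into total run time relative to W. The class and method names are
hypothetical.

```java
// Converts an overhead share of total run time into total time relative to
// the worker time. Assumes overhead = (total - worker) / total, as in the
// remark quoted in this thread. Names are hypothetical.
class OverheadShare {

    // If overhead is a fraction o of the total time T and the worker takes W,
    // then W = (1 - o) * T, so T / W = 1 / (1 - o).
    static double totalOverWorker(double overheadShare) {
        if (overheadShare < 0.0 || overheadShare >= 1.0) {
            throw new IllegalArgumentException("share must be in [0, 1)");
        }
        return 1.0 / (1.0 - overheadShare);
    }

    public static void main(String[] args) {
        double sequential = totalOverWorker(0.15);
        double hadoop = totalOverWorker(0.32);
        System.out.printf("sequential: %.2f x worker time%n", sequential);
        System.out.printf("hadoop:     %.2f x worker time%n", hadoop);
        System.out.printf("hadoop / sequential: %.2f%n", hadoop / sequential);
    }
}
```

Under that reading, the sequential run takes about 1.18 W and the Hadoop run
about 1.47 W, so the Hadoop run is roughly 25% longer in wall-clock time even
though the shares differ by 17 percentage points.
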
On Mon, Aug 13, 2012 at 5:53 PM, Matthias Kricke <
matthias.mk.kricke@gmail.com> wrote:

> @Bejoy KS: Thanks for your advice.
> @Bertrand: It is parallelisable, this is just a test case. In later cases
> there will be many big files, each of which should be processed completely
> in one map step. We want to minimize the overhead of network traffic. The
> idea is to execute some worker (could be different stuff, e.g. wordcount,
> linecount, translation etc.) at the node where the file is located.
> If I get it right so far, we need to do several things: first, the chunk
> size should be as big as the file. Then the file is on a single node of the
> Hadoop cluster, am I right? And the file should be set to non-splittable.
> Do you have any more advice? Anyway, thanks so far!
> Greetings,
> MK
> 2012/8/13 Bertrand Dechoux <dechouxb@gmail.com>
>> It was almost what I was getting at but I was not sure about your
>> problem.
>> Basically, Hadoop is only adding overhead due to the way your job is
>> constructed.
>> Now the question is: why do you need a single mapper? Is your need truly
>> not 'parallelisable'?
>> Bertrand
>> On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <bejoy.hadoop@gmail.com> wrote:
>>> Hi Matthias
>>> When a MapReduce program is used there are some extra steps like
>>> checking the input and output dirs, calculating input splits, the
>>> JobTracker (JT) assigning a TaskTracker (TT) to execute the task etc.
>>> If your file is non-splittable, then one map task per file will be
>>> generated irrespective of the number of HDFS blocks. Some blocks will
>>> be on a different node than the one where the map task is executed, so
>>> time will be spent on the network transfer.
>>> In your case MR would be an overhead: your file is non-splittable, hence
>>> no parallelism, and there is also the overhead of copying blocks to the
>>> map task node.
>>> Regards
>>> Bejoy KS
>>> Sent from handheld, please excuse typos.
>>> ------------------------------
>>> From: Matthias Kricke <matthias.mk.kricke@gmail.com>
>>> Sender: matthias.zengler@gmail.com
>>> Date: Mon, 13 Aug 2012 16:33:06 +0200
>>> To: <user@hadoop.apache.org>
>>> ReplyTo: user@hadoop.apache.org
>>> Subject: Re: how to enhance job start up speed?
>>> OK, I will try to clarify:
>>> 1) The worker is the logic inside my mapper and is the same in both cases.
>>> 2) I have two cases. In the first one I use Hadoop to execute my worker,
>>> and in the second one I execute my worker without Hadoop (simple read of
>>> the file).
>>>    For both cases I measured the time the worker and the surroundings
>>> need (so I have two values for each case). The worker took the same time
>>> in both cases for the same input (this is expected). But the
>>> surroundings took 17% more time when using Hadoop.
>>> 3) ~3 GB.
>>> I want to know how to reduce this difference and where it comes from.
>>> I hope that helps. If not, feel free to ask again :)
>>> Greetings,
>>> MK
>>> P.S. Just for your information, I did the same test with Hypertable as
>>> well.
>>> I got:
>>>  * worker without anything: 15% overhead
>>>  * worker with Hadoop: 32% overhead
>>>  * worker with Hypertable: 53% overhead
>>> Remark: overhead was measured in comparison to the worker, e.g.
>>> Hypertable uses 53% of the whole process time, while the worker uses 47%.
>>> 2012/8/13 Bertrand Dechoux <dechouxb@gmail.com>
>>>> I am not sure I understand, and I guess I am not the only one.
>>>> 1) What's a worker in your context? Only the logic inside your Mapper
>>>> or something else?
>>>> 2) You should clarify your cases. You seem to have two cases but both
>>>> are expressed as overhead, so I am assuming there is a baseline? Hadoop
>>>> vs sequential, so sequential is not Hadoop?
>>>> 3) What is the size of the file?
>>>> Bertrand
>>>> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
>>>> matthias.mk.kricke@gmail.com> wrote:
>>>>> Hello all,
>>>>> I'm using CDH3u3.
>>>>> If I want to process one file, set to non-splittable, Hadoop starts one
>>>>> Mapper and no Reducer (that's OK for this test scenario). The Mapper
>>>>> goes through a configuration step where some variables for the worker
>>>>> inside the mapper are initialized.
>>>>> Now the Mapper gives me K,V pairs, which are lines of an input file. I
>>>>> process the V with the worker.
>>>>> When I compare the run time of Hadoop to the run time of the same
>>>>> process run sequentially, I get:
>>>>> worker time --> same in both cases
>>>>> case: mapper --> overhead of ~32% to the worker process (same for
>>>>> bigger chunk size)
>>>>> case: sequential --> overhead of ~15% to the worker process
>>>>> It shouldn't be that much slower: because the file is non-splittable,
>>>>> the mapper will be executed where the data is stored by HDFS, won't it?
>>>>> Where did those 17% go? How can I reduce this? Does Hadoop need the
>>>>> whole time for reading or streaming the data out of HDFS?
>>>>> I would appreciate your help,
>>>>> Greetings
>>>>> mk
>>>> --
>>>> Bertrand Dechoux
>> --
>> Bertrand Dechoux

Bertrand Dechoux
