Subject: Re: how to enhance job start up speed?
From: Bertrand Dechoux
To: user@hadoop.apache.org
Date: Mon, 13 Aug 2012 15:57:47 +0200

I am not sure I understand, and I guess I am not the only one.

1) What is a "worker" in your context? Only the logic inside your Mapper, or something else?
2) You should clarify your cases. You seem to have two cases, but both are stated as overhead, so I am assuming there is a baseline? Hadoop vs. sequential, so sequential means without Hadoop?
3) What is the size of the file?

Bertrand

On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <matthias.mk.kricke@gmail.com> wrote:

> Hello all,
>
> I'm using CDH3u3.
> If I want to process one file, set to non-splittable, Hadoop starts one
> Mapper and no Reducer (that's fine for this test scenario). The Mapper
> goes through a configuration step where some variables for the worker
> inside the Mapper are initialized.
> The Mapper then gives me K,V pairs, which are lines of an input file. I
> process the V with the worker.
>
> When I compare the runtime of Hadoop to the runtime of the same process
> run sequentially, I get:
>
> worker time --> same in both cases
>
> case: mapper --> overhead of ~32% over the worker process (same for
> bigger chunk sizes)
> case: sequential --> overhead of ~15% over the worker process
>
> It shouldn't be that much slower: because the file is non-splittable,
> the Mapper will be executed where the data is stored by HDFS, won't it?
> Where did those 17% go? How can I reduce this? Does Hadoop need the
> whole time for reading or streaming the data out of HDFS?
>
> I would appreciate your help,
>
> Greetings
> mk

--
Bertrand Dechoux
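Regarding the overhead question: much of Hadoop's per-job cost (JVM spawn, task scheduling, opening the HDFS stream) is roughly fixed per task, while the worker's cost grows with input size, so the overhead's share of total runtime shrinks as the worker does more work. A minimal sketch of that amortization, with purely illustrative timings (the 20-second startup figure is an assumption for the sake of the example, not a measured Hadoop value):

```python
# Toy model: fixed job-startup overhead vs. per-record worker time.
# The share of runtime spent outside the worker falls as the input
# (and thus worker time) grows, which is why a small single-split
# file shows a large relative overhead.

def overhead_fraction(fixed_startup_s: float, worker_s: float) -> float:
    """Fraction of total runtime spent outside the worker."""
    total = fixed_startup_s + worker_s
    return fixed_startup_s / total

# Same hypothetical 20 s fixed cost against growing worker time:
for worker_s in (60.0, 600.0, 6000.0):
    frac = overhead_fraction(20.0, worker_s)
    print(f"worker={worker_s:7.0f}s  overhead={frac:.1%}")
```

On a tiny test file the fixed cost dominates, so a ~32% overhead is not surprising; rerunning the comparison on a much larger input (or reusing JVMs across tasks) is the usual way to see whether the gap is startup cost or genuine per-byte streaming cost.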