Date: Fri, 14 Sep 2012 09:31:23 +0200
Subject: Re: What's the basic idea of pseudo-distributed Hadoop ?
From: Bertrand Dechoux
To: user@hadoop.apache.org

The only difference between pseudo-distributed and fully distributed mode would be scale. You could say that code that runs fine on the former also runs fine on the latter. But that does not necessarily mean that performance will scale the same way (e.g. if you keep a list of elements in memory, at a bigger scale you could get an OutOfMemoryError).
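
To make the memory point concrete, here is a minimal sketch in plain Java. It does not use the Hadoop API; the class and method names are hypothetical and only mimic what a reducer does with the values of one key. Both methods return the same sum, but the first holds every value in memory before summing, which is the pattern that may work on a small test input and run out of heap on a large one.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.LongStream;

// Hypothetical illustration, not Hadoop code.
public class ReduceMemorySketch {

    // Buffers every value before summing: memory grows with the number
    // of values, so a very large group of values can exhaust the heap.
    static long sumBuffered(Iterator<Long> values) {
        List<Long> buffer = new ArrayList<>();
        while (values.hasNext()) {
            buffer.add(values.next());
        }
        long total = 0;
        for (long v : buffer) {
            total += v;
        }
        return total;
    }

    // Keeps only a running total: memory use stays constant no matter
    // how many values arrive.
    static long sumStreaming(Iterator<Long> values) {
        long total = 0;
        while (values.hasNext()) {
            total += values.next();
        }
        return total;
    }

    public static void main(String[] args) {
        // Same input, same result; only the memory profile differs.
        System.out.println(sumBuffered(LongStream.rangeClosed(1, 1000).iterator()));
        System.out.println(sumStreaming(LongStream.rangeClosed(1, 1000).iterator()));
    }
}
```

With a thousand values the two are indistinguishable; the difference would only show up once a single group no longer fits in the task's heap.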

Of course, as has been implied in previous answers, you can't say the same of standalone mode. In this mode, you could use global mutable static state and think it's fine, without caring about distribution between the nodes. In that case, the same code launched in pseudo-distributed mode will fail to reproduce the same results.
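
The static-state pitfall can be simulated in plain Java without Hadoop. This is only a sketch under two assumptions: each task outside standalone mode runs in its own JVM, and a JvmState object (a hypothetical name) stands in for that JVM's copy of a static field. In the standalone case every task updates the same copy; in the other case each task starts from a fresh copy, so no task ever sees the global total.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration, not Hadoop code.
public class StaticStateSketch {

    // Stand-in for one JVM's copy of a "global" static counter.
    static final class JvmState {
        final AtomicLong recordsSeen = new AtomicLong();
    }

    // A task adds the size of its input split to the state of the JVM it runs in.
    static void runTask(JvmState jvm, List<String> split) {
        jvm.recordsSeen.addAndGet(split.size());
    }

    // Standalone mode: all tasks share one JVM, hence one copy of the state.
    static long standaloneTotal(List<List<String>> splits) {
        JvmState shared = new JvmState();
        for (List<String> split : splits) {
            runTask(shared, split);
        }
        return shared.recordsSeen.get();
    }

    // Assumed distributed behaviour: every task gets a fresh JVM, so the
    // counter seen by the last task only reflects that task's own split.
    static long lastTaskView(List<List<String>> splits) {
        JvmState last = new JvmState();
        for (List<String> split : splits) {
            last = new JvmState();
            runTask(last, split);
        }
        return last.recordsSeen.get();
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"));
        System.out.println(standaloneTotal(splits)); // shared state: all 3 records
        System.out.println(lastTaskView(splits));    // isolated state: only the last split
    }
}
```

State that must be aggregated across tasks has to travel through the framework (job output, counters and the like) rather than through a static field.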

Regards

Bertrand

On Fri, Sep 14, 2012 at 9:24 AM, Harsh J <harsh@cloudera.com> wrote:
Hi Jason,

I think you're confusing the standalone mode with a pseudo-distributed
mode. The former is a limited mode of MR where no daemons need to be
deployed and the tasks run in a single JVM (via threads).

A pseudo-distributed cluster is a cluster where all daemons are
running on one node itself. Hence, it is not "distributed" in the sense of
multiple nodes (no use of any network gear), but the daemons communicate
in the same way (RPC, etc.) as in a fully-distributed cluster.

If an MR program works fine in pseudo-distributed mode, it "should"
work fine (no guarantee) in fully-distributed mode iff all nodes
have the same arch/OS, the same JVM, and the same job-specific
configurations. This is because tasks execute on various nodes and may
be affected by a node's behavior or setup that differs from the others,
and that's something you'd have to detect/know about if a node exhibits
failures more than others.

On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang <lin.yang.jason@gmail.com> wrote:
> Hey, Kai
>
> Thanks for your reply.
>
> I was wondering what the difference is between pseudo-distributed and
> fully-distributed Hadoop, apart from the maximum number of map/reduce tasks.
>
> And if an MR program works fine in a pseudo-distributed cluster, will it
> work just as well in a fully-distributed cluster ?
>
>
> 2012/9/14 Kai Voigt <k@123.org>
>>
>> The default setting is that a tasktracker can run up to two map and reduce
>> tasks in parallel (mapred.tasktracker.map.tasks.maximum and
>> mapred.tasktracker.reduce.tasks.maximum), so you will actually see some
>> concurrency on your one machine.
>
>
>
>
> --
> YANG, Lin
>



--
Harsh J



--
Bertrand Dechoux