Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of dechouxb@gmail.com designates
 209.85.216.176 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAOcnVr1Wpgt9Gpj2=DUC+8oiyZCotiNPcbkNvjxRmEegJXfLRA@mail.gmail.com>
References: 
 <CAE636z_Dd9AkDupCyUmSZ-RuaxUvw9RU8SKCn8QVNsXkxMhRiA@mail.gmail.com>
	<5B24054F-762B-43EA-824F-9E0641B84584@123.org>
	<CAE636z8fAcBpM9GVbaCK5TOLvbZG3n5mfR40ARNupo-CamrLoA@mail.gmail.com>
	<CAOcnVr1Wpgt9Gpj2=DUC+8oiyZCotiNPcbkNvjxRmEegJXfLRA@mail.gmail.com>
Date: Fri, 14 Sep 2012 09:31:23 +0200
Message-ID: 
 <CAO6W-2fwaVj4OQ+taKKR2rHYGN-NC0jfYMEsj8S+2vFctxs4mw@mail.gmail.com>
Subject: Re: What's the basic idea of pseudo-distributed Hadoop ?
From: Bertrand Dechoux <dechouxb@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=20cf303b3f452bad1604c9a46985

--20cf303b3f452bad1604c9a46985
Content-Type: text/plain; charset=ISO-8859-1

The only difference between pseudo-distributed and fully distributed would
be scale. You could say that code that runs fine on the former also runs
fine on the latter. But that does not necessarily mean that performance will
scale the same way (e.g. if you keep a list of elements in memory, at a
bigger scale you could get an OutOfMemoryError).
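
To make the memory point concrete, here is a minimal sketch in plain Java
(no Hadoop classes). The class and method names are made up for
illustration, and the iterator only stands in for the values a reducer
receives for one key. Both methods return the same sum, but the first one
keeps every value on the heap, so its memory use grows with the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; not part of the Hadoop API.
class ReduceMemorySketch {

    /** Buffers every value before summing: heap use grows with the number of values. */
    static long sumBuffered(Iterator<Long> values) {
        List<Long> buffer = new ArrayList<>();
        while (values.hasNext()) {
            buffer.add(values.next());
        }
        long sum = 0;
        for (long v : buffer) {
            sum += v;
        }
        return sum;
    }

    /** Folds each value into a running total: heap use stays constant. */
    static long sumStreaming(Iterator<Long> values) {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(1L, 2L, 3L, 4L);
        System.out.println(sumBuffered(values.iterator()));
        System.out.println(sumStreaming(values.iterator()));
    }
}
```

On a small test input both versions behave the same, which is why this kind
of problem tends to show up only at a bigger scale.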

Of course, as has been implied in previous answers, you can't say the same
of standalone mode. In that mode, you could rely on global mutable static
state and think it's fine, without caring about how the work is distributed
between the nodes. In that case, the same code launched in
pseudo-distributed mode will fail to reproduce the same results.
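
A minimal sketch of the static-state pitfall, again in plain Java. It is a
simulation, not Hadoop code: the hypothetical SimulatedJvm class stands in
for the static fields of one JVM, and I am assuming each task gets its own
JVM outside standalone mode (i.e. no JVM reuse is configured).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Hadoop API.
class StaticStateSketch {

    /** Stands in for the static fields of one JVM. */
    static final class SimulatedJvm {
        int recordsSeen = 0; // plays the role of a "global" static counter

        void runTask(List<String> split) {
            recordsSeen += split.size();
        }
    }

    /** Standalone mode: all tasks run in one JVM and share the counter. */
    static int countInStandalone(List<List<String>> splits) {
        SimulatedJvm onlyJvm = new SimulatedJvm();
        for (List<String> split : splits) {
            onlyJvm.runTask(split);
        }
        return onlyJvm.recordsSeen;
    }

    /** Separate task JVMs: the counter starts from zero in each one. */
    static List<Integer> countInSeparateJvms(List<List<String>> splits) {
        List<Integer> perTask = new ArrayList<>();
        for (List<String> split : splits) {
            SimulatedJvm taskJvm = new SimulatedJvm();
            taskJvm.runTask(split);
            perTask.add(taskJvm.recordsSeen);
        }
        return perTask;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f"));
        System.out.println("standalone total: " + countInStandalone(splits));
        System.out.println("per-task totals:  " + countInSeparateJvms(splits));
    }
}
```

In the simulated standalone run the counter reaches the total number of
records; with one JVM per task, no single counter ever holds that total, so
any logic that relies on it gives different results.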

Regards

Bertrand

On Fri, Sep 14, 2012 at 9:24 AM, Harsh J <harsh@cloudera.com> wrote:

> Hi Jason,
>
> I think you're confusing the standalone mode with a pseudo-distributed
> mode. The former is a limited mode of MR where no daemons need to be
> deployed and the tasks run in a single JVM (via threads).
>
> A pseudo distributed cluster is a cluster where all daemons are
> running on one node itself. Hence, not "distributed" in the sense of
> multi-nodes (no use of any network gear) but works in the same way
> between nodes (RPC, etc.) as a fully-distributed one.
>
> If an MR program works fine in a pseudo-distributed mode, it "should"
> work (no guarantee) fine in a fully-distributed mode iff all nodes
> have the same arch/OS, same JVM, and job-specific configurations. This
> is because tasks execute on various nodes and may be affected by the
> node's behavior or setup that is different from others - and that's
> something you'd have to detect/know about if it exhibits failures more
> than others.
>
> On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang <lin.yang.jason@gmail.com>
> wrote:
> > Hey, Kai
> >
> > Thanks for your reply.
> >
> > I was wondering what's the difference between pseudo-distributed and
> > fully-distributed Hadoop, except the maximum number of map/reduce tasks.
> >
> > And if an MR program works fine in a pseudo-distributed cluster, will it
> > work exactly as well in a fully-distributed cluster?
> >
> >
> > 2012/9/14 Kai Voigt <k@123.org>
> >>
> >> The default setting is that a tasktracker can run up to two map and reduce
> >> tasks in parallel (mapred.tasktracker.map.tasks.maximum and
> >> mapred.tasktracker.reduce.tasks.maximum), so you will actually see some
> >> concurrency on your one machine.
> >
> >
> >
> >
> > --
> > YANG, Lin
> >
>
>
>
> --
> Harsh J
>


-- 
Bertrand Dechoux

--20cf303b3f452bad1604c9a46985--