Subject: Re: Data locality of map-side join
From: Bertrand Dechoux
To: user@hadoop.apache.org
Date: Thu, 25 Oct 2012 09:17:22 +0200

One underlying issue is that you would like your tool to be able to
detect which dataset is the largest and how large it is, because with
this information different strategies can be chosen. This implies that
your tool somehow needs to create/keep/update statistics about your
datasets. That is clearly something which is relevant for an external
tool (like Hive or Pig), but it might not make sense to build it into
the core mapred/mapreduce: it would increase coupling for something
which is not necessarily relevant for the core of the platform.

I know about Hive, and you could be interested in reading more about it:
https://cwiki.apache.org/Hive/statsdev.html

> Statistics such as the number of rows of a table or partition and the
> histograms of a particular interesting column are important in many
> ways. One of the key use cases of statistics is query optimization.
> Statistics serve as the input to the cost functions of the optimizer
> so that it can compare different plans and choose among them.
> Statistics may sometimes meet the purpose of the users' queries.
> Users can quickly get the answers for some of their queries by only
> querying stored statistics rather than firing long-running execution
> plans. Some examples are getting the quantile of the users' age
> distribution, the top 10 apps that are used by people, and the number
> of distinct sessions.

I don't know if Pig has something similar.

Regards

Bertrand

On Thu, Oct 25, 2012 at 7:49 AM, Harsh J wrote:
> Hi Sigurd,
>
> From what I've generally noticed, the client-end frameworks (Hive,
> Pig, etc.) have gotten much more cleverness and efficiency packed
> into their join parts than the MR join package, which probably exists
> to serve as an example or utility today more than anything else (but
> works well for what it does).
>
> Per the code in the join package, there are no such estimates made
> today. There is zero use of DistributedCache - the only decisions are
> made based on the expression (i.e. to select which form of joining
> record reader to use).
>
> Enhancements to this may be accepted though, so feel free to file
> some JIRAs if you have something to suggest/contribute. Hopefully one
> day we could have a unified library between client-end tools for
> common use-cases such as joins, etc. over MR, but there isn't such a
> thing right now (AFAIK).
>
> On Tue, Oct 23, 2012 at 2:52 PM, Sigurd Spieckermann wrote:
> > Interesting to know that Hive and Pig are doing something in this
> > direction. I'm dealing with the Hadoop join-package, which doesn't
> > use DistributedCache though; rather, it pulls the other partition
> > over the network before launching the map task. This is under the
> > assumption that both partitions are too big to load into the DC, or
> > that it's just undesirable to use the DC. Is there a similar
> > mechanism implemented in the join-package that considers the size
> > of the two partitions to be joined, trying to execute the map task
> > on the datanode that holds the bigger partition?
> >
> >
> > 2012/10/23 Bejoy KS
> >>
> >> Hi Sigurd
> >>
> >> Map-side joins are efficiently implemented in Hive and Pig. I'm
> >> talking in terms of how map-side joins are implemented in Hive.
> >>
> >> In a map-side join, the smaller data set is first loaded into the
> >> DistributedCache. The larger dataset is streamed as usual while
> >> the smaller dataset is held in memory. For every record in the
> >> larger data set, a lookup is made in memory on the smaller set,
> >> and thereby the joins are done.
> >>
> >> In later versions of Hive, the framework itself intelligently
> >> determines the smaller data set. In older versions you can specify
> >> the smaller data set using some hints in the query.
> >>
> >>
> >> Regards
> >> Bejoy KS
> >>
> >> Sent from handheld, please excuse typos.
> >>
> >> -----Original Message-----
> >> From: Sigurd Spieckermann
> >> Date: Mon, 22 Oct 2012 22:29:15
> >> To: user@hadoop.apache.org
> >> Reply-To: user@hadoop.apache.org
> >> Subject: Data locality of map-side join
> >>
> >> Hi guys,
> >>
> >> I've been trying to figure out whether a map-side join using the
> >> join-package does anything clever regarding data locality with
> >> respect to at least one of the partitions to join. To be more
> >> specific, if I want to join two datasets and some partition of
> >> dataset A is larger than the corresponding partition of dataset B,
> >> does Hadoop account for this and try to ensure that the map task
> >> is executed on the datanode storing the bigger partition, thus
> >> reducing data transfer (if the other partition does not happen to
> >> be located on that same datanode)? I couldn't conclude one
> >> behavior or the other from the source code, and I couldn't find
> >> any documentation about this detail.
> >>
> >> Thanks for clarifying!
> >> Sigurd
>
> --
> Harsh J

--
Bertrand Dechoux
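[Editorial note: Bejoy's description of the map-side join (smaller data set held in memory after loading it from the DistributedCache, larger data set streamed with a per-record in-memory lookup) is essentially a hash join. A minimal sketch, deliberately outside any Hadoop API, with all names illustrative:

```python
def map_side_join(larger, smaller):
    """Join two iterables of (key, value) pairs; `smaller` must fit in memory.

    This mimics what a map task does in a Hive-style map-side join:
    build a lookup table from the small side once (as setup() would,
    after pulling it from the DistributedCache), then stream the large
    side and probe the table for every record.
    """
    lookup = {}
    for key, value in smaller:
        lookup.setdefault(key, []).append(value)
    # Stream the larger dataset; each record probes the in-memory table.
    for key, left in larger:
        for right in lookup.get(key, []):
            yield key, left, right

users = [(1, "alice"), (2, "bob"), (3, "carol")]   # larger side, streamed
cities = [(1, "Paris"), (3, "Berlin")]             # smaller side, in memory
print(list(map_side_join(users, cities)))
# [(1, 'alice', 'Paris'), (3, 'carol', 'Berlin')]
```

The key property is that no shuffle or sort is needed: each map task can emit joined records directly, at the cost of replicating the small side to every task.]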
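[Editorial note: Bertrand's point about dataset statistics feeding strategy selection can be sketched as a toy planner rule. The threshold and strategy names below are illustrative assumptions, not taken from Hive's or any real optimizer's code:

```python
# Assumed memory a single map task can spare for the small side.
MEMORY_BUDGET = 256 * 1024 * 1024  # 256 MiB, illustrative

def choose_join_strategy(size_a, size_b):
    """Pick a join strategy from the sizes (in bytes) of the two datasets.

    This is the decision that table/partition statistics make possible:
    without knowing which side is smaller and how small it is, a planner
    cannot safely choose the map-side variant.
    """
    if min(size_a, size_b) <= MEMORY_BUDGET:
        # Small side fits in memory: replicate it to every map task
        # (DistributedCache style) and join map-side, skipping the shuffle.
        return "map-side (small side in memory)"
    # Neither side fits in memory: fall back to a shuffle-based join.
    return "reduce-side (shuffle both sides)"
```

For example, joining a 1 TB fact dataset against a 1 MB dimension dataset would select the map-side strategy, while two 1 TB datasets would fall back to a reduce-side join.]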
