Subject: Re: Advantage/disadvantage of dbm vs join vs HBase
From: Ted Yu
To: "common-user@hadoop.apache.org"
Date: Sun, 7 Jun 2015 17:34:35 -0700

Do you have HBase running in your cluster?

I ask because bringing HBase into your deployment as a new component incurs operational overhead that you may not be familiar with.

Cheers

On Sun, Jun 7, 2015 at 2:53 PM, Kiet Tran wrote:
> Hi,
>
> I have a roughly 5 GB file where each row is a key-value pair. I would
> like to use it as a "hashmap" against another large set of files. From
> searching around, one option is to turn it into a dbm like DBD and put
> it in the distributed cache. Another is to join the data. A third is to
> load it into HBase and use that for lookups.
>
> I'm more familiar with the first approach, so it seems simpler to me.
> However, I have read that the distributed cache is not recommended for
> files beyond a few megabytes, because the file is replicated across all
> the data nodes. That doesn't seem so bad to me: I pay the overhead once
> at the start of the job, and then each node has a local copy, right? If
> I went with a join instead, wouldn't it increase the workload (more
> entries) and create the same network congestion? And wouldn't going
> with HBase make it a bottleneck?
>
> What are the advantages and disadvantages of choosing one solution over
> the others? What if, for example, the "hashmap" had to come from a
> 40 GB file? How would my options change, and at which point would each
> option make sense?
>
> Sincerely,
> Kiet Tran
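[Editor's note: for the distributed-cache option described above, the work each mapper does is essentially an in-memory hash probe. A minimal sketch in plain Java of that map-side hash join, outside Hadoop itself; the tab-separated record format and the class/method names are assumptions for illustration:]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the distributed-cache strategy: the small side is loaded once
// into an in-memory map (what a Mapper would do in setup()), and the large
// side is then streamed through it record by record (the map() calls).
public class HashJoin {

    // Build the lookup table from "key<TAB>value" lines of the small file.
    static Map<String, String> buildLookup(List<String> smallSide) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : smallSide) {
            String[] kv = line.split("\t", 2);
            lookup.put(kv[0], kv[1]);
        }
        return lookup;
    }

    // Stream the large side; emit "key<TAB>leftValue<TAB>rightValue" for
    // every record whose key appears in the lookup table.
    static List<String> join(List<String> largeSide, Map<String, String> lookup) {
        List<String> out = new ArrayList<>();
        for (String line : largeSide) {
            String[] kv = line.split("\t", 2);
            String match = lookup.get(kv[0]);
            if (match != null) {
                out.add(kv[0] + "\t" + kv[1] + "\t" + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> lookup = buildLookup(List.of("a\t1", "b\t2"));
        // Only key "a" has a partner in the lookup table.
        System.out.println(join(List.of("a\tx", "c\ty"), lookup));
    }
}
```

[The catch the thread is circling around: this only works while the cached side fits in memory (or in a local dbm file) on every node, which is exactly what stops scaling somewhere between 5 GB and 40 GB.]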
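[Editor's note: on the join alternative raised in the question, a reduce-side join never holds either side in memory as a whole: the shuffle sorts both inputs by key and the reducers merge them in one pass, which is why it still works at 40 GB at the price of moving both datasets across the network. A minimal single-machine sketch of that sort-merge step, assuming unique keys per side (real reducers also handle duplicate keys); the class name and record layout are illustrative:]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of a reduce-side (sort-merge) join: sort both sides by key
// (Hadoop's shuffle phase does this), then merge in a single linear pass.
public class MergeJoin {

    // Each record is a {key, value} pair; keys are assumed unique per side.
    static List<String> join(List<String[]> left, List<String[]> right) {
        left.sort(Comparator.comparing((String[] r) -> r[0]));   // "shuffle"
        right.sort(Comparator.comparing((String[] r) -> r[0]));
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++;                      // left key has no partner
            } else if (cmp > 0) {
                j++;                      // right key has no partner
            } else {
                out.add(left.get(i)[0] + "\t" + left.get(i)[1]
                        + "\t" + right.get(j)[1]);
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> left = new ArrayList<>(List.of(
                new String[]{"b", "2"}, new String[]{"a", "1"}));
        List<String[]> right = new ArrayList<>(List.of(
                new String[]{"a", "x"}, new String[]{"c", "z"}));
        // Only key "a" appears on both sides.
        System.out.println(join(left, right));
    }
}
```

[By contrast, the HBase option replaces this batch merge with one remote Get per probe, trading shuffle bandwidth for per-lookup latency plus the operational overhead Ted mentions above.]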