hadoop-common-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Advantage/disadvantage of dbm vs join vs HBase
Date Mon, 08 Jun 2015 00:34:35 GMT
Do you have HBase running in your cluster?

I ask because bringing HBase into your deployment as a new component
incurs operational overhead that you may not be familiar with.


On Sun, Jun 7, 2015 at 2:53 PM, Kiet Tran <ktt3ja@gmail.com> wrote:

> Hi,
> I have a roughly 5 GB file where each row is a key-value pair. I
> would like to use it as a "hashmap" against another large set of
> files. From searching around, one way to do it would be to turn it into
> a dbm like DBD and put it into the distributed cache. Another is to
> join the data. A third is to put it into HBase and use it for
> lookups.
> I'm more familiar with the first approach, so it seems simpler to me.
> However, I have read that using a distributed cache for files beyond a
> few megabytes is not recommended because the file is replicated across
> all the data nodes. This doesn't seem that bad to me because I just
> pay this overhead once at the beginning of the job, and then each node
> gets a copy locally, right? If I were to go with join, would it not
> increase the workload (more entries) and create the same network
> congestion issue? And wouldn't going with HBase mean making it a
> bottleneck?
> What are the advantages and disadvantages of each solution compared
> with the others? What if, for example, the "hashmap" had to come from,
> say, a 40 GB file? How would my options change? At what point would
> each option make sense?
> Sincerely,
> Kiet Tran
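
For readers comparing the first two options in the quoted question, here is a minimal single-process sketch in plain Python, not Hadoop code. It assumes the lookup file is a sequence of (key, value) pairs and the large input is a sequence of (key, payload) records; the function names and the sample data are hypothetical. The first function mimics a map-side lookup against a table held in memory, as a distributed-cache file would allow. The second mimics a reduce-side join, where both inputs are grouped by key first.

```python
from collections import defaultdict


def map_side_lookup(lookup_pairs, records):
    """Load the whole lookup table into memory and probe it once per record.

    This stands in for reading a distributed-cache file inside each task;
    the table has to fit in that task's memory.
    """
    table = dict(lookup_pairs)
    return [(key, payload, table[key])
            for key, payload in records
            if key in table]


def reduce_side_join(lookup_pairs, records):
    """Group both inputs by key (the shuffle step), then pair them per key.

    Nothing needs to fit in memory as a whole, but every row of both
    inputs is moved to the group that owns its key.
    """
    groups = defaultdict(lambda: ([], []))
    for key, value in lookup_pairs:
        groups[key][0].append(value)
    for key, payload in records:
        groups[key][1].append(payload)

    joined = []
    for key, (values, payloads) in groups.items():
        for payload in payloads:
            for value in values:
                joined.append((key, payload, value))
    return joined


# Hypothetical sample data, for illustration only.
lookup = [("a", 1), ("b", 2)]
records = [("a", "x"), ("c", "y"), ("b", "z"), ("a", "w")]

by_lookup = sorted(map_side_lookup(lookup, records))
by_join = sorted(reduce_side_join(lookup, records))
```

Both functions return the same matched rows when keys in the lookup table are unique. The difference is where the cost falls: the lookup version needs the whole table in memory on every task, while the join version moves every row of both inputs during grouping.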
