hadoop-mapreduce-user mailing list archives

From Kiet Tran <ktt...@gmail.com>
Subject Re: Advantage/disadvantage of dbm vs join vs HBase
Date Mon, 08 Jun 2015 01:59:53 GMT
Nope. I have never used HBase before. I'm also new to Hadoop in
general. I'll be running the MapReduce job on EMR.

Disregarding what I'm familiar with, I'd also like to know when one
would do one thing vs another. Maybe it's something we can only tell
from experimenting around, but it sounds like a problem others have
run into before.

Sincerely,
Kiet Tran

On Sun, Jun 7, 2015 at 8:34 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> Do you have HBase running in your cluster?
>
> I ask this because bringing HBase as a new component into your deployment
> incurs operational overhead which you may not be familiar with.
>
> Cheers
>
> On Sun, Jun 7, 2015 at 2:53 PM, Kiet Tran <ktt3ja@gmail.com> wrote:
>>
>> Hi,
>>
>> I have a roughly 5 GB file where each row is a key, value pair. I
>> would like to use this as a "hashmap" against another large set of
>> files. From searching around, one way to do it would be to turn it into
>> a dbm like DBD and put it into a distributed cache. Another is by
>> joining the data. A third one is putting it into HBase and using it for
>> lookup.
>>
>> I'm more familiar with the first approach, so it seems simpler to me.
>> However, I have read that using a distributed cache for files beyond a
>> few megabytes is not recommended because the file is replicated across
>> all the data nodes. This doesn't seem that bad to me because I just
>> pay this overhead once at the beginning of the job, and then each node
>> gets a copy locally, right? If I were to go with join, would it not
>> increase the workload (more entries) and create the same network
>> congestion issue? And wouldn't going with HBase mean making it a
>> bottleneck?
>>
>> What are the advantages and disadvantages of going for one solution
>> over the others? What if, for example, that "hashmap" needed to come
>> from, say, a 40 GB file? How would my options change? At which point
>> would each option make sense?
>>
>> Sincerely,
>> Kiet Tran
>
>
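
To make the first two options in the thread concrete, here is a minimal
sketch in plain Python (not the Hadoop API) that contrasts a map-side
lookup against a local dbm file with a reduce-side join. The function names
(build_dbm, map_side_lookup, reduce_side_join) and the sample data are
hypothetical and only for illustration. It assumes each key appears at most
once in the lookup file, and it simulates the shuffle with an in-memory sort.

```python
import dbm
import os
import tempfile
from itertools import groupby
from operator import itemgetter


def build_dbm(pairs, path):
    """Write (key, value) pairs to an on-disk dbm file (the "hashmap")."""
    with dbm.open(path, "n") as db:
        for key, value in pairs:
            db[key] = value  # str keys and values are stored as bytes


def map_side_lookup(records, path):
    """Option 1: every mapper opens its local copy of the dbm and probes it."""
    with dbm.open(path, "r") as db:
        for key, payload in records:
            hit = db.get(key)  # disk-backed probe of the local dbm file
            if hit is not None:
                yield key, payload, hit.decode("utf-8")


def reduce_side_join(records, pairs):
    """Option 2: tag both inputs, sort by key (the shuffle), join per key."""
    tagged = [(k, 0, v) for k, v in pairs] + [(k, 1, p) for k, p in records]
    tagged.sort(key=itemgetter(0, 1))  # lookup rows (tag 0) come first per key
    for key, group in groupby(tagged, key=itemgetter(0)):
        value = None
        for _, tag, item in group:
            if tag == 0:
                value = item  # remember the lookup value for this key
            elif value is not None:
                yield key, item, value  # emit only records with a match


if __name__ == "__main__":
    pairs = [("a", "1"), ("b", "2")]
    records = [("a", "x"), ("c", "y"), ("b", "z")]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "lookup")
        build_dbm(pairs, path)
        print(list(map_side_lookup(records, path)))
    print(list(reduce_side_join(records, pairs)))
```

In the first function the lookup file has to be present on every node that
runs a mapper, but the large input is never moved by key. In the second,
both inputs pass through the sort, which in a real job corresponds to
shuffling them across the network.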
