hadoop-common-user mailing list archives

From Kiet Tran <ktt...@gmail.com>
Subject Advantage/disadvantage of dbm vs join vs HBase
Date Sun, 07 Jun 2015 21:53:16 GMT

I have a roughly 5 GB file in which each row is a key-value pair. I
would like to use it as a "hashmap" against another large set of
files. From searching around, one way to do this would be to turn it
into a dbm like DBD and put it into the distributed cache. Another is
to join the data. A third is to put it into HBase and use it for

I'm more familiar with the first approach, so it seems simpler to me.
However, I have read that using a distributed cache for files beyond a
few megabytes is not recommended because the file is replicated across
all the data nodes. This doesn't seem that bad to me because I just
pay this overhead once at the beginning of the job, and then each node
gets a copy locally, right? If I were to go with a join, would it not
increase the workload (more entries) and create the same network
congestion issue? And wouldn't going with HBase mean making it a

What are the advantages and disadvantages of each solution compared
with the others? What if, for example, the "hashmap" had to come from,
say, a 40 GB file? How would my options change? At what point would
each option make sense?
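
For reference, a minimal sketch of the first approach (it is not part of
the original message): the key-value file is converted once into a dbm
file, and each task opens its local copy read-only after the distributed
cache has placed it on the node. The sketch uses Python's standard dbm
module, as one might with Hadoop Streaming. The function names, the
tab-separated input format, and the paths are assumptions made for
illustration only.

```python
import dbm


def build_dbm(pairs_path, db_path):
    """Convert a tab-separated key/value text file into a dbm file.

    Hypothetical helper: run once, before the job is submitted.
    """
    with open(pairs_path, encoding="utf-8") as src, dbm.open(db_path, "n") as db:
        for line in src:
            key, _, value = line.rstrip("\n").partition("\t")
            # dbm stores bytes, so encode explicitly.
            db[key.encode("utf-8")] = value.encode("utf-8")


def lookup_records(records, db_path):
    """Map-side lookup: yield (key, value) for each record key found in the dbm.

    Hypothetical helper: db_path is assumed to be the node-local copy
    of the file that the distributed cache shipped with the job.
    """
    with dbm.open(db_path, "r") as db:
        for record in records:
            key = record.encode("utf-8")
            if key in db:
                yield record, db[key].decode("utf-8")
```

Because the dbm file is read from disk on demand, the lookup table does
not have to fit in a task's memory, although every node still receives
a full copy of the file.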

Kiet Tran
