Subject: Advantage/disadvantage of dbm vs join vs HBase
From: Kiet Tran
To: user@hadoop.apache.org
Date: Sun, 7 Jun 2015 17:53:16 -0400

Hi,

I have a roughly 5 GB file where each row is a key-value pair. I would like to use it as a "hashmap" against another large set of files. From searching around, one way to do this is to turn it into a dbm such as DBD and put it into the distributed cache. Another is to join the data sets. A third is to load it into HBase and use that for lookups.

I'm most familiar with the first approach, so it seems simplest to me. However, I have read that using the distributed cache for files beyond a few megabytes is not recommended, because the file is replicated to every data node. That doesn't seem so bad to me: I pay that overhead once at the start of the job, and then each node has a local copy, right?

If I went with a join instead, wouldn't it increase the workload (more entries) and create the same network congestion? And wouldn't going with HBase mean making it a bottleneck?
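To make the first approach concrete, here is a minimal sketch of the dbm idea using Python's standard-library `dbm` module purely for illustration: build the lookup file once, ship it to each node (e.g., via the distributed cache), and open it read-only inside the task. The function names here are mine, not a Hadoop API.

```python
import dbm

def build_dbm(rows, path):
    """Write (key, value) string pairs into a new dbm file at `path`."""
    with dbm.open(path, "n") as db:  # "n": always create a fresh database
        for key, value in rows:
            db[key.encode()] = value.encode()

def lookup(path, key):
    """Return the value stored for `key`, or None if absent."""
    with dbm.open(path, "r") as db:  # read-only, as a map task would open it
        try:
            return db[key.encode()].decode()
        except KeyError:
            return None
```

(In a real mapper you would open the dbm file once in setup() rather than per lookup; the open/close here is just to keep the sketch self-contained.)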
What are the advantages and disadvantages of each solution over the others? And what if, for example, that "hashmap" had to come from a 40 GB file instead: how would my options change, and at what point does each one make sense?

Sincerely,
Kiet Tran