Subject: Re: Advantage/disadvantage of dbm vs join vs HBase
From: Ted Yu
To: "common-user@hadoop.apache.org"
Date: Sun, 7 Jun 2015 17:34:35 -0700

Do you have HBase running in your cluster?

I ask because bringing HBase into your deployment as a new component incurs operational overhead that you may not be familiar with.

Cheers

On Sun, Jun 7, 2015 at 2:53 PM, Kiet Tran wrote:
> Hi,
>
> I have a roughly 5 GB file where each row is a key-value pair. I would
> like to use it as a "hashmap" against another large set of files. From
> searching around, one option is to turn it into a dbm like DBD and put
> it in the distributed cache. Another is to join the data. A third is to
> load it into HBase and use that for lookups.
>
> I'm more familiar with the first approach, so it seems simpler to me.
> However, I have read that the distributed cache is not recommended for
> files beyond a few megabytes, because the file is replicated across all
> the data nodes. That doesn't seem so bad to me: I pay the overhead once
> at the start of the job, and then each node has a local copy, right? If
> I went with a join instead, wouldn't it increase the workload (more
> entries) and create the same network congestion? And wouldn't going
> with HBase make it a bottleneck?
>
> What are the advantages and disadvantages of choosing one solution over
> the others? What if, for example, the "hashmap" had to come from a
> 40 GB file? How would my options change, and at which point would each
> option make sense?
>
> Sincerely,
> Kiet Tran
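[Editor's note: for the distributed-cache option described above, the work each mapper does is essentially an in-memory hash probe. A minimal sketch in plain Java of that map-side hash join, outside Hadoop itself; the tab-separated record format and the class/method names are assumptions for illustration:]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the distributed-cache strategy: the small side is loaded once
// into an in-memory map (what a Mapper would do in setup()), and the large
// side is then streamed through it record by record (the map() calls).
public class HashJoin {

    // Build the lookup table from "key<TAB>value" lines of the small file.
    static Map<String, String> buildLookup(List<String> smallSide) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : smallSide) {
            String[] kv = line.split("\t", 2);
            lookup.put(kv[0], kv[1]);
        }
        return lookup;
    }

    // Stream the large side; emit "key<TAB>leftValue<TAB>rightValue" for
    // every record whose key appears in the lookup table.
    static List<String> join(List<String> largeSide, Map<String, String> lookup) {
        List<String> out = new ArrayList<>();
        for (String line : largeSide) {
            String[] kv = line.split("\t", 2);
            String match = lookup.get(kv[0]);
            if (match != null) {
                out.add(kv[0] + "\t" + kv[1] + "\t" + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> lookup = buildLookup(List.of("a\t1", "b\t2"));
        // Only key "a" has a partner in the lookup table.
        System.out.println(join(List.of("a\tx", "c\ty"), lookup));
    }
}
```

[The catch the thread is circling around: this only works while the cached side fits in memory (or in a local dbm file) on every node, which is exactly what stops scaling somewhere between 5 GB and 40 GB.]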
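[Editor's note: on the join alternative raised in the question, a reduce-side join never holds either side in memory as a whole: the shuffle sorts both inputs by key and the reducers merge them in one pass, which is why it still works at 40 GB at the price of moving both datasets across the network. A minimal single-machine sketch of that sort-merge step, assuming unique keys per side (real reducers also handle duplicate keys); the class name and record layout are illustrative:]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of a reduce-side (sort-merge) join: sort both sides by key
// (Hadoop's shuffle phase does this), then merge in a single linear pass.
public class MergeJoin {

    // Each record is a {key, value} pair; keys are assumed unique per side.
    static List<String> join(List<String[]> left, List<String[]> right) {
        left.sort(Comparator.comparing((String[] r) -> r[0]));   // "shuffle"
        right.sort(Comparator.comparing((String[] r) -> r[0]));
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++;                      // left key has no partner
            } else if (cmp > 0) {
                j++;                      // right key has no partner
            } else {
                out.add(left.get(i)[0] + "\t" + left.get(i)[1]
                        + "\t" + right.get(j)[1]);
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> left = new ArrayList<>(List.of(
                new String[]{"b", "2"}, new String[]{"a", "1"}));
        List<String[]> right = new ArrayList<>(List.of(
                new String[]{"a", "x"}, new String[]{"c", "z"}));
        // Only key "a" appears on both sides.
        System.out.println(join(left, right));
    }
}
```

[By contrast, the HBase option replaces this batch merge with one remote Get per probe, trading shuffle bandwidth for per-lookup latency plus the operational overhead Ted mentions above.]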