Subject: Advantage/disadvantage of dbm vs join vs HBase
From: Kiet Tran
To: user@hadoop.apache.org
Date: Sun, 7 Jun 2015 17:53:16 -0400

Hi,

I have a roughly 5 GB file where each row is a key-value pair. I would like to use it as a "hashmap" against another large set of files. From searching around, one way to do this is to turn it into a dbm such as DBD and put it into the distributed cache. Another is to join the data sets. A third is to load it into HBase and use that for lookups.

I'm most familiar with the first approach, so it seems simplest to me. However, I have read that using the distributed cache for files beyond a few megabytes is not recommended, because the file is replicated to every data node. That doesn't seem so bad to me: I pay that overhead once at the start of the job, and then each node has a local copy, right?

If I went with a join instead, wouldn't it increase the workload (more entries) and create the same network congestion? And wouldn't going with HBase mean making it a bottleneck?
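To make the first approach concrete, here is a minimal sketch of the dbm idea using Python's standard-library `dbm` module purely for illustration: build the lookup file once, ship it to each node (e.g., via the distributed cache), and open it read-only inside the task. The function names here are mine, not a Hadoop API.

```python
import dbm

def build_dbm(rows, path):
    """Write (key, value) string pairs into a new dbm file at `path`."""
    with dbm.open(path, "n") as db:  # "n": always create a fresh database
        for key, value in rows:
            db[key.encode()] = value.encode()

def lookup(path, key):
    """Return the value stored for `key`, or None if absent."""
    with dbm.open(path, "r") as db:  # read-only, as a map task would open it
        try:
            return db[key.encode()].decode()
        except KeyError:
            return None
```

(In a real mapper you would open the dbm file once in setup() rather than per lookup; the open/close here is just to keep the sketch self-contained.)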
What are the advantages and disadvantages of each solution over the others? And what if, for example, that "hashmap" had to come from a 40 GB file instead: how would my options change, and at what point does each one make sense?

Sincerely,
Kiet Tran