From: Varun Sharma <varun@pinterest.com>
To: "user@hbase.apache.org"
Date: Tue, 24 Sep 2013 15:16:38 -0700
Subject: Re: Is there a problem with having 4000 tables in a cluster?

It's better to do some "salting" in your keys for the reduce phase. Basically,
make your key something like "KeyHash + Key", then decode it in your reducer
and write to HBase. This way you avoid the hotspotting problem on HBase caused
by MapReduce sorting. (A rough sketch of this follows the quoted thread below.)

On Tue, Sep 24, 2013 at 2:50 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:

> Hi Jeremy,
>
> I don't see any issue with HBase handling 4000 tables. However, I don't
> think it's the best solution for your use case.
>
> JM
>
>
> 2013/9/24 jeremy p
>
> > Short description: I'd like to have 4000 tables in my HBase cluster. Will
> > this be a problem? In general, what problems do you run into when you try
> > to host thousands of tables in a cluster?
> >
> > Long description: I'd like the performance advantage of pre-split tables,
> > and I'd also like to do filtered range scans. Imagine a keyspace where the
> > key consists of [POSITION]_[WORD], where POSITION is a number from 1 to
> > 4000 and WORD is a string of 96 characters. The value in the cell would be
> > a single integer. My app will examine a 'document', where each 'line'
> > consists of 4000 WORDs. For each WORD, it'll do a filtered regex lookup.
> > The only problem? Say I have 200 mappers and they all start at POSITION 1;
> > my region servers would get hotspotted like crazy. So my idea is to break
> > it into 4000 tables (one for each POSITION) and then pre-split the tables
> > such that each region gets an equal amount of the traffic. In this
> > scenario, the key would just be WORD. Dunno if this is a bad idea; I'd be
> > open to suggestions.
> >
> > Thanks!
> >
> > --J
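
For reference, a minimal sketch of the "KeyHash + Key" salting described above.
It is only an illustration under assumptions not stated in the thread: a
fixed-width two-character hex salt derived from the key's own hash, 256 salt
buckets, and the hypothetical helper names addSalt/stripSalt.

// SaltedKeys.java -- illustrative only; bucket count and helper names are assumptions.
public class SaltedKeys {

    // Number of salt buckets; something on the order of your region count.
    private static final int BUCKETS = 256;

    // Prefix the key with a deterministic two-character hex salt ("KeyHash + Key").
    public static String addSalt(String key) {
        int bucket = (key.hashCode() & Integer.MAX_VALUE) % BUCKETS;
        return String.format("%02x%s", bucket, key);
    }

    // Strip the salt again in the reducer before writing the real row key to HBase.
    public static String stripSalt(String saltedKey) {
        return saltedKey.substring(2);
    }

    public static void main(String[] args) {
        String salted = addSalt("0001_someword");
        System.out.println(salted);            // e.g. "9a0001_someword"
        System.out.println(stripSalt(salted)); // "0001_someword"
    }
}

The mapper would emit the salted key, so the shuffle spreads key ranges across
reducers instead of every reducer marching through the same POSITION in lock
step; the reducer strips the salt before building its Put.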
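
And since both replies point toward one salted, pre-split table rather than
4000 tables, here is a sketch of creating such a table with split points on
the salt-bucket boundaries, using the older HBaseAdmin client API that was
current around this thread. The table name, family name, and region count are
made up for the example.

// PreSplitTable.java -- illustrative only; names and region count are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("words");
        desc.addFamily(new HColumnDescriptor("d"));

        // One region per group of salt buckets: split points at hex prefixes
        // "10", "20", ..., "f0" so 256 two-character salts spread over 16 regions.
        int regions = 16;
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            splits[i - 1] = Bytes.toBytes(String.format("%02x", i * (256 / regions)));
        }

        admin.createTable(desc, splits);
        admin.close();
    }
}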