Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Reply-To: user@hbase.apache.org
Date: Mon, 2 Apr 2012 13:10:47 -0700
Subject: Key Design Question for list data
From: Derek Wollenstein
To: user@hbase.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=20cf300fab4b2d66ea04bcb7c9d1

--20cf300fab4b2d66ea04bcb7c9d1
Content-Type: text/plain; charset=ISO-8859-1

We're looking at how to store a large amount of (per-user) list data in
HBase, and we're trying to figure out what kind of access pattern makes the
most sense.

One option is to store the majority of the data in the key, so we could have
something like

  :"" (no value)
  :"" (no value)
  :"" (no value)

The other option we had was to do this entirely using

  :...
  :...

where each row would contain multiple values.

So in one case reading the first thirty values would be

  scan { STARTROW => 'FixedWidthUsername', LIMIT => 30 }

And in the second case it would be

  get 'FixedWidthUsername\x00\x00\x00\x00'

The general usage pattern would be to read only the first 30 values of these
lists, with infrequent access reading deeper into the lists. Some users would
have <= 30 total values in these lists, and some users would have millions
(i.e. a power-law distribution).

The single-value format seems like it would take up more space in HBase, but
would offer improved retrieval / pagination flexibility. Would there be any
significant performance advantage to paginating via gets vs. paginating with
scans?
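To make the two layouts concrete, here is a minimal in-memory sketch in Python. It is not HBase client code: `SortedStore` is a hypothetical toy that only mimics byte-ordered row keys, and the key encodings (fixed-width username plus a 4-byte big-endian position or page number) and the page size of 30 are assumptions chosen for illustration. It also shows one detail a bare start-row-plus-limit scan needs to handle: a user with fewer than 30 entries would otherwise run into the next user's rows.

```python
import bisect
import struct

PAGE_SIZE = 30  # assumed page size, from the "first 30 values" usage pattern


class SortedStore:
    """Toy stand-in for a table whose row keys are kept in byte order."""

    def __init__(self):
        self._keys = []  # sorted list of row keys
        self._rows = {}  # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def scan(self, start_row, limit):
        """Return up to `limit` (key, value) pairs with key >= start_row."""
        i = bisect.bisect_left(self._keys, start_row)
        return [(k, self._rows[k]) for k in self._keys[i:i + limit]]


def write_per_item(store, user, items):
    # Layout 1 (assumed encoding): user + 4-byte position + item, empty value.
    for pos, item in enumerate(items):
        key = user.encode() + struct.pack(">I", pos) + item.encode()
        store.put(key, b"")


def first_page_per_item(store, user):
    prefix = user.encode()
    page = []
    for key, _ in store.scan(prefix, PAGE_SIZE):
        if not key.startswith(prefix):
            break  # scanned past this user's rows into the next user's
        page.append(key[len(prefix) + 4:].decode())
    return page


def write_paged(store, user, items):
    # Layout 2 (assumed encoding): user + 4-byte page number, one page per row.
    for page_no, start in enumerate(range(0, len(items), PAGE_SIZE)):
        key = user.encode() + struct.pack(">I", page_no)
        store.put(key, list(items[start:start + PAGE_SIZE]))


def first_page_paged(store, user):
    # A single point lookup on page 0, i.e. user + b"\x00\x00\x00\x00".
    return store.get(user.encode() + struct.pack(">I", 0)) or []
```

In this model both reads return the same first page; the difference is that the first touches up to 30 small rows through a range scan and the second touches one larger row through a point lookup. The sketch says nothing about which is faster on a real cluster.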
My initial understanding was that a scan should be faster if our page size is
unknown (and caching is set appropriately), but that gets should be faster if
we always need the same page size. I've ended up hearing different people
tell me opposite things about performance.

I assume the page sizes would be relatively consistent, so for most use cases
we could guarantee that we only want one page of data in the
fixed-page-length case. I would also assume that we would have infrequent
updates, but we may have inserts into the middle of these lists (meaning we'd
need to update all subsequent rows).

Thanks for any help / suggestions / follow-up questions.

--Derek

--20cf300fab4b2d66ea04bcb7c9d1--
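The cost of a mid-list insert in the fixed-page layout can be illustrated with a small Python sketch. It models pages as plain lists rather than HBase rows; the function name `insert_into_pages` and the page size of 30 are assumptions for illustration only. It counts how many page rows would have to be rewritten after a single insert.

```python
PAGE_SIZE = 30  # assumed fixed page size


def insert_into_pages(pages, index, item, page_size=PAGE_SIZE):
    """Insert `item` at list position `index`.

    Returns (new_pages, rows_rewritten), where rows_rewritten counts the
    pages that are new or whose contents differ from the original page.
    """
    flat = [x for page in pages for x in page]
    if not 0 <= index <= len(flat):
        raise IndexError("insert position out of range")
    flat.insert(index, item)
    new_pages = [flat[i:i + page_size] for i in range(0, len(flat), page_size)]
    rewritten = sum(
        1
        for i, page in enumerate(new_pages)
        if i >= len(pages) or page != pages[i]
    )
    return new_pages, rewritten


# Usage: 100 items stored as pages of 30, 30, 30 and 10.
items = [f"item{i:03d}" for i in range(100)]
pages = [items[i:i + PAGE_SIZE] for i in range(0, len(items), PAGE_SIZE)]
_, head_cost = insert_into_pages(pages, 0, "new")   # every page shifts
_, tail_cost = insert_into_pages(pages, 95, "new")  # only the last page changes
```

Under these assumptions an insert at the head of the list rewrites every page for that user, while an insert near the tail rewrites only the last one, so the write cost for the heaviest users depends strongly on where inserts tend to land.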