From: "Perko, Ralph J"
To: "user@accumulo.apache.org"
Date: Thu, 7 Jun 2012 07:12:55 -0700
Subject: Re: Table design

Excellent information
– thanks
__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory

From: Eric Newton
Reply-To: "user@accumulo.apache.org"
To: "user@accumulo.apache.org"
Subject: Re: Table design

Some thoughts:

Accumulo will accommodate keys that are very large (like 100K) but I don't recommend it. It makes indexes big and slows down just about every operation. A row-id or column qualifier that is 200 bytes long is not extreme. Remember that compression will decrease the storage requirements, especially since the sort creates natural redundancy in the row id.

Is it important to find "Three men and a baby" just after "Three little pigs"? If not, hash the title and look up the hash. That will give you a nice small key. This also avoids hot-spots, like all the titles that start with "The" or a common letter, like "S". But you may need to deal with hash collisions.

Counters can give you "append" hot-spots. As you ingest, the most active tablet will always be the newest one.

A random UUID is useful, but large, if you just want a unique identifier associated with a title.

Accumulo performance should not change if you have 1 table or 100. But tables are a convenient unit for management. You can offline, compact and delete a table. You can configure many table-specific properties which can give you performance benefits.

-Eric

On Wed, Jun 6, 2012 at 4:46 PM, Perko, Ralph J wrote:

Hi, I am in the process of designing some Accumulo tables for an app and have some questions:

Lookup Table:
The data's natural qualifier is a title. This title can be any length. Some are as long as 200 characters.
I am using this title as a row id and also as a column qualifier in other places.
Is it considered good practice to have a lookup table for these titles (like RDBMS), replacing them with some incremented integer value, or should I just continue to use these long titles as row ids?

Multiple Tables:
What are the best practices around when to create a new table? I have been breaking up my tables based on row id semantics. For example, title row ids are in a different table than row ids based on some analysis count.
Does breaking up data into multiple tables help, hurt, or do nothing for Accumulo performance?

Thanks,
Ralph
__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory
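
For anyone reading this thread later, here is a minimal sketch of the "hash the title and look up the hash" suggestion, written in Python only for brevity. It does not touch Accumulo itself. The function names (title_row_id, build_lookup, find_title), the choice of SHA-256 truncated to 16 bytes, and the idea of keeping the full title next to the hash so collisions can be checked on read are all assumptions made for illustration, not something the thread prescribes.

```python
import hashlib


def title_row_id(title: str, digest_bytes: int = 16) -> str:
    """Return a fixed-length hex row id derived from a title.

    Assumption: SHA-256 truncated to `digest_bytes` bytes. Any stable
    hash would do; the truncation length trades key size for collision risk.
    """
    digest = hashlib.sha256(title.strip().encode("utf-8")).hexdigest()
    return digest[: digest_bytes * 2]  # two hex characters per byte


def build_lookup(titles):
    """Map row id -> list of full titles.

    Keeping the full title alongside the hash (in a real table this could
    live in a column qualifier or value) is one hypothetical way to deal
    with collisions: several titles may share one row id.
    """
    lookup = {}
    for title in titles:
        bucket = lookup.setdefault(title_row_id(title), [])
        if title not in bucket:
            bucket.append(title)
    return lookup


def find_title(lookup, title: str) -> bool:
    """Look up by hash, then confirm the stored title to rule out a collision."""
    return title in lookup.get(title_row_id(title), [])


titles = [
    "Three men and a baby",
    "Three little pigs",
    "The Matrix",
    "The Godfather",
]
lookup = build_lookup(titles)

# Row ids are short and uniform in length regardless of title length.
for row_id, stored in sorted(lookup.items()):
    print(row_id, stored)
```

Because the row ids sort by hash rather than by title, titles that share a prefix such as "The" no longer cluster together, which is the hot-spot point made above. The cost is the one Eric names: range scans over neighbouring titles stop being meaningful.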