From: Eric Newton
To: user@accumulo.apache.org
Date: Thu, 7 Jun 2012 09:05:01 -0400
Subject: Re: Table design

Some thoughts:

Accumulo will accommodate keys that are very large (like 100K), but I don't recommend it. It makes indexes big and slows down just about every operation. A row-id or column qualifier that is 200 bytes long is not extreme. Remember that compression will decrease the storage requirements, especially since the sort creates natural redundancy in the row id.
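
A rough way to see the effect of that redundancy, as a sketch only: the titles and column qualifiers below are made up, and Python's zlib stands in for whatever compression the table files are configured to use.

```python
import zlib

# Hypothetical data: 50 long titles with a shared prefix, each stored under
# several column qualifiers, so the same row id repeats in sorted order.
titles = sorted(
    f"Three little pigs and the big bad wolf, volume {i:03d}" for i in range(50)
)
qualifiers = ["author", "genre", "summary", "year"]

# One line per key, laid out the way sorted keys sit next to each other.
keys = "\n".join(f"{t}\x00{q}" for t in titles for q in qualifiers).encode("utf-8")

raw_size = len(keys)
compressed_size = len(zlib.compress(keys))

print(f"raw: {raw_size} bytes, compressed: {compressed_size} bytes")
```

The exact ratio depends on the data; the point is only that long, repeated, sorted row ids cost much less on disk than their raw length suggests.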

Is it important to find "Three men and a baby" just after "Three little pigs"? If not, hash the title and look up the hash. That will give you a nice small key. This also avoids hot-spots, like all the titles that start with "The" or a common letter, like "S". But you may need to deal with hash collisions.
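
A minimal sketch of the hashing idea, not tied to any Accumulo API. The function names, the choice of SHA-256, the 16-character truncation, and the (row, family, qualifier) tuple layout are all assumptions for illustration. Collisions are handled here by keeping the full title in the column qualifier and comparing it on lookup.

```python
import hashlib


def title_row_id(title: str, length: int = 16) -> str:
    """Return a fixed-length hex row id derived from the title."""
    digest = hashlib.sha256(title.encode("utf-8")).hexdigest()
    return digest[:length]


def make_entry(title: str) -> tuple:
    # Keep the full title in the qualifier so two titles that happen to
    # hash to the same row id can still be told apart.
    return (title_row_id(title), "title", title)


def lookup(entries, title):
    # Match on the hashed row first, then confirm the stored title.
    row = title_row_id(title)
    return [e for e in entries if e[0] == row and e[2] == title]


entries = sorted(
    make_entry(t)
    for t in ["Three little pigs", "Three men and a baby", "The Hobbit"]
)
found = lookup(entries, "Three men and a baby")
print(found)
```

The row id has the same length for a 5-character title and a 200-character one. The cost is that sorted order no longer follows the titles, so range scans over neighboring titles are not possible with this layout.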

Counters can give you "append" hot-spots. As you ingest, the most active tablet will always be the newest one.
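
The append hot-spot can be illustrated with a small simulation. This is a sketch under assumed values: the split points, the four-tablet layout, and the key formats are hypothetical, and a tablet is modeled simply as the range of keys up to and including its split point.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical split points dividing each key space into 4 tablets.
counter_splits = ["0000250", "0000500", "0000750"]
hash_splits = ["4", "8", "c"]


def tablet_for(key, splits):
    # Index of the first tablet whose end row is >= key.
    return bisect.bisect_left(splits, key)


# New ingest: the next 100 counter values, zero-padded so they sort numerically.
counter_keys = [f"{i:07d}" for i in range(1000, 1100)]
# The same records keyed by a hash of the counter value instead.
hashed_keys = [hashlib.sha256(k.encode()).hexdigest()[:16] for k in counter_keys]

counter_load = Counter(tablet_for(k, counter_splits) for k in counter_keys)
hashed_load = Counter(tablet_for(k, hash_splits) for k in hashed_keys)

print("counter keys per tablet:", dict(counter_load))
print("hashed keys per tablet: ", dict(hashed_load))
```

Every new counter key sorts after the last split, so all writes go to the final tablet, while the hashed keys are spread across the tablets.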
A random UUID is useful, but large, if you just want a unique identifier associated with a title.
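
To put numbers on "large", here is a short sketch comparing the size of a random UUID in its common encodings. The variable names are illustrative only.

```python
import uuid

title = "Three men and a baby"
random_id = uuid.uuid4()

text_form = str(random_id)      # hyphenated hex string
hex_form = random_id.hex        # hex string without hyphens
binary_form = random_id.bytes   # raw bytes

print(len(text_form), len(hex_form), len(binary_form))  # prints "36 32 16"

# The UUID is random, so it cannot be recomputed from the title; the
# title -> UUID mapping has to be stored somewhere to look it up later.
title_to_id = {title: binary_form}
```

Storing the raw 16 bytes rather than the 36-character string keeps the key smaller, at the price of keys that are not human-readable.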

Accumulo performance should not change if you have 1 table or 100. But tables are a convenient unit for management. You can offline, compact and delete a table. You can configure many table-specific properties which can give you performance benefits.

-Eric

On Wed, Jun 6, 2012 at 4:46 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov> wrote:

> Hi, I am in the process of designing some Accumulo tables for an app and have some questions:
>
> Lookup Table:
> The data's natural qualifier is a title. This title can be any length. Some are as long as 200 characters.
> I am using this title as a row id and also as a column qualifier in other places.
> Is it considered good practice to have a lookup table for these titles (like RDBMS), replacing them with some incremented integer value, or should I just continue to use these long titles as row ids?
>
> Multiple Tables:
> What are the best practices around when to create a new table? I have been breaking up my tables based on row id semantics. For example, title row ids are in a different table than row ids based on some analysis count.
> Does breaking up data into multiple tables help, hurt, or do nothing for Accumulo performance?
>
> Thanks,
> Ralph
> __________________________________________________
> Ralph Perko
> Pacific Northwest National Laboratory