accumulo-user mailing list archives

From "Perko, Ralph J" <Ralph.Pe...@pnnl.gov>
Subject Re: Table design
Date Thu, 07 Jun 2012 14:12:55 GMT
Excellent information – thanks

__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory



From: Eric Newton <eric.newton@gmail.com>
Reply-To: user@accumulo.apache.org
To: user@accumulo.apache.org
Subject: Re: Table design

Some thoughts:

Accumulo will accommodate keys that are very large (like 100K) but I don't recommend it. It
makes indexes big and slows down just about every operation.  A row-id or column qualifier
that is 200 bytes long is not extreme.  Remember that compression will decrease the storage
requirements, especially since the sort creates natural redundancy in the row id.
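To get a rough feel for the compression point, here is a small sketch in Python. The row ids are made up, and zlib only stands in for whatever codec a table is actually configured with, so treat the ratio as an illustration and not a measurement of Accumulo itself.

```python
import zlib

# Hypothetical row ids: sorted keys that share a long common prefix.
row_ids = sorted(f"Three little pigs, edition {i:04d}" for i in range(1000))

raw = "\n".join(row_ids).encode("utf-8")
packed = zlib.compress(raw)

# Fraction of the original size left after compression.
ratio = len(packed) / len(raw)
print(f"{len(raw)} bytes raw, {len(packed)} bytes compressed, ratio {ratio:.2f}")
```

Because neighbouring keys differ only in their last few characters, the compressed form should be far smaller than the raw bytes.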

Is it important to find "Three men and a baby" just after "Three little pigs"?  If not, hash
the title and look up the hash.  That will give you a nice small key.  This also avoids hot-spots,
like all the titles that start with "The" or a common letter, like "S". But you may need to
deal with hash collisions.
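A minimal sketch of the hashing idea, in Python for brevity. The helper name `title_row_id`, the choice of SHA-1, and the truncation to 8 bytes are all assumptions made for illustration; nothing here is required by Accumulo.

```python
import hashlib

def title_row_id(title: str, length: int = 8) -> str:
    """Return a short, fixed-width hex row id derived from a title (hypothetical helper)."""
    digest = hashlib.sha1(title.encode("utf-8")).digest()
    # Truncating the digest shortens the key but raises the odds of a collision.
    return digest[:length].hex()

row_a = title_row_id("Three little pigs")
row_b = title_row_id("Three men and a baby")
print(row_a, row_b)
```

One way to cope with collisions is to keep the full title in the column qualifier or the value, so a reader can check that the entry found under a hash really belongs to the title it asked for.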

Counters can give you "append" hot-spots.  As you ingest, the most active tablet will always
be the newest one.
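The append hot-spot can be illustrated with a toy model in Python. The split points, the zero-padded counter format, and the use of SHA-1 are assumptions chosen for the example; it only mimics how sorted row ids map onto tablets and does not talk to Accumulo.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical split points dividing each row space into four tablets.
counter_splits = ["00002500", "00005000", "00007500"]
hash_splits = ["4", "8", "c"]

def tablet_for(row: str, splits: list) -> int:
    # Tablet i holds rows up to and including splits[i];
    # index len(splits) is the last tablet.
    return bisect.bisect_left(splits, row)

new_ids = range(9000, 10000)  # the next 1000 sequential ids to arrive

# Zero-padded counters used directly as row ids.
counter_load = Counter(tablet_for(f"{i:08d}", counter_splits) for i in new_ids)

# The same ids hashed before being used as row ids.
hash_load = Counter(
    tablet_for(hashlib.sha1(str(i).encode()).hexdigest(), hash_splits)
    for i in new_ids
)

print("counter row ids:", dict(counter_load))
print("hashed row ids: ", dict(hash_load))
```

With counter row ids every new write lands in the last tablet, while the hashed ids spread across all four.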

A random UUID is useful, but large, if you just want a unique identifier associated with a
title.
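For a sense of how large "large" is, a quick check in Python using the standard `uuid` module. How the identifier would be stored (raw bytes or text) is up to the application; both sizes are shown.

```python
import uuid

u = uuid.uuid4()   # random (version 4) UUID

raw = u.bytes      # binary form
text = str(u)      # hyphenated hex form

print(len(raw), "bytes as binary")
print(len(text), "characters as text")
```

The binary form is 16 bytes and the usual text form is 36 characters, so storing the raw bytes keeps the key less than half the size.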

Accumulo performance should not change if you have 1 table or 100.  But tables are a convenient
unit for management.  You can offline, compact and delete a table.  You can configure many
table-specific properties which can give you performance benefits.

-Eric

On Wed, Jun 6, 2012 at 4:46 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov> wrote:
Hi,  I am in the process of designing some Accumulo tables for an app and have some questions:

Lookup Table:
The data's natural qualifier is a title.  This title can be any length.  Some are as long
as 200 characters.
I am using this title as a row id and also as a column qualifier in other places.
Is it considered good practice to have a lookup table for these titles (like RDBMS), replacing
them with some incremented integer value, or should I just continue to use these long titles
as row ids?

Multiple Tables:
What are the best practices around when to create a new table?  I have been breaking up my
tables based on row id semantics.  For example, title row ids are in a different table than
row ids based on some analysis count.
Does breaking up data into multiple tables help, hurt, or do nothing for Accumulo performance?

Thanks,
Ralph
__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory



