accumulo-user mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: Question about best practices on column names
Date Wed, 27 May 2015 14:52:02 GMT
David-

Both the column family (CF) and column qualifier (CQ) could be thought of
as arbitrary dimensions in the key. If you only need one dimension to
specify your data, the other can be empty. You could also store these in
separate tables, as you suggest, but part of the power of Accumulo is that
you don't actually need to separate your data this way. You can keep it in
the same table, organized by CF and, as Andrew alludes to, you can store
specific CFs in a particular locality group for faster access when querying
just data in those CFs.

If you have only one category of data, I'd recommend storing the specifier
in the CQ, not the CF. Although they look like they would be equivalent for
this case, you're more resilient to future changes if you use the CQ,
because now you can reserve the CF for later changes to the schema you're
using or for another kind of data you want to mix in later.
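
As a rough illustration of the two points above, here is a minimal sketch in
plain Java. It does not use the Accumulo API: ColKey, sampleTable, qualifiers,
and the sample rows and values are hypothetical names made up for this
example, and String ordering stands in for Accumulo's byte-wise key ordering.
It shows one table holding the specifier in the CQ with the CF left empty, and
a second kind of data added later under its own CF without touching the
existing entries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ColumnLayoutSketch {
    // Hypothetical stand-in for the (row, CF, CQ) part of an Accumulo key.
    record ColKey(String row, String cf, String cq) {}

    // Sort by row, then CF, then CQ. String comparison is a simplification
    // of the byte-wise ordering a real table would use.
    static final Comparator<ColKey> ORDER =
        Comparator.comparing(ColKey::row)
                  .thenComparing(ColKey::cf)
                  .thenComparing(ColKey::cq);

    // Return the CQs stored under one CF for a row, in sorted order.
    static List<String> qualifiers(Map<ColKey, String> table, String row, String cf) {
        List<String> out = new ArrayList<>();
        for (ColKey k : table.keySet()) {
            if (k.row().equals(row) && k.cf().equals(cf)) {
                out.add(k.cq());
            }
        }
        return out;
    }

    static Map<ColKey, String> sampleTable() {
        Map<ColKey, String> table = new TreeMap<>(ORDER);
        // Today: one category of data, specifier in the CQ, CF left empty.
        table.put(new ColKey("row1", "", "name"), "David");
        table.put(new ColKey("row1", "", "email"), "d@example.com");
        // Later: a second kind of data mixed into the same table under its own CF.
        table.put(new ColKey("row1", "audit", "lastLogin"), "2015-05-27");
        return table;
    }

    public static void main(String[] args) {
        Map<ColKey, String> table = sampleTable();
        System.out.println(qualifiers(table, "row1", ""));      // original data only
        System.out.println(qualifiers(table, "row1", "audit")); // newly added category
    }
}
```

Readers that select on the empty CF keep seeing only the original columns,
which is the resilience to later schema changes described above.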

In addition, I would recommend using the CF to store data from a finite set
(such as "one of {STRING, DATE, INT}" or "one of {CategoryA, CategoryB}" or
"one of {Schema1, Schema2}", etc.), while you can use the CQ to store
arbitrary data (such as "<date>", "<number>", "<name>", etc.). The reason
for this is that locality groups, should you ever decide to use them at
some point in the future, can only (currently) be specified as a finite
discrete set of CFs, and not a pattern or other predicate. So, not storing
arbitrary data in the CFs will leave that option available to you.
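
To make the finite-set constraint concrete, here is a small sketch in plain
Java, again not the Accumulo API. The group names, the groupFor helper, and
the "default" fallback label are assumptions for illustration. The map has the
same general shape as a locality group configuration (group name to an
explicit set of CFs; in Accumulo this is passed to
TableOperations.setLocalityGroups). Because membership is by exact match, a CF
drawn from arbitrary data can never be listed ahead of time.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LocalityGroupSketch {
    // Group name -> explicit set of CFs. Every CF has to be enumerated.
    static Map<String, Set<String>> groups() {
        Map<String, Set<String>> g = new LinkedHashMap<>();
        g.put("typed", Set.of("STRING", "DATE", "INT"));
        g.put("schemas", Set.of("Schema1", "Schema2"));
        return g;
    }

    // Resolve a CF to its group by exact membership only. There is no
    // pattern matching, so an unlisted CF falls through to "default"
    // (a label assumed here for data outside any configured group).
    static String groupFor(Map<String, Set<String>> groups, String cf) {
        for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
            if (e.getValue().contains(cf)) {
                return e.getKey();
            }
        }
        return "default";
    }

    public static void main(String[] args) {
        Map<String, Set<String>> g = groups();
        System.out.println(groupFor(g, "DATE"));       // a CF from the finite set
        System.out.println(groupFor(g, "2015-05-27")); // an arbitrary value used as a CF
    }
}
```

If dates or names had been stored in the CF, each new value would need its own
entry in the configuration, which is why those belong in the CQ.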

Basically, you are right that you can use either or both for your column
names, but there are a few good practices that might help you decide which
is better to use for your data.


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Wed, May 27, 2015 at 9:17 AM, Andrew Wells <awells@clearedgeit.com>
wrote:

> On the surface it adds an additional level of specification/grouping.
>
> The potential benefit we have in Accumulo is that, along with the fact that
> identical row IDs are guaranteed to be in the same file, you can use
> locality groups to place specific column families into the same file as
> well. This provides faster scans when looking for a specific column family.
>
>
>
> On Wed, May 27, 2015 at 9:05 AM, David Patterson <patterd@gmail.com>
> wrote:
>
>> I've been trying to understand the difference between the two column name
>> parts -- column family and column qualifier. I don't understand the value
>> of using the columnFamily for the column name and an "empty text" (new
>> Text(new byte[0])) field for the column qualifier vs. a non-unique column
>> name and the distinct column name in the column qualifier position.
>>
>>
>> I can sort-of understand the distinction if I have multiple distinct
>> kinds of data in my data collection. I could use the column family part to
>> determine how to interpret the rest of the data (what columns I can expect,
>> etc.). But, that kind of data could also be handled with multiple databases.
>>
>> Any guidance would be appreciated.
>>
>> Thanks.
>>
>> David Patterson
>>
>
>
>
> --
> *Andrew George Wells*
> *Software Engineer*
> *awells@clearedgeit.com <awells@clearedgeit.com>*
>
>
