accumulo-user mailing list archives

From Josh Elser
Subject Re: Question about best practices on column names
Date Wed, 27 May 2015 14:50:56 GMT
Couple of clarifications:

* Identical rowIDs will colocate data in the same tablet, but not 
necessarily the same file. Tablets can have multiple files.

* Locality groups will colocate data within a file, not necessarily in 
its own file. The RFile format supports multiple "regions" within the 
file, which correspond to locality groups.
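
To make the second point concrete, here is a rough plain-Java model (not 
the Accumulo API; the class, the group names and the "<default>" label 
are made up for this sketch). It only shows which region of a single 
file a column family would land in: the group it is assigned to, or the 
default group if it is not assigned to any.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class LocalityGroupModel {
    // Hypothetical label for the catch-all region of the file.
    static final String DEFAULT_GROUP = "<default>";

    // Hypothetical configuration: group name -> column families in that group.
    static Map<String, Set<String>> sampleGroups() {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        groups.put("contactGroup", new HashSet<>(Arrays.asList("contact")));
        groups.put("statsGroup", new HashSet<>(Arrays.asList("clicks", "views")));
        return groups;
    }

    // Pick the region for a family: its configured group if there is one,
    // otherwise the default group.
    static String groupFor(String family, Map<String, Set<String>> groups) {
        for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
            if (e.getValue().contains(family)) {
                return e.getKey();
            }
        }
        return DEFAULT_GROUP;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> groups = sampleGroups();
        for (String family : Arrays.asList("contact", "views", "profile")) {
            System.out.println(family + " -> " + groupFor(family, groups));
        }
    }
}
```

All three regions in this model live in the same file; a scan that only 
asks for "contact" can skip the other two.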

To David's original question, I like to think of the family/qualifier 
breakdown in the general case as follows: the family is used for a 
coarse grouping of similar data while the qualifier is used as some 
name/identifier for the value.
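
A minimal sketch of that breakdown, again in plain Java rather than the 
Accumulo API (the ColumnKey class and the sample rows are invented for 
illustration, and String ordering stands in for Accumulo's byte-wise 
ordering, which matches for ASCII). Keys sort by row, then family, then 
qualifier, so everything under one family of a row sits together:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

class ColumnModel {
    // Hypothetical stand-in for the row/family/qualifier part of a key.
    static final class ColumnKey {
        final String row;
        final String family;
        final String qualifier;

        ColumnKey(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    // Order by row, then family, then qualifier.
    static final Comparator<ColumnKey> ORDER =
        Comparator.comparing((ColumnKey k) -> k.row)
                  .thenComparing(k -> k.family)
                  .thenComparing(k -> k.qualifier);

    // Family = coarse grouping ("contact", "profile");
    // qualifier = name of the individual value ("email", "phone", "name").
    static TreeMap<ColumnKey, String> sampleTable() {
        TreeMap<ColumnKey, String> table = new TreeMap<>(ORDER);
        table.put(new ColumnKey("user1", "contact", "phone"), "555-0100");
        table.put(new ColumnKey("user1", "profile", "name"), "Ann");
        table.put(new ColumnKey("user1", "contact", "email"), "ann@example.com");
        return table;
    }

    // Collect the qualifiers stored under one family of one row, in key order.
    static List<String> qualifiers(TreeMap<ColumnKey, String> table,
                                   String row, String family) {
        List<String> out = new ArrayList<>();
        for (ColumnKey k : table.keySet()) {
            if (k.row.equals(row) && k.family.equals(family)) {
                out.add(k.qualifier);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(qualifiers(sampleTable(), "user1", "contact"));
    }
}
```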

Accumulo's flexibility in how the data model is implemented 
(specifically, the ability to store any column family in a table via the 
default locality group) lets you implement much more advanced "schemas", 
but the above is definitely the "typical" case if you look at "BigTable" 
use in general, IMO.

Andrew Wells wrote:
> On the surface it adds an additional level of specification/grouping.
> The potential benefit we have in Accumulo is that, along with the fact
> that identical rowIDs are guaranteed to be in the same file, you can
> use locality groups to place specific column families into the same
> file as well, providing faster scans when looking for a specific
> column family.
> On Wed, May 27, 2015 at 9:05 AM, David Patterson wrote:
>     I've been trying to understand the difference between the two column
>     name parts -- column family and column qualifier. I don't understand
>     the value of using the columnFamily for the column name and an
>     "empty text" (new Text(new byte[0])) field for the column qualifier
>     vs. a non-unique column name and the distinct column name in the
>     column qualifier position.
>     I can sort-of understand the distinction if I have multiple distinct
>     kinds of data in my data collection. I could use the column family
>     part to determine how to interpret the rest of the data (what
>     columns I can expect, etc.). But, that kind of data could also be
>     handled with multiple databases.
>     Any guidance would be appreciated.
>     Thanks.
>     Davie Patterson
> --
> *Andrew George Wells*
> *Software Engineer*
