accumulo-dev mailing list archives

From "Aaron Cordova (Issue Comment Edited) (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (ACCUMULO-452) Generalize locality groups
Date Thu, 08 Mar 2012 20:27:57 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225464#comment-13225464 ]

Aaron Cordova edited comment on ACCUMULO-452 at 3/8/12 8:26 PM:
----------------------------------------------------------------

Just another comment on the type of complexity I'd like to avoid. 

Specifically, it's good to have orthogonality in your features. 

Locality groups are for physical partitioning; timestamps are for data versioning. If someone
wants to partition their data into time ranges, they are free to do so using locality groups.
They simply have to decide what their column families will be, build some information
about time ranges into them, and assign them to locality groups.
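
A minimal sketch of that approach, assuming monthly buckets. The class name, the
"<base>_<yyyy-MM>" family naming, and the "lg_" group prefix are all hypothetical, and plain
strings stand in for Accumulo's Text type. The resulting map has the shape that
TableOperations.setLocalityGroups(tableName, groups) expects.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: encodes a monthly time bucket in the column family name
// and groups the families of each month into one locality group.
final class TimeBucketedFamilies {

    // Assumed naming scheme: "<base>_<yyyy-MM>", e.g. "event_2012-03".
    static String familyFor(String base, YearMonth month) {
        return base + "_" + month;
    }

    // Same as above, deriving the month (in UTC) from a millisecond timestamp.
    static String familyFor(String base, long epochMillis) {
        YearMonth month = YearMonth.from(
                Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC));
        return familyFor(base, month);
    }

    // Builds locality group name -> column families, one group per month.
    static Map<String, Set<String>> localityGroups(List<String> bases,
                                                   List<YearMonth> months) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (YearMonth month : months) {
            Set<String> families = new TreeSet<>();
            for (String base : bases) {
                families.add(familyFor(base, month));
            }
            groups.put("lg_" + month, families);
        }
        return groups;
    }
}
```

A writer would call familyFor("event", timestamp) when building each mutation, and a scan for
one month would fetch only that month's families, so only that locality group is read.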

Another kind of partitioning happens with row IDs, which lets accesses to a small range of rows
involve one or a small number of servers. This kind of partitioning is nice because it's
automatic: one doesn't have to worry about whether the ranges are the right granularity, because
Accumulo splits based on size.

Now we're talking about adding a third way to physically split data, timestamps, and basing
it on something designed for some other purpose, which is data versioning.

Timestamps do allow users to get data for only a particular time period, but the intent is
to limit the data after the row and columns have been selected, or maybe for short scans.
I'm guessing your users want to scan over a lot of rows and columns, but only for data that
falls within a particular time period. For this they should build time ranges into their rows
or columns.
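
As a rough illustration of the row-based variant, assuming a fixed-width "yyyyMMdd_" day
prefix on every row ID: the class and method names are hypothetical, and plain strings stand in
for Accumulo rows. Because the prefix is fixed-width, lexicographic row order matches time
order, so a time period becomes an ordinary row range (start inclusive, end exclusive).

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: prefixes row IDs with a UTC day so that a time period
// maps onto a contiguous range of rows.
final class TimePrefixedRows {

    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Assumed row layout: "<yyyyMMdd>_<id>", e.g. "20120308_user42".
    static String rowFor(long epochMillis, String id) {
        LocalDate day = Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneOffset.UTC).toLocalDate();
        return DAY.format(day) + "_" + id;
    }

    // Start row (inclusive) and end row (exclusive) covering [from, until).
    static String[] rowRange(LocalDate from, LocalDate until) {
        return new String[] { DAY.format(from), DAY.format(until) };
    }

    // True if the row sorts inside the half-open range, as a scan would see it.
    static boolean inRange(String row, String[] range) {
        return row.compareTo(range[0]) >= 0 && row.compareTo(range[1]) < 0;
    }
}
```

The two strings from rowRange would serve as the start and end rows of a scan, and the
timestamp field stays free for versioning.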

There are already two ways to let users do this. I think adding a third will just add
complexity and could interfere with the original versioning functionality. I don't necessarily
mean code complexity, but rather complexity in the users' minds as to how to model their data.
                
> Generalize locality groups
> --------------------------
>
>                 Key: ACCUMULO-452
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-452
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>             Fix For: 1.5.0
>
>         Attachments: PartitionerDesign.txt
>
>
> Locality groups are a neat feature, but there is no reason to limit partitioning to column
> families.  Data could be partitioned based on any criteria.  For example, if a user is interested
> in querying recent data and ageing off old data, partitioning locality groups based on timestamp
> would be useful.  This could be accomplished by letting users specify a partitioner plugin
> that is used at compaction and scan time.  Scans would need an ability to pass options to
> the partitioner.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
