accumulo-dev mailing list archives

From "Adam Fuchs (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-452) Generalize locality groups
Date Thu, 08 Mar 2012 17:01:58 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225307#comment-13225307 ]

Adam Fuchs commented on ACCUMULO-452:
-------------------------------------

I had another thought on this: locality groups are good for features that belong to a relatively
constant, low-cardinality set with a fairly dense distribution across the primary partitioning
dimension. Also, queries must align with the locality groups frequently enough to amortize
the cost of that partitioning. This means that the current column family-based locality groups
only really help when cells in sorted order frequently oscillate between locality groups.
I want to say that this type of feature tends to be something that is explicitly modeled based
on how the user wants to query their data. If the user decides to put this information in
the row or the column qualifier, could they just as easily put it into the column family?
By the way, expressions like the ones John mentions in ACCUMULO-164 help to group high-cardinality
features into a low-cardinality set of groups, so I think we're on the same page there.
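
To make the idea of collapsing many column families into a few groups concrete, here is a minimal sketch. It assumes, purely for illustration, that the "expressions" are regular expressions matched against column family names; ACCUMULO-164 may define them differently. The class `ExpressionGroupSketch`, its methods, and the `"default"` group name are hypothetical and are not part of Accumulo.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: not an Accumulo class or API.
class ExpressionGroupSketch {
    // Insertion-ordered so that the first matching expression wins.
    private final Map<String, Pattern> groups = new LinkedHashMap<>();

    /** Register a group whose members are all column families matching the regex. */
    void addGroup(String groupName, String regex) {
        groups.put(groupName, Pattern.compile(regex));
    }

    /** Map a (possibly high-cardinality) column family to one of a few groups. */
    String groupFor(String columnFamily) {
        for (Map.Entry<String, Pattern> e : groups.entrySet()) {
            if (e.getValue().matcher(columnFamily).matches()) {
                return e.getKey();
            }
        }
        return "default"; // assumed catch-all group for unmatched families
    }
}
```

With groups such as `metrics -> metric_.*` and `tags -> tag_.*`, thousands of distinct column families would still land in only three groups, which keeps the set of groups small and relatively constant.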

Partitioning based on the timestamp is an interesting consideration. In this case, you would
want a small number of ranges of timestamps to be "active" (not aged off yet) at any one time.
Timestamps are a bit special, though, because they tend to be inserted in increasing order.
Instead of using the locality group mechanism, we might achieve better performance by modifying
the major compaction selection algorithm to avoid merging files that have very different timestamp
ranges. Keeping track of timestamps on a per-file or per-block basis would support bulk filtering,
and would be at least as efficient as locality groups. Might this be another approach to
consider?
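
A rough sketch of the two pieces described above, per-file timestamp ranges used for bulk filtering and a compaction selection that avoids merging files with very different ranges, might look like the following. Everything here is an assumption made for illustration: `TimestampRangeSketch`, `FileStats`, `filesToScan`, `compactionGroups`, and the `maxGap` threshold are hypothetical names, not Accumulo code, and the grouping rule (split wherever the gap between ranges exceeds `maxGap`) is just one simple policy among many.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these names are Accumulo classes.
class TimestampRangeSketch {

    /** Per-file summary: the smallest and largest cell timestamp in the file. */
    static final class FileStats {
        final String name;
        final long minTs;
        final long maxTs;

        FileStats(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /** Bulk filtering: keep only files whose range intersects [queryMin, queryMax]. */
    static List<FileStats> filesToScan(List<FileStats> files, long queryMin, long queryMax) {
        List<FileStats> out = new ArrayList<>();
        for (FileStats f : files) {
            if (f.maxTs >= queryMin && f.minTs <= queryMax) {
                out.add(f);
            }
        }
        return out;
    }

    /**
     * Compaction selection: sort files by minimum timestamp and start a new
     * candidate group whenever the next file begins more than maxGap after
     * the latest timestamp seen in the current group.
     */
    static List<List<FileStats>> compactionGroups(List<FileStats> files, long maxGap) {
        List<FileStats> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((FileStats f) -> f.minTs));

        List<List<FileStats>> groups = new ArrayList<>();
        List<FileStats> current = new ArrayList<>();
        long currentMax = Long.MIN_VALUE;
        for (FileStats f : sorted) {
            if (!current.isEmpty() && f.minTs - currentMax > maxGap) {
                groups.add(current);
                current = new ArrayList<>();
                currentMax = Long.MIN_VALUE;
            }
            current.add(f);
            currentMax = Math.max(currentMax, f.maxTs);
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

Under this policy, files holding old data are only ever merged with other old files, so a scan over recent timestamps can skip them entirely, and age-off can drop whole files instead of rewriting them.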

Like Aaron, I think we need some more details on envisioned scenarios in which more generic
locality groups would be useful before we jump too deeply into implementing them.
                
> Generalize locality groups
> --------------------------
>
>                 Key: ACCUMULO-452
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-452
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>             Fix For: 1.5.0
>
>
> Locality groups are a neat feature, but there is no reason to limit partitioning to column
> families.  Data could be partitioned based on any criteria.  For example, if a user is interested
> in querying recent data and ageing off old data, partitioning locality groups based on timestamp
> would be useful.  This could be accomplished by letting users specify a partitioner plugin
> that is used at compaction and scan time.  Scans would need an ability to pass options to
> the partitioner.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
