hbase-dev mailing list archives

From Stack <st...@duboce.net>
Subject Re: [DISCUSSION] Accumulo, another BigTable clone, has shown up on Apache Incubator as a proposal
Date Tue, 06 Sep 2011 16:58:55 GMT
Thanks for the below Duane.  Helps.

See below.

On Tue, Sep 6, 2011 at 9:21 AM, Duane Moore <duane.moore@issinc.com> wrote:
> - Column Families
> In HBase you must specify all column families up front as part of the
> table schema declaration when creating a table.
> Accumulo does not have this restriction: you do not declare column
> families when you create a table. When you insert a new row into the table
> you can just provide a new column family.
> ** Note: sounds like from what Stack said, this is close to being OBE?

I'm about to get an Order of the British Empire

Yeah, I think Overtaken By Events seems about right.

Having the client add column families free-form seems like a bad idea
to me.  There should be some friction since there are physical
implications each time a new CF is added.

But then Accumulo has a form of locality groups, so it seems to me that
this freeform adding of column families is just something that follows
on from their having locality groups (I wonder how you do locality
group editing in Accumulo?  Do you have to take the table offline?)

> - Aggregation
> Accumulo offers the ability to specify an aggregator for an individual
> column family or column. This allows you to keep a row count, or summation
> of numerical values that may be stored in a particular column. It would
> appear the function has to operate on the subset of values stored for that
> column in the table at a particular time since it keeps the aggregate
> value in memory. So this may not be able to handle certain aggregation
> functions like 'median' for instance. But functions like sum, max, min,
> mean, and count should all be supportable.
> I could not find a comparable feature within HBase, but HBase does offer
> an atomic function called incrementColumnValue on the HTable class, which
> it appears can be leveraged to provide aggregation behavior.

Yeah, we have ICVs and you can aggregate outside of HBase in the
client but it sounds like the above is a subset of
https://issues.apache.org/jira/browse/HBASE-1512, committed to TRUNK?
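
To make the sum-versus-median point in the quoted text concrete, here is
a minimal sketch in plain Java.  It uses no Accumulo or HBase classes;
RunningAggregate and its methods are hypothetical names for illustration
only, not either project's aggregator interface.  It shows why count,
sum, min, max, and mean fit a keep-the-aggregate-in-memory design: each
needs only a fixed amount of state, updated one value at a time.

```java
// Hypothetical illustration only: not Accumulo's aggregator interface
// and not an HBase API.
public class RunningAggregate {
    private long count = 0;
    private long sum = 0;
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    // Fold one more cell value into the fixed-size running state.
    public void collect(long value) {
        count++;
        sum += value;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    public long count() { return count; }
    public long sum() { return sum; }
    public long min() { return min; }
    public long max() { return max; }

    // Mean is derived from sum and count; defined as 0.0 when empty.
    public double mean() { return count == 0 ? 0.0 : (double) sum / count; }

    public static void main(String[] args) {
        RunningAggregate agg = new RunningAggregate();
        long[] values = {4, 9, 2, 7};
        for (long v : values) {
            agg.collect(v);
        }
        System.out.println("count=" + agg.count() + " sum=" + agg.sum()
                + " min=" + agg.min() + " max=" + agg.max()
                + " mean=" + agg.mean());
    }
}
```

An exact median has no such fixed-size running state in general: it
needs all of the values seen so far (or an approximation), which is why
it falls outside what this style of aggregator can handle.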

> - Column Visibility
> This is the feature in Accumulo that allows tagging of the data at the
> column level, which would primarily be used for classification markings
> (in our scenario).
> If we were to implement the same type of column visibility in HBase that
> Accumulo supports, we would have potentially several options:
> -Try to implement column visibility as a patch to HBase. Would be fun, but
> may be a lot of work.
> -Since the value of a particular column (cell, actually) is simply a byte
> array, we could utilize a standard technique of encoding the visibility
> level/classification in the column value itself.
> -Since the number of columns is not pre-defined, adopt a convention
> whereby each column "foo" gets an additional column added by our
> infrastructure called "foo_visibility".
> ** Note: We have a requirement to use PKI (digital certificates) for
> authentication in our service stack. The relationship between PKI and
> Kerberos currently used for Secure HBase is interesting; not quite sure
> how the two would fit together in practice.

We'd entertain #1 (Gary above cites an issue where he ruminates on
what would be involved:
https://issues.apache.org/jira/browse/HBASE-3435).  I don't get why
this has to be in the KV rather than as a version of #2 (but hey, I'm
slow).  #3 sounds a little messy.  #4 sounds like the proper way to
get per user auth.  I'd be interested in helping out getting that to
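
For reference, the second option in the quoted list (encoding the
visibility label in the cell value itself) might look roughly like the
sketch below.  The byte layout (2-byte label length, UTF-8 label, then
payload) and the LabeledValue class are assumptions made up for
illustration; nothing here is an HBase or Accumulo API, and enforcement
would still have to happen wherever the value is decoded.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: packs a visibility label in front of a cell value.
public class LabeledValue {
    // Assumed layout: [2-byte label length][label bytes, UTF-8][payload].
    public static byte[] encode(String label, byte[] payload) {
        byte[] labelBytes = label.getBytes(StandardCharsets.UTF_8);
        if (labelBytes.length > Short.MAX_VALUE) {
            throw new IllegalArgumentException("label too long");
        }
        ByteBuffer buf =
                ByteBuffer.allocate(2 + labelBytes.length + payload.length);
        buf.putShort((short) labelBytes.length);
        buf.put(labelBytes);
        buf.put(payload);
        return buf.array();
    }

    // Read back the label without touching the payload.
    public static String label(byte[] encoded) {
        ByteBuffer buf = ByteBuffer.wrap(encoded);
        int len = buf.getShort();
        byte[] labelBytes = new byte[len];
        buf.get(labelBytes);
        return new String(labelBytes, StandardCharsets.UTF_8);
    }

    // Strip the label header and return only the original value bytes.
    public static byte[] payload(byte[] encoded) {
        int len = ByteBuffer.wrap(encoded).getShort();
        return Arrays.copyOfRange(encoded, 2 + len, encoded.length);
    }

    public static void main(String[] args) {
        byte[] cell = encode("secret", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(label(cell));
        System.out.println(new String(payload(cell), StandardCharsets.UTF_8));
    }
}
```

One cost of this approach is that every reader of the column has to know
the layout, since the stored bytes are no longer the bare value.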

> -Retrieving Data
> Accumulo uses a Scanner object for all retrieval operations; a Scanner is
> instantiated by retrieving it from the Connector object. When
> retrieving all values for a particular row, _each individual cell is
> returned as a new entry_ by the Scanner iterator.
> In HBase, you can use a Scan object (org.apache.hadoop.hbase.client.Scan)
> or you can use a Get object, which allows you to retrieve a single row at
> a time. In either case, the org.apache.hadoop.hbase.client.Result class is
> returned, representing all of the requested data for that particular row.
> In HBase, to set constraints on a query, you set an
> org.apache.hadoop.hbase.filter.Filter object on the Scan object. Multiple
> Filters may be set by using the FilterList object. In Accumulo, you call
> the setScanIterators() method on the Scanner object, which enables the
> appropriate iterators for use on the server before returning data.
> ** Note: primary difference here is in the use of server-side iterators,
> which Andy has correctly pointed out could be implemented via the
> coprocessor framework.  We did some initial investigation into
> coprocessors to see if we could implement this equivalent functionality,
> but since we'd been directed to use Accumulo, we didn't have much
> bandwidth to address this (also coprocessors were in their infancy at the
> time).

Yeah, sounds like it.
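
The difference in result shape described in the quoted text (one entry
per cell versus one Result per row) can be sketched in plain Java as
below.  RowGrouping, Cell, and groupByRow are hypothetical names, not
classes from either client library; the sketch only shows how a stream
of per-cell entries folds into per-row groups.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: folds per-cell entries into per-row groups.
public class RowGrouping {
    // One cell, as a scanner that yields an entry per cell might return it.
    public static final class Cell {
        public final String row;
        public final String column;
        public final String value;

        public Cell(String row, String column, String value) {
            this.row = row;
            this.column = column;
            this.value = value;
        }
    }

    // Collect cells into one column -> value map per row, keeping the
    // order in which rows and columns were first seen.
    public static Map<String, Map<String, String>> groupByRow(List<Cell> cells) {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (Cell c : cells) {
            rows.computeIfAbsent(c.row, r -> new LinkedHashMap<>())
                .put(c.column, c.value);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
                new Cell("r1", "cf:a", "1"),
                new Cell("r1", "cf:b", "2"),
                new Cell("r2", "cf:a", "3"));
        System.out.println(groupByRow(cells));
    }
}
```

With the per-cell shape the caller does this grouping (or not) as it
sees fit; with the per-row shape it has already been done by the time
the result is handed back.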

> Hope that helps.  Bottom line is that I believe that the features in
> Accumulo can and ought to be merged into HBase at some point (assuming the
> technical merits hold up).  Looking forward to contributing to that
> conversation.

Thanks for the helpful note Duane.

Good stuff,
