hbase-dev mailing list archives

From Bradford Stephens <bradfordsteph...@gmail.com>
Subject Re: [DISCUSSION] Accumulo, another BigTable clone, has shown up on Apache Incubator as a proposal
Date Fri, 09 Sep 2011 19:24:47 GMT
Accumulo seems mostly like features we can roll into HBase. Decline.

On Fri, Sep 9, 2011 at 2:50 PM, Andrew Purtell <apurtell@apache.org> wrote:
>> From: Duane Moore <duane.moore@issinc.com>
>
>> I will second what Todd and Joey
>> said and reiterate that contributing to open source is not easy for a
>> government contractor, and especially not easy for U.S. government
>> employees.
>
>
> This is true as a general statement I'm sure.
>
> However, my former life was as an engineer in a DARPA shop with a TS clearance.
> During that time I worked on both closed/classified systems and projects such as
> TrustedBSD (http://www.trustedbsd.org/). Choosing to develop an internal alternative
> rather than work with the HBase project was a decision of convenience by someone.
>
> While all appreciate this eventual open sourcing on some level, the outcome is hardly
> optimal, and does not favor in my opinion the existing open source community here
> (HBase) in the short term, and any long term favor is going to require work by that
> community.
>
>> My personal preference for a long while has been to migrate
>> our Accumulo implementation to HBase, but as with any project there are
>> often non-technical considerations for doing so.
>
>
> I can only hope that open source communities in general will apply a penalty for
> taking the easy way out for such non-technical considerations. We do not have to act
> as beggars. Presumably this open sourcing was not done out of charity -- I would be
> quite surprised, maybe shocked. If government (or contractors) want to leverage open
> source communities for some benefit, the least we can do is insist on respectful terms.
>
> Best regards,
>
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
>
>
> ----- Original Message -----
>> From: Duane Moore <duane.moore@issinc.com>
>> To: "dev@hbase.apache.org" <dev@hbase.apache.org>
>> Cc:
>> Sent: Tuesday, September 6, 2011 9:21 AM
>> Subject: Re: [DISCUSSION] Accumulo, another BigTable clone, has shown up on Apache
>> Incubator as a proposal
>>
>> Hello all,
>>
>> I've been a lurker on the HBase list for a year or so and our company has
>> also been working with the Accumulo implementation during the same time
>> frame.  I'd like to respond to Stack's suggestion to focus on the
>> technical merits of the proposal.  Since I have some info on the pre-open
>> sourced version of Accumulo, I'd like to share some of our evaluation of
>> the software, primarily from a client perspective (vs. implementation
>> details like logging to NFS vs HDFS).
>>
>> First, I share many of the same concerns of folks who were frustrated that
>> this project seems to duplicate the effort of the open source
>> (particularly HBase) community.  However, I will second what Todd and Joey
>> said and reiterate that contributing to open source is not easy for a
>> government contractor, and especially not easy for U.S. government
>> employees.  My personal preference for a long while has been to migrate
>> our Accumulo implementation to HBase, but as with any project there are
>> often non-technical considerations for doing so.
>>
>> Below are some notes we took last year on the differences between Accumulo
>> and HBase, with additional notes from me inline.  Much of this mirrors
>> what is in the current Accumulo proposal.
>>
>> -----
>>
>> - Column Families
>> In HBase you must specify all column families up front as part of the
>> table schema declaration when creating a table.
>> Accumulo does not have this restriction: you do not declare column
>> families when you create a table. When you insert a new row into the table
>> you can just provide a new column family.
>> ** Note: sounds like from what Stack said, this is close to being OBE?
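The column-family difference described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not HBase or Accumulo client code; the class names `DeclaredFamilyTable` and `OpenFamilyTable` are hypothetical and exist only for this illustration.

```python
class DeclaredFamilyTable:
    """Families are fixed at table creation (the HBase behaviour as described above)."""

    def __init__(self, families):
        self.families = set(families)
        self.cells = {}

    def put(self, row, family, qualifier, value):
        # Writes to a family outside the declared schema are rejected.
        if family not in self.families:
            raise KeyError("undeclared column family: " + family)
        self.cells[(row, family, qualifier)] = value


class OpenFamilyTable:
    """Any family named in a write is accepted (the Accumulo behaviour as described above)."""

    def __init__(self):
        self.cells = {}

    def put(self, row, family, qualifier, value):
        # No schema check: a new family comes into existence on first write.
        self.cells[(row, family, qualifier)] = value
```

In this model the only difference is the membership check in `put`; everything else about storing a cell is the same.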
>>
>>
>> - Aggregation
>> Accumulo offers the ability to specify an aggregator for an individual
>> column family or column. This allows you to keep a row count, or summation
>> of numerical values that may be stored in a particular column. It would
>> appear the function has to operate on the subset of values stored for that
>> column in the table at a particular time since it keeps the aggregate
>> value in memory. So this may not be able to handle certain aggregation
>> functions like 'median' for instance. But functions like sum, max, min,
>> mean, and count should all be supportable.
>> I could not find a comparable feature within HBase, but HBase does offer
>> an atomic function called incrementColumnValue on the HTable class which,
>> it appears, can be leveraged to provide aggregation behavior.
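The aggregation behaviour described above can be sketched as a column that stores only a running value and folds each new write into it. This is a minimal sketch in plain Python, assuming the aggregator is a two-argument combine function; `AggregatingColumn` is a hypothetical name, not an Accumulo or HBase class.

```python
class AggregatingColumn:
    """Keeps one aggregate per row instead of every written value."""

    def __init__(self, combine):
        self.combine = combine  # assumed: a function (stored, new) -> stored
        self.cells = {}

    def put(self, row, value):
        if row in self.cells:
            self.cells[row] = self.combine(self.cells[row], value)
        else:
            self.cells[row] = value

    def get(self, row):
        return self.cells[row]


total = AggregatingColumn(lambda a, b: a + b)
largest = AggregatingColumn(max)
# Mean is carried as a (sum, count) pair and divided only when read.
mean_state = AggregatingColumn(lambda a, b: (a[0] + b[0], a[1] + b[1]))

for v in (4, 10, 1):
    total.put("row1", v)
    largest.put("row1", v)
    mean_state.put("row1", (v, 1))

running_sum, count = mean_state.get("row1")
mean = running_sum / count
```

Sum, max, min, count and mean all fit this shape because a fixed-size state is enough to absorb the next value. A median does not, since it needs every value seen so far, which matches the limitation noted above.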
>>
>>
>> - Column Visibility
>> This is the feature in Accumulo that allows tagging of the data at the
>> column level, which would primarily be used for classification markings
>> (in our scenario).
>> If we were to implement the same type of column visibility in HBase that
>> Accumulo supports, we would have potentially several options:
>> -Try to implement column visibility as a patch to HBase. Would be fun, but
>> may be a lot of work.
>> -Since the value of a particular column (cell, actually) is simply a byte
>> array, we could utilize a standard technique of encoding the visibility
>> level/classification in the column value itself.
>> -Since the number of columns is not pre-defined, adopt a convention
>> whereby each column "foo" gets an additional column added by our
>> infrastructure called "foo_visibility".
>> ** Note: We have a requirement to use PKI (digital certificates) for
>> authentication in our service stack. The relationship between PKI and
>> Kerberos currently used for Secure HBase is interesting; not quite sure
>> how the two would fit together in practice.
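The two client-side conventions listed above (encoding the marking in the value, or adding a companion "foo_visibility" column) can be sketched as follows. This is a minimal illustration in plain Python over an in-memory dict, not HBase or Accumulo code. The function names, the NUL separator, and the rule that an unlabeled column is readable by anyone are all assumptions made for the sketch.

```python
SUFFIX = "_visibility"


def encode(visibility, value):
    """Option: prefix the cell value with its marking (label must not contain NUL)."""
    return visibility.encode("utf-8") + b"\x00" + value


def decode(cell):
    label, _, value = cell.partition(b"\x00")
    return label.decode("utf-8"), value


def put_with_visibility(row, column, value, visibility):
    """Option: store the marking in a companion column named <column>_visibility."""
    row[column] = value
    row[column + SUFFIX] = visibility


def read_visible(row, authorizations):
    """Return only the columns whose marking is in the caller's authorizations."""
    visible = {}
    for column, value in row.items():
        if column.endswith(SUFFIX):
            continue  # companion columns are metadata, never returned as data
        label = row.get(column + SUFFIX)
        if label is None or label in authorizations:  # assumed: unlabeled means open
            visible[column] = value
    return visible
```

Either convention filters on the client, so it depends on every reader going through the same infrastructure layer; that is the main difference from a check enforced inside the store itself.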
>>
>> - Retrieving Data
>> Accumulo uses a Scanner object for all retrieval operations; a Scanner is
>> obtained from the Connector object. When retrieving all values for a
>> particular row, the _individual cells are each returned as a separate
>> entry_ by the Scanner iterator.
>> In HBase, you can use a Scan object (org.apache.hadoop.hbase.client.Scan)
>> or you can use a Get object, which allows you to retrieve a single row at
>> a time. In either case, the org.apache.hadoop.hbase.client.Result class is
>> returned, representing all of the requested data for that particular row.
>> In HBase, to set constraints on a query, you set an
>> org.apache.hadoop.hbase.filter.Filter object on the Scan object. Multiple
>> Filters may be set by using the FilterList object. In Accumulo, you call
>> the setScanIterators() method on the Scanner object, which enables the
>> appropriate iterators for use on the server before returning data.
>> ** Note: primary difference here is in the use of server-side iterators,
>> which Andy has correctly pointed out could be implemented via the
>> coprocessor framework.  We did some initial investigation into
>> coprocessors to see if we could implement this equivalent functionality,
>> but since we'd been directed to use Accumulo, we didn't have much
>> bandwidth to address this (also coprocessors were in their infancy at the
>> time).
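The difference in result shape described above (one entry per cell versus one result per row) can be sketched over a sorted list of cells. This is a minimal model in plain Python, not the Accumulo Scanner or HBase Scan API; `scan_cells`, `scan_rows` and the `keep` predicate are hypothetical names, with `keep` standing in for a filter or iterator applied before data is returned.

```python
from itertools import groupby

# Cells sorted by (row, family, qualifier), as a sorted key/value store would hold them.
cells = [
    (("row1", "cf", "a"), b"1"),
    (("row1", "cf", "b"), b"2"),
    (("row2", "cf", "a"), b"3"),
]


def scan_cells(sorted_cells, keep=lambda key, value: True):
    """Cell-oriented view: yield one (key, value) entry per matching cell."""
    for key, value in sorted_cells:
        if keep(key, value):
            yield key, value


def scan_rows(sorted_cells, keep=lambda key, value: True):
    """Row-oriented view: yield one (row, {(family, qualifier): value}) per row."""
    entries = scan_cells(sorted_cells, keep)
    for row, group in groupby(entries, key=lambda entry: entry[0][0]):
        yield row, {(key[1], key[2]): value for key, value in group}


only_a = lambda key, value: key[2] == "a"
per_cell = list(scan_cells(cells))
per_row = list(scan_rows(cells))
filtered_rows = list(scan_rows(cells, only_a))
```

The row-oriented view here is just the cell-oriented view grouped by row key, which only works because the cells arrive in sorted order.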
>>
>>
>>
>> -----
>>
>>
>> Hope that helps.  Bottom line is that I believe that the features in
>> Accumulo can and ought to be merged into HBase at some point (assuming the
>> technical merits hold up).  Looking forward to contributing to that
>> conversation.
>>
>> Thanks,
>> Duane
>>
>> On 9/3/11 2:21 PM, "Stack" <stack@duboce.net> wrote:
>>
>>>
>>> I'd suggest we refocus this thread on how to respond to the Accumulo
>>> proposal (or whether to respond at all), since that's what we 'know'.
>>> I think it'd be useful correcting at least the 'unlikely tos' with
>>> pointers to committed code.
>>>
>>> Code overlap, if any, can be addressed when the code drop happens.
>>>
>>> St.Ack
>>>
>>
>



-- 
Bradford Stephens,
Founder, Drawn to Scale
http://drawntoscale.com
(530) 763-DATA

http://www.drawntoscale.com -- Spire, the scalable database with
real-time queries and fulltext search.

http://www.roadtofailure.com -- The Fringes of Scalability, Startups
and Computer Science
