hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Mixing Puts and Deletes in a single RPC
Date Fri, 06 Jul 2012 17:44:36 GMT
Well then, I guess if you really want to save space... 

Take your DAO... add a method that takes the fields, writes them to a JSON string, and converts
it to a byte array. (Avro would also work.) 
If a field is null, you just don't add it to the output string.

That would take care of all of your overhead issues with the meta data. 

Problem solved...

Now you don't have to worry about your meta data storage space since everything is all in
a single byte array in a single column.
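
A minimal sketch of that idea, assuming a hypothetical RecordDao with string fields (the JSON is hand-rolled here purely for illustration; in practice you'd use Avro or a JSON library):

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a hypothetical DAO that serializes its non-null fields
// to one JSON string, returned as a byte array for a single column.
public class RecordDao {
    private final Map<String, String> fields = new LinkedHashMap<>();

    public void setField(String name, String value) {
        fields.put(name, value);
    }

    // Null fields are skipped entirely, so they cost no bytes in the output.
    public byte[] toJsonBytes() {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (e.getValue() == null) {
                continue; // a null field simply doesn't appear
            }
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue()).append('"');
            first = false;
        }
        sb.append('}');
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

That one byte array goes into one column, so you pay the KeyValue metadata (row key, family, qualifier, timestamp) once per record instead of once per field.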


Of course this doesn't help if you want to filter your queries based on individual values.

(I guess you could write a filter that takes a JSON string, the field name, value, and comparison
type as input and then outputs true/false. This would be server side and would have to be placed
on every node, but... it would work.)
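
The core predicate such a filter would evaluate might look like this (sketch only; the regex-based field extraction is deliberately naive, and a real server-side filter would subclass HBase's FilterBase and use a proper JSON parser):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the predicate a hypothetical JSON-aware filter would run
// server side: given the serialized record, does the named field
// compare as requested?
public class JsonFieldPredicate {
    public enum Op { EQ, NE }

    public static boolean matches(String json, String field,
                                  String expected, Op op) {
        // Naive extraction: find "field":"value" (string values only).
        Pattern p = Pattern.compile(
            "\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        if (!m.find()) {
            return false; // field was null at write time, so it isn't present
        }
        boolean eq = m.group(1).equals(expected);
        return (op == Op.EQ) ? eq : !eq;
    }
}
```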

Or you could parse out those fields that you need to help identify the record in terms of
filtering, and store them in separate columns.  So you store the main record in a column,
and then individual fields.... 

Again, no deletes necessary and of course not a lot of additional overhead. 

Without really understanding your use case and how you use the data, it's hard to determine
what's optimal. 

The approach in my first post, while not optimal in terms of storage, is pretty straightforward, simple
to implement, and faster in terms of I/O access.... 
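
For what it's worth, the put-vs-delete split you describe in the message quoted below boils down to this kind of partitioning (HBase client types elided; sketch only):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: partition a record's columns. Non-null values get written;
// a null value means the source system dropped the column, so it gets
// deleted. The HBase Put/Delete objects themselves are elided.
public class RowMutationSplit {
    public final Map<String, String> toPut = new LinkedHashMap<>();
    public final List<String> toDelete = new ArrayList<>();

    public static RowMutationSplit split(Map<String, String> record) {
        RowMutationSplit s = new RowMutationSplit();
        for (Map.Entry<String, String> e : record.entrySet()) {
            if (e.getValue() == null) {
                s.toDelete.add(e.getKey());
            } else {
                s.toPut.put(e.getKey(), e.getValue());
            }
        }
        return s;
    }
}
```

If I remember the 0.94-era client API right, the toPut entries become Put.add(family, qualifier, value) calls, the toDelete entries become Delete.deleteColumns(family, qualifier) calls, and both Row objects go into one HTable.batch(List<Row>) call; HBASE-3584's RowMutations/mutateRow would make the pair atomic, if you ever need that.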

HTH... 

On Jul 6, 2012, at 10:42 AM, Keith Wyss wrote:

> Hi Michael,
> 
> Thank you for your reply.
> 
> I will answer your questions one by one.
> 
> ---I have to ask... why are you deleting anything from your columns?
> 
> 
> In the event that there is a null value in a source system, we have to
> reflect that somehow and assume that there could be a value in HBase and
> that the source system has recently dropped the value. The options are a
> Put or a Delete. Delete is preferable because it reduces disk space.
> 
> 
> ---Your current logic is to take a field that contains a NULL and then
> delete the contents from HBase. Why?
> No really, Why?
> 
> 
> Our current logic is to write a masking KeyValue with an empty value,
> which makes the most recent value empty. Unfortunately this takes up
> a quarter of our space in the grid. If truly deleting can reclaim 25% of
> the HBase space, that sounds awesome, and should speed up reads with fewer
> KeyValues.
> 
> ---You then have to make your DAO object class more complex when reading
> from HBase because you need to account for the fields that are NULL.
> 
> 
> This is a non-issue for us. It already accounts for nulls.
> 
> ----If you just insert the NULL value in the column, you don't have that
> issue.
> Do you waste disk space? Sure. But really, does that matter?
> 
> Yes. The accompanying metadata eats up a quarter of our space.
> 
> ----For your specific use case, you may actually want to track
> (versioning) the values in that column. So that if there was a value and
> then it's gone, you're going to want to know its history.
> 
> I think this is the most applicable downside to propagating deletes
> instead of puts.
> 
> 
> Storing a null in a row store makes a lot of sense to me. You eat a fixed
> amount of space depending on the size of the row. In a sparse column store
> like HBase, the frame and metadata are a real overhead. There's definitely
> a tradeoff between this storage and the benefits of simply treating HBase
> like a row store, and that's why I am curious if other engineers have
> addressed this and are willing to weigh in.
> 
> Thanks for your suggestions. I think the point about versioning is very
> good and I will think long about that.
> 
> Keith
> 
> 
> On 7/6/12 7:51 AM, "Michael Segel" <michael_segel@hotmail.com> wrote:
> 
>> I was going to post this yesterday, but real work got in the way...
>> 
>> I have to ask... why are you deleting anything from your columns?
>> 
>> The reason I ask is that you're sync'ing an object from an RDBMS to
>> HBase. While HBase allows a field that contains NULL simply not to
>> exist, your RDBMS doesn't. 
>> 
>> Your current logic is to take a field that contains a NULL and then
>> delete the contents from HBase. Why?
>> No really, Why? 
>> 
>> You then have to make your DAO object class more complex when reading
>> from HBase because you need to account for the fields that are NULL.
>> (In your use case, the DAO is reading/writing against NoSQL and RDBMSs.
>> So it's a consistency issue.)
>> 
>> If you just insert the NULL value in the column, you don't have that
>> issue. 
>> Do you waste disk space? Sure. But really, does that matter?
>> 
>> For your specific use case, you may actually want to track (versioning)
>> the values in that column. So that if there was a value and then it's
>> gone, you're going to want to know its history.
>> (For Auditing at a minimum.)
>> 
>> I don't know, it's your application. The point is that you are making
>> things more complex and you should think about the alternatives in design
>> and the true cost differences.
>> 
>> HTH
>> 
>> -Mike
>> 
>> On Jul 5, 2012, at 1:28 PM, Ted Yu wrote:
>> 
>>> Take a look at HBASE-3584: Allow atomic put/delete in one call
>>> It is in 0.94, meaning it is not even in CDH4
>>> 
>>> Cheers
>>> 
>>> On Thu, Jul 5, 2012 at 11:19 AM, Keith Wyss <keith.wyss@explorys.com>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> My organization has been doing something zany to simulate atomic row
>>>> operations in HBase.
>>>> 
>>>> We have a converter-object model for the writables that are populated
>>>> in
>>>> an HBase table, and one of the governing assumptions
>>>> is that if you are dealing with an Object record, you read all the
>>>> columns
>>>> that compose it out of HBase or a different data source.
>>>> 
>>>> When we read lots of data in from a source system that we are trying to
>>>> mirror with HBase, if a column is null that means that whatever is
>>>> in HBase for that column is no longer valid. We have simulated what I
>>>> believe is now called an AtomicRowMutation by using a single Put
>>>> and populating it with blanks. The downside is the wasted space
>>>> accrued by
>>>> the metadata for the blank columns.
>>>> 
>>>> Atomicity is not of utmost importance to us, but performance is. My
>>>> approach has been to create a Put and Delete object for a record and
>>>> populate the Delete with the null columns. Then we call
>>>> HTable.batch(List<Row>) on a bunch of these. It is my impression that
>>>> this
>>>> shouldn't appreciably increase network traffic as the RPC calls will be
>>>> bundled.
>>>> 
>>>> Has anyone else addressed this problem? Does this seem like a
>>>> reasonable
>>>> approach?
>>>> What sort of performance overhead should I expect?
>>>> 
>>>> Also, I've seen some Jira tickets about making this an atomic
>>>> operation in
>>>> its own right. Is that something that
>>>> I can expect with CDH3U4?
>>>> 
>>>> Thanks,
>>>> 
>>>> Keith Wyss
>>>> 
>> 
>> 
> 
> 
> 

