hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Using HBase for Deduping
Date Fri, 15 Feb 2013 17:24:55 GMT

Surround with a Try Catch? 
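
For instance, a minimal sketch (against the 0.94-era HTable API; the column family,
qualifier, and helper names here are made up for illustration):

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupPut {
    private static final byte[] CF   = Bytes.toBytes("e");    // made-up family
    private static final byte[] QUAL = Bytes.toBytes("seen"); // made-up qualifier

    // Insert the UUID row only if it isn't there yet; true means "new event".
    public static boolean putIfAbsent(HTable table, byte[] uuid) {
        Put put = new Put(uuid);
        put.add(CF, QUAL, Bytes.toBytes(1L));
        try {
            // A null expected value tells checkAndPut to apply the Put only
            // when the cell does not already exist.
            return table.checkAndPut(uuid, CF, QUAL, null, put);
        } catch (IOException e) {
            // The try/catch suggested above: log and decide whether to retry
            // or route the event into a later de-dupe pass.
            return false;
        }
    }
}

If I remember the semantics right, a null expected value means "put only if the cell
is absent", so a missing row should make the call succeed and return true rather than
throw. Then it's just: if (putIfAbsent(table, uuid)) { pushEvent(event); }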

But it sounds like you're on the right path. 

Happy Coding!

On Feb 15, 2013, at 11:12 AM, Rahul Ravindran <rahulrv@yahoo.com> wrote:

> I had tried checkAndPut yesterday with a null passed as the value, and it had thrown an
> exception when the row did not exist. Perhaps I was doing something wrong. Will try that
> again, since, yes, I would prefer a checkAndPut().
> ________________________________
> From: Michael Segel <michael_segel@hotmail.com>
> To: user@hbase.apache.org 
> Cc: Rahul Ravindran <rahulrv@yahoo.com> 
> Sent: Friday, February 15, 2013 4:36 AM
> Subject: Re: Using HBase for Deduping
> On Feb 15, 2013, at 3:07 AM, Asaf Mesika <asaf.mesika@gmail.com> wrote:
>> Michael, this means a read for every write?
> Yes and no. 
> At the macro level, a read for every write would mean that your client would read a record
> from HBase and then, based on some logic, either write a record or not.
> So you have a lot of overhead in the initial get() and then the put().
> At this macro level, with a checkAndPut() you have less overhead because there is only a
> single message to HBase.
> Internal to HBase, it would still have to check the value in the row, if it exists, and
> then perform the insert or not.
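
To make the macro-level overhead concrete, here is what the read-for-every-write
version could look like (same made-up schema as the sketch above; two RPCs per
event, and not atomic):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NaiveDedup {
    private static final byte[] CF   = Bytes.toBytes("e");    // made-up family
    private static final byte[] QUAL = Bytes.toBytes("seen"); // made-up qualifier

    public static boolean putIfAbsentNaive(HTable table, byte[] uuid)
            throws IOException {
        if (table.exists(new Get(uuid))) {  // round trip 1: the get()
            return false;                   // already seen, drop the event
        }
        Put put = new Put(uuid);
        put.add(CF, QUAL, Bytes.toBytes(1L));
        table.put(put);                     // round trip 2: the put()
        return true;
    }
}

Besides the extra round trip, another client can slip a put in between the exists()
and the put(), which checkAndPut() avoids by doing the check atomically on the
region server.
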
> With respect to your billion events an hour...
> Dividing by 3600 gives the number of events in a second: 1,000,000,000 / 3600 is roughly
> 278,000, so you would have fewer than 300,000 events a second.
> What exactly are you doing and how large are those events? 
> Since you are processing these events in a batch job, timing doesn't appear to be that
> important, and of course there is also asynchbase, which may improve some of the performance.

> YMMV, but this is a good example of the checkAndPut() pattern.
>> On Friday, February 15, 2013, Michael Segel wrote:
>>> What constitutes a duplicate?
>>> An oversimplification is to do an HTable.checkAndPut() where you do the
>>> put if the column doesn't exist.
>>> Then, if the row is inserted (a TRUE return value), you push the event.
>>> That will do what you want.
>>> At least at first blush.
>>> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <viral.bajaria@gmail.com>
>>> wrote:
>>>> Given the size of the data (> 1B rows) and the frequency of the job run (once
>>>> per hour), I don't think your optimal solution is to look up HBase for
>>>> every single event. You will benefit more by loading the HBase table
>>>> directly in your MR job.
>>>> In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>>> Also, once you have done the de-dupe, are you going to use the data again in
>>>> some other way, i.e. online serving of traffic or some other analysis? Or
>>>> is this just to compute some unique #'s?
>>>> It will be more helpful if you describe your final use case of the computed
>>>> data too. Given the amount of back and forth, we can take it off list too
>>>> and summarize the conversation for the list.
>>>> On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <rahulrv@yahoo.com>
>>>> wrote:
>>>>> We can't rely on the assumption that event dupes will not dupe outside the
>>>>> hour boundary. So your take is that doing a lookup per event within the
>>>>> MR job is going to be bad?
>>>>> ________________________________
>>>>> From: Viral Bajaria <viral.bajaria@gmail.com>
>>>>> To: Rahul Ravindran <rahulrv@yahoo.com>
>>>>> Cc: "user@hbase.apache.org" <user@hbase.apache.org>
>>>>> Sent: Thursday, February 14, 2013 12:48 PM
>>>>> Subject: Re: Using HBase for Deduping
>>>>> You could go with a 2-pronged approach here, i.e. some MR and some HBase
>>>>> lookups. I don't think this is the best solution either, given the # of
>>>>> events you will get.
>>>>> FWIW, the solution below again relies on the assumption that if an event is
>>>>> duped in the same hour it won't have a dupe outside of that hour boundary.
>>>>> If it can, then you are better off running an MR job with the current
>>>>> hour + another 3 hours of data, or an MR job with the current hour +
>>>>> the HBase table as input to the job too (i.e. no HBase lookups, just read
>>>>> the HFile directly; see the sketch below).
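
(A fragment of what wiring the table in as MR input could look like; strictly this
scans through the region servers via TableInputFormat rather than reading the HFiles
literally, but it avoids the per-event lookups. Table and mapper names are made up.)

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TableAsInput {

    public static class SeenUuidMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            // Emit each stored UUID rowkey so a reducer can join it against
            // the current hour's events.
            ctx.write(new Text(row.get()), new Text("seen"));
        }
    }

    public static void configure(Job job) throws IOException {
        Scan scan = new Scan();
        scan.setCaching(500);        // bigger scanner caching for a batch pass
        scan.setCacheBlocks(false);  // don't churn the block cache from MR
        TableMapReduceUtil.initTableMapperJob(
            "seen_uuids", scan, SeenUuidMapper.class,
            Text.class, Text.class, job);
    }
}
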
>>>>> - Run an MR job which de-dupes events for the current hour, i.e. only runs on
>>>>> 1 hour's worth of data (a skeleton of this step follows the list).
>>>>> - Mark records which you were not able to de-dupe in the current run.
>>>>> - For the records that you were not able to de-dupe, check against HBase
>>>>> whether you saw that event in the past. If you did, you can drop the
>>>>> current event or update the event to the new value (based on your business
>>>>> logic).
>>>>> - Save all the de-duped events (via HBase bulk upload).
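
A skeleton of that first step, since it is the core of the approach (plain Hadoop MR;
the tab-separated "<uuid>\t<payload>" record format is made up, and the cross-hour
HBase check and the bulk upload are left out):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HourlyDedup {

    public static class UuidMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Key each event by its UUID so all dupes meet in one reduce call.
            String[] parts = line.toString().split("\t", 2);
            ctx.write(new Text(parts[0]), line);
        }
    }

    public static class FirstWinsReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text uuid, Iterable<Text> events, Context ctx)
                throws IOException, InterruptedException {
            // Keep the first event per UUID; anything after it is a dupe.
            // (Records that need the historical HBase check would be marked
            // here instead of silently dropped.)
            ctx.write(uuid, events.iterator().next());
        }
    }
}
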
>>>>> Sorry if I just rambled along, but without knowing the whole problem it's
>>>>> very tough to come up with a probable solution. So correct my assumptions
>>>>> and we could drill down more.
>>>>> Thanks,
>>>>> Viral
>>>>> On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <rahulrv@yahoo.com>
>>>>> wrote:
>>>>>> Most will be in the same hour. Some will be across 3-6 hours.
>>>>>> Sent from my phone. Excuse the terseness.
>>>>>> On Feb 14, 2013, at 12:19 PM, Viral Bajaria <viral.bajaria@gmail.com>
>>>>>> wrote:
>>>>>>> Are all these dupe events expected to be within the same hour, or can they
>>>>>>> happen over multiple hours?
>>>>>>> Viral
>>>>>>> From: Rahul Ravindran
>>>>>>> Sent: 2/14/2013 11:41 AM
>>>>>>> To: user@hbase.apache.org
>>>>>>> Subject: Using HBase for Deduping
>>>>>>> Hi,
>>>>>>>   We have events which are delivered into our HDFS cluster which may
>>>>>>> be duplicated. Each event has a UUID and we were hoping to leverage

Michael Segel  | (m) 312.755.9623

Segel and Associates
