Subject: Re: Using HBase for Deduping
From: Viral Bajaria
To: Rahul Ravindran
Cc: user@hbase.apache.org
Date: Thu, 14 Feb 2013 12:48:30 -0800

You could go with a 2-pronged approach here, i.e. some MR and some HBase
lookups. I don't think this is the best solution either, given the # of
events you will get. FWIW, the solution below again relies on the assumption
that if an event is duped in the same hour it won't have a dupe outside of
that hour boundary. If it can, then you are better off running an MR job
over the current hour + another 3 hours of data, or an MR job with the
current hour + the HBase table as input to the job too (i.e. no HBase
lookups, just read the HFiles directly).

- Run an MR job which de-dupes events for the current hour, i.e. only runs
  on 1 hour worth of data.
- Mark records which you were not able to de-dupe in the current run.
- For the records that you were not able to de-dupe, check against HBase
  whether you saw that event in the past (rough sketch of this lookup
  below). If you did, you can drop the current event or update the event to
  the new value (based on your business logic).
- Save all the de-duped events (via HBase bulk upload).
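To make that lookup step concrete, here is a minimal sketch of a batched
check against a dedupe table keyed by UUID. It is only an illustration of
the idea: the table name "event_dedupe" and the "d:ts" column are made-up
placeholders, and it writes with plain puts rather than the bulk-upload
path mentioned above.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupeCheck {
  private static final byte[] FAM = Bytes.toBytes("d");
  private static final byte[] QUAL = Bytes.toBytes("ts");

  // Returns the subset of UUIDs HBase has never seen, and marks them as seen.
  public static List<String> filterNew(List<String> uuids) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "event_dedupe");
    try {
      List<Get> gets = new ArrayList<Get>(uuids.size());
      for (String uuid : uuids) {
        gets.add(new Get(Bytes.toBytes(uuid)));
      }
      // Batched multi-get instead of one RPC per UUID.
      Result[] results = table.get(gets);

      List<String> unseen = new ArrayList<String>();
      List<Put> puts = new ArrayList<Put>();
      for (int i = 0; i < results.length; i++) {
        if (results[i].isEmpty()) {          // never seen before -> keep it
          String uuid = uuids.get(i);
          unseen.add(uuid);
          Put p = new Put(Bytes.toBytes(uuid));
          p.add(FAM, QUAL, Bytes.toBytes(System.currentTimeMillis()));
          puts.add(p);
        }
      }
      table.put(puts);                       // mark the new UUIDs as seen
      return unseen;
    } finally {
      table.close();
    }
  }
}

Batching the gets keeps the RPC count down compared to a get per event, but
at your volume you'd still want to compare this against feeding the table's
HFiles straight into the MR job as a second input.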
Sorry if I just rambled along, but without knowing the whole problem it's
very tough to come up with a probable solution. So correct my assumptions
and we could drill down more.

Thanks,
Viral

On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran wrote:
> Most will be in the same hour. Some will be across 3-6 hours.
>
> Sent from my phone. Excuse the terseness.
>
> On Feb 14, 2013, at 12:19 PM, Viral Bajaria wrote:
>
> > Are all these dupe events expected to be within the same hour, or can
> > they happen over multiple hours?
> >
> > Viral
> >
> > From: Rahul Ravindran
> > Sent: 2/14/2013 11:41 AM
> > To: user@hbase.apache.org
> > Subject: Using HBase for Deduping
> >
> > Hi,
> > We have events which are delivered into our HDFS cluster which may
> > be duplicated. Each event has a UUID and we were hoping to leverage
> > HBase to dedupe them. We run a MapReduce job which would perform a
> > lookup for each UUID on HBase and then emit the event only if the UUID
> > was absent, and would also insert into the HBase table (this is
> > simplistic, I am missing out details to make this more resilient to
> > failures). My concern is that doing a Read+Write for every event in MR
> > would be slow (we expect around 1 billion events every hour). Does
> > anyone use HBase for a similar use case, or is there a different
> > approach to achieving the same end result? Any information or comments
> > would be great.
> >
> > Thanks,
> > ~Rahul.
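For reference, the per-event Read+Write described in the quoted post above
would look roughly like the mapper below. This is a sketch of that pattern,
not a recommendation: the table name "event_dedupe", the "d:seen" column,
and the extractUuid() helper are placeholders, and checkAndPut is used so
the lookup and the insert happen atomically.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DedupeMapper extends Mapper<LongWritable, Text, Text, Text> {
  private static final byte[] FAM = Bytes.toBytes("d");
  private static final byte[] QUAL = Bytes.toBytes("seen");
  private HTable table;

  protected void setup(Context context) throws IOException {
    table = new HTable(HBaseConfiguration.create(), "event_dedupe");
  }

  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String uuid = extractUuid(value);          // however the UUID is parsed out
    byte[] row = Bytes.toBytes(uuid);
    Put p = new Put(row);
    p.add(FAM, QUAL, Bytes.toBytes(1L));
    // Atomic "insert if absent": succeeds only when d:seen does not exist yet.
    boolean firstTime = table.checkAndPut(row, FAM, QUAL, null, p);
    if (firstTime) {
      context.write(new Text(uuid), value);    // emit the event only once
    }
  }

  protected void cleanup(Context context) throws IOException {
    table.close();
  }

  private String extractUuid(Text event) {
    return event.toString().split("\t")[0];    // placeholder: UUID in first field
  }
}

The obvious cost is one RPC per event, which is exactly the concern raised
above, so the batched or bulk-input approach earlier in the thread is likely
the better starting point.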