Date: Fri, 22 Oct 2010 10:21:37 -0600 (MDT)
Subject: Re: Seeing duplicate entries
From: "Matt Davies"
To: chukwa-user@incubator.apache.org

Thank you for the insight.

"Ariel Rabkin" said:

> On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang wrote:
>> Hi Matt,
>>
>> The duplication filtering in Chukwa 0.3.0 depends on data loading to
>> mysql.  The same primary key will update to the same row to remove
>> duplicates.  It is possible to build a duplication detection process
>> prior to demux which filter data based on sequence id + data type +
>> csource (host), but this hasn't been implemented because primary key
>> update method works well for my use case.
>
> This isn't quite right. There is support in 0.3 and later versions for
> doing de-duplication at the collector, in the manner Eric describes.
> It works as a filter in the writer pipeline.
>
> You need the following in your configuration:
>
> <property>
>   <name>chukwaCollector.writerClass</name>
>   <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
> </property>
>
> <property>
>   <name>chukwaCollector.pipeline</name>
>   <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
> </property>
>
> See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for background
>
> --Ari
>
> --
> Ari Rabkin asrabkin@gmail.com
> UC Berkeley Computer Science Department
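
For readers who want to see the idea discussed in this thread (dropping repeats keyed on sequence id + data type + source host), here is a minimal Java sketch. It is not the code of Chukwa's Dedup writer. The class name DedupSketch, the method firstSeen, and the bounded-cache capacity are all hypothetical names chosen for illustration, and the sketch assumes a chunk can be identified by those three fields alone.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not Chukwa's Dedup implementation. */
class DedupSketch {
    // Bounded set of recently seen keys, oldest key evicted first.
    private final Set<String> seen;

    DedupSketch(final int capacity) {
        this.seen = Collections.newSetFromMap(
            new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    // Keep memory bounded: drop the oldest key once over capacity.
                    return size() > capacity;
                }
            });
    }

    /**
     * Returns true if this (source, dataType, seqId) combination has not been
     * seen recently and should be passed down the pipeline; false if it is a
     * duplicate and should be dropped.
     */
    boolean firstSeen(String source, String dataType, long seqId) {
        // Assumes source and dataType contain no tab characters.
        String key = source + '\t' + dataType + '\t' + seqId;
        return seen.add(key);
    }
}
```

A filter stage would call firstSeen for each incoming chunk and forward only those for which it returns true. Because the cache is bounded, a duplicate that arrives after its key has been evicted is treated as new, so the capacity has to be sized for how late a retransmitted chunk can show up.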