From: Eric Yang
To: "chukwa-user@incubator.apache.org"
Date: Fri, 22 Oct 2010 09:46:35 -0700
Subject: Re: Seeing duplicate entries

Note that the Dedup filter is only good for a single collector. If you use multiple collectors, it will not help.

Regards,
Eric

On 10/22/10 9:21 AM, "Matt Davies" wrote:

> Thank you for the insight.
>
> "Ariel Rabkin" said:
>
>> On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang wrote:
>>> Hi Matt,
>>>
>>> The duplication filtering in Chukwa 0.3.0 depends on data loading to
>>> MySQL. Records with the same primary key update the same row, which
>>> removes duplicates. It is possible to build a duplication detection
>>> process prior to demux that filters data based on sequence id + data
>>> type + csource (host), but this hasn't been implemented because the
>>> primary key update method works well for my use case.
>>
>> This isn't quite right. There is support in 0.3 and later versions for
>> doing de-duplication at the collector, in the manner Eric describes.
>> It works as a filter in the writer pipeline.
>>
>> You need the following in your configuration:
>>
>> <property>
>>   <name>chukwaCollector.writerClass</name>
>>   <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
>> </property>
>>
>> <property>
>>   <name>chukwaCollector.pipeline</name>
>>   <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
>> </property>
>>
>> See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for
>> background
>>
>> --Ari
>>
>> --
>> Ari Rabkin asrabkin@gmail.com
>> UC Berkeley Computer Science Department
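
The de-duplication key mentioned in the thread (sequence id + data type + source host) and the single-collector limitation can be illustrated with a small sketch. This is not Chukwa's Dedup implementation; the `Chunk` and `DedupFilter` names, their fields, and the bounded-cache eviction policy are all assumptions made for illustration only.

```python
from collections import OrderedDict
from typing import Iterable, Iterator, NamedTuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for a chunk of collected data (illustrative fields)."""
    source: str      # host that produced the data
    data_type: str   # e.g. "SysLog"
    seq_id: int      # sequence id assigned by the sender
    payload: str


class DedupFilter:
    """Drops chunks whose (source, data_type, seq_id) key was seen recently.

    The state lives in memory inside one filter instance, which mirrors the
    limitation noted in the thread: one instance cannot see what another saw.
    """

    def __init__(self, max_keys: int = 10_000) -> None:
        self.max_keys = max_keys
        self._seen: "OrderedDict[tuple, None]" = OrderedDict()

    def is_duplicate(self, chunk: Chunk) -> bool:
        key = (chunk.source, chunk.data_type, chunk.seq_id)
        if key in self._seen:
            return True
        self._seen[key] = None
        if len(self._seen) > self.max_keys:
            self._seen.popitem(last=False)  # evict the oldest remembered key
        return False

    def filter(self, chunks: Iterable[Chunk]) -> Iterator[Chunk]:
        for chunk in chunks:
            if not self.is_duplicate(chunk):
                yield chunk


first = Chunk("host1", "SysLog", 100, "line A")
resent = Chunk("host1", "SysLog", 100, "line A")  # same key, sent again

# One collector receives both copies: the second one is dropped.
single = DedupFilter()
print(len(list(single.filter([first, resent]))))  # prints 1

# Two collectors each receive one copy: neither sees a duplicate.
collector_a, collector_b = DedupFilter(), DedupFilter()
kept = list(collector_a.filter([first])) + list(collector_b.filter([resent]))
print(len(kept))  # prints 2
```

With several collectors, duplicates that land on different collectors would have to be removed by some shared step further downstream, such as the primary-key update on database load described earlier in the thread.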