incubator-chukwa-user mailing list archives

From Eric Yang <ey...@yahoo-inc.com>
Subject Re: Seeing duplicate entries
Date Fri, 22 Oct 2010 16:46:35 GMT
Note, Dedup at the collector only works with a single collector.  If you use
multiple collectors, it will not help.
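
To illustrate why, here is a minimal sketch of a per-collector duplicate
filter. It is not Chukwa's actual Dedup code: the names Chunk and DedupStage,
the capacity value, and the "SysLog" data type are all hypothetical. It only
assumes that each collector keeps its own in-memory record of recently seen
(source, data type, sequence id) keys.

```python
from collections import OrderedDict
from typing import NamedTuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for a chunk of collected data."""
    source: str      # originating host
    data_type: str   # data type of the stream
    seq_id: int      # sequence id within the stream
    data: bytes


class DedupStage:
    """Hypothetical per-collector filter; each instance has private state."""

    def __init__(self, capacity: int = 5000):
        self.capacity = capacity       # assumed bound on remembered keys
        self._seen = OrderedDict()     # insertion-ordered set of keys

    def accept(self, chunk: Chunk) -> bool:
        """Return True if the chunk is new to THIS collector."""
        key = (chunk.source, chunk.data_type, chunk.seq_id)
        if key in self._seen:
            return False               # duplicate: drop it
        self._seen[key] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest key
        return True


chunk = Chunk("host1", "SysLog", 1024, b"one log line\n")
collector_a = DedupStage()
collector_b = DedupStage()

first = collector_a.accept(chunk)        # new to collector A
retry_same = collector_a.accept(chunk)   # resent to A: filtered out
retry_other = collector_b.accept(chunk)  # resent to B: B never saw it
```

In this sketch the resend to the same collector is dropped, but the resend
to the second collector passes, because the two filters share no state.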

Regards,
Eric

On 10/22/10 9:21 AM, "Matt Davies" <matt.davies@tynt.com> wrote:

> Thank you for the insight.
> 
> "Ariel Rabkin" <asrabkin@gmail.com> said:
> 
>> On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <eric818@gmail.com> wrote:
>>> Hi Matt,
>> 
>> 
>>> 
>>> The duplication filtering in Chukwa 0.3.0 depends on loading data into
>>> MySQL.  Records with the same primary key update the same row, which
>>> removes duplicates.  It is possible to build a duplicate detection process
>>> prior to demux that filters data based on sequence id + data type +
>>> csource (host), but this hasn't been implemented because the primary key
>>> update method works well for my use case.
>> 
>> This isn't quite right. There is support in 0.3 and later versions for
>> doing de-duplication at the collector, in the manner Eric describes.
>> It works as a filter in the writer pipeline.
>> 
>> You need the following in your configuration:
>> 
>> <property>
>>   <name>chukwaCollector.writerClass</name>
>>   <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
>> </property>
>> 
>> <property>
>>   <name>chukwaCollector.pipeline</name>
>>   <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
>> </property>
>> 
>> 
>> See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for
>> background
>> 
>> 
>> --Ari
>> 
>> --
>> Ari Rabkin asrabkin@gmail.com
>> UC Berkeley Computer Science Department
>> 
> 
> 
> 

