incubator-chukwa-user mailing list archives

From "Matt Davies" <matt.dav...@tynt.com>
Subject Re: Seeing duplicate entries
Date Fri, 22 Oct 2010 19:23:11 GMT
Eric,

I've been weighing several ideas on where to put the correction in our system.  Upon
investigation, it seems that two separate demux operations see the duplicate record, so doing
some sort of distinct within demux seems unreliable for our use.

It appears you are putting data into a database and using the db to enforce the uniqueness
constraint.  Do you see any way we could do a dedup operation after demux (within the chukwa
environment) if we write our data straight into HDFS?

I could see writing a simple MR job to go and figure this stuff out for me, but it seems very
inelegant and introduces more delay before I can utilize the data.
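
For reference, a minimal sketch of what such a job would compute, assuming each record carries the sequence id, data type, and source host mentioned further down the thread. The `Record` type and its field names are hypothetical and are not Chukwa's actual schema; in a real MR job the key would be emitted by the mapper and the reducer would keep one record per key.

```python
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    # Hypothetical fields for illustration; not Chukwa's record format.
    seq_id: int
    data_type: str
    source: str
    body: str


def dedup(records: Iterable[Record]) -> Iterator[Record]:
    """Yield only the first record seen for each (seq_id, data_type, source)."""
    seen = set()
    for rec in records:
        key = (rec.seq_id, rec.data_type, rec.source)
        if key in seen:
            continue  # same key already emitted, so treat it as a duplicate
        seen.add(key)
        yield rec


sample = [
    Record(100, "SysLog", "host-a", "line 1"),
    Record(100, "SysLog", "host-a", "line 1"),  # duplicate delivery
    Record(100, "SysLog", "host-b", "line 1"),  # different source, kept
]
unique = list(dedup(sample))
```

The in-memory set only works when one process sees all the data, which is the same limitation as a single-collector filter; across many files the grouping would have to be done by the shuffle on that key.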

Any other thoughts?

"Eric Yang" <eyang@yahoo-inc.com> said:

> Note, the Dedup filter is only effective with a single collector.  If you use
> multiple collectors, it will not help.
> 
> Regards,
> Eric
> 
> On 10/22/10 9:21 AM, "Matt Davies" <matt.davies@tynt.com> wrote:
> 
>> Thank you for the insight.
>>
>> "Ariel Rabkin" <asrabkin@gmail.com> said:
>>
>>> On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <eric818@gmail.com> wrote:
>>>> Hi Matt,
>>>
>>>
>>>>
>>>> The duplication filtering in Chukwa 0.3.0 depends on loading data into
>>>> MySQL.  A record with the same primary key updates the same row, which
>>>> removes duplicates.  It is possible to build a duplication detection process
>>>> prior to demux that filters data based on sequence id + data type +
>>>> csource (host), but this hasn't been implemented because the primary key
>>>> update method works well for my use case.
>>>
>>> This isn't quite right. There is support in 0.3 and later versions for
>>> doing de-duplication at the collector, in the manner Eric describes.
>>> It works as a filter in the writer pipeline.
>>>
>>> You need the following in your configuration:
>>>
>>> <property>
>>>   <name>chukwaCollector.writerClass</name>
>>>   <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
>>> </property>
>>>
>>> <property>
>>>   <name>chukwaCollector.pipeline</name>
>>>   <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
>>> </property>
>>>
>>>
>>> See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for
>>> background
>>>
>>>
>>> --Ari
>>>
>>> --
>>> Ari Rabkin asrabkin@gmail.com
>>> UC Berkeley Computer Science Department
>>>
>>
>>
>>
> 
> 


