chukwa-user mailing list archives

From Eric Yang <>
Subject Re: Seeing duplicate entries
Date Fri, 22 Oct 2010 15:48:27 GMT
Hi Matt,

This is expected in the Chukwa archives.  When an agent is unable to post to
a collector, it retries the same data against another collector, or
against the same collector when no other is available.  Under high load,
a collector may also have written data without the acknowledgement
making it back to the agent.  Chukwa's philosophy is to retry until an
acknowledgement is received; duplicate data is filtered after the data
has been received.

Duplicate filtering in Chukwa 0.3.0 depends on loading the data into
MySQL: rows sharing the same primary key update the same row, which
removes the duplicates.  It would be possible to build a duplicate
detection process ahead of demux that filters data on sequence id +
data type + csource (host), but this hasn't been implemented because
the primary-key update method works well for my use case.

In Chukwa 0.5 we are treating duplication the same way as in Chukwa
0.3: HBase replaces any duplicated row based on Timestamp + HBase row
key.
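The reason this works is that an upsert keyed on (timestamp, row key)
is idempotent: writing the same row twice replaces the cell rather
than appending a second copy.  A minimal stand-in for that behavior,
using a plain map instead of HBase's API (all names here are
illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch (not the HBase API) of why replayed chunks collapse
// on load: a put with the same (timestamp, row key) replaces the
// existing value, so duplicate deliveries end up as one row.
public class UpsertSketch {
    record RowKey(long timestamp, String key) {}

    private final Map<RowKey, String> table = new HashMap<>();

    public void put(long ts, String rowKey, String value) {
        table.put(new RowKey(ts, rowKey), value); // same key => replace, not append
    }

    public int rowCount() {
        return table.size();
    }

    public static void main(String[] args) {
        UpsertSketch t = new UpsertSketch();
        t.put(1287762507L, "host-a/SysLog", "line-1");
        t.put(1287762507L, "host-a/SysLog", "line-1"); // duplicate delivery
        System.out.println(t.rowCount()); // one row survives
    }
}
```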


On Thu, Oct 21, 2010 at 8:22 PM, Matt Davies <> wrote:
> Hey everyone,
> I have a situation where I'm seeing duplicated data downstream, before the
> demux process. It appears this happens during high system loads, and we are
> still using the 0.3.0 series.
> We have validated that there is a single, unique entry in our source file,
> which then shows up a random number of times before we see it in demux. So
> it appears that duplication is happening somewhere between the agent and
> collector.
> Has anyone else seen this? Any ideas as to why we are seeing this during
> high system loads, but not during lower loads?
> TIA,
> Matt
