chukwa-user mailing list archives

From Eric Yang <>
Subject Re: Seeing duplicate entries
Date Sun, 24 Oct 2010 03:22:27 GMT
HBase stores only bytes.  What goes into each cell is decided by
the demux parser.  The parsers I implemented currently store Chukwa
data as byte strings.  Users have full control over the data type
stored in each HBase column by customizing the demux parser.
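
To illustrate the point about cells holding raw bytes, here is a minimal Python sketch of two ways a parser could encode the same numeric value before writing it. The function names are hypothetical and are not part of the Chukwa or HBase APIs; the sketch only shows that the choice of encoding belongs to whoever writes the parser.

```python
import struct


def encode_as_string(value):
    """Encode the value as a UTF-8 byte string (human-readable in the cell)."""
    return str(value).encode("utf-8")


def encode_as_long(value):
    """Encode the value as a fixed-width 8-byte big-endian signed integer."""
    return struct.pack(">q", value)


def decode_long(cell):
    """Reverse encode_as_long: read an 8-byte big-endian signed integer."""
    return struct.unpack(">q", cell)[0]


# The same metric value, stored two different ways.
cell_text = encode_as_string(1287890547)
cell_binary = encode_as_long(1287890547)
```

The byte-string form is easy to inspect but has variable width; the fixed-width form is compact but requires the reader to know the type in advance.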


On Fri, Oct 22, 2010 at 5:21 PM, Corbin Hoenes <> wrote:
> Eric, in Chukwa 0.5 is HBase the final store instead of HDFS?  What format
> will the HBase data be in (e.g. a ChukwaRecord object? Something
> user-configurable?)
> On Oct 22, 2010, at 8:48 AM, Eric Yang <> wrote:
>> Hi Matt,
>> This is expected in Chukwa archives.  When an agent is unable to post to
>> the collector, it retries posting the same data to another collector,
>> or retries with the same collector when no other collector is
>> available.  Under high load, a collector may have written the data
>> without properly acknowledging it back to the agent.  The Chukwa
>> philosophy is to retry until an acknowledgement is received.  Duplicate
>> filtering is handled after the data has been received.
>> The duplicate filtering in Chukwa 0.3.0 depends on loading data into
>> MySQL.  Records with the same primary key update the same row, which
>> removes duplicates.  It is possible to build a duplicate detection process
>> prior to demux that filters data based on sequence ID + data type +
>> csource (host), but this hasn't been implemented because the primary key
>> update method works well for my use case.
>> In Chukwa 0.5, we treat duplication the same way as in Chukwa 0.3:
>> any duplicated row in HBase is replaced based on timestamp +
>> HBase row key.
>> regards,
>> Eric
>> On Thu, Oct 21, 2010 at 8:22 PM, Matt Davies <> wrote:
>>> Hey everyone,
>>> I have a situation where I'm seeing duplicated data downstream before the
>>> demux process. It appears this happens during high system loads and we are
>>> still using the 0.3.0 series.
>>> So, we have validated that there is a single, unique entry in our source
>>> file which then shows up a random number of times before we see it in demux.
>>> So, it appears that there is duplication happening somewhere between the
>>> agent and collector.
>>> Has anyone else seen this? Any ideas as to why we are seeing this during
>>> high system loads, but not during lower loads?
>>> TIA,
>>> Matt
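
The two de-duplication approaches discussed in this thread can be sketched in Python as follows. This is a minimal illustration, not Chukwa code: the field names (`seq_id`, `data_type`, `source`, `timestamp`, `body`), the row key layout, and both functions are assumptions made for the example, and an in-memory dict stands in for the MySQL or HBase table.

```python
def dedup_chunks(chunks):
    """Pre-demux filter: yield each chunk once, keyed on
    (sequence ID, data type, source host)."""
    seen = set()
    for chunk in chunks:
        key = (chunk["seq_id"], chunk["data_type"], chunk["source"])
        if key in seen:
            continue  # a retried post of data we already have
        seen.add(key)
        yield chunk


def upsert(table, row_key, timestamp, value):
    """Store-side approach: writing the same (row key, timestamp) twice
    overwrites the existing entry instead of adding a second one."""
    table[(row_key, timestamp)] = value


# Hypothetical data: the first chunk was posted twice because the
# acknowledgement was lost and the agent retried.
received = [
    {"seq_id": 100, "data_type": "SysLog", "source": "host1",
     "timestamp": 1287650000, "body": "line a"},
    {"seq_id": 100, "data_type": "SysLog", "source": "host1",
     "timestamp": 1287650000, "body": "line a"},
    {"seq_id": 200, "data_type": "SysLog", "source": "host1",
     "timestamp": 1287650060, "body": "line b"},
]

# Approach 1: filter before processing.
unique = list(dedup_chunks(received))

# Approach 2: load everything and let the key collapse duplicates.
table = {}
for chunk in received:
    row_key = chunk["source"] + "/" + chunk["data_type"]
    upsert(table, row_key, chunk["timestamp"], chunk["body"])
```

Either way, three received chunks end up as two stored records. The filter needs to remember the keys it has seen, whereas the upsert approach needs no extra state but only removes duplicates once the data reaches the final store.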
