chukwa-user mailing list archives

From Eric Yang <>
Subject Re: Seeing duplicate entries
Date Sun, 24 Oct 2010 03:33:11 GMT
Hi Bill,

I have started the documentation on this wiki page:

There is an architecture diagram describing the new setup.  Your
existing parser should work with Chukwa 0.5, and by adding Chukwa
annotations to the parser, it will stream data into the HBase table.
I recommend taking a look at the SystemMetrics demux parser; it's a
good example to follow for updating your existing parser to work with
Chukwa 0.5.

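A sketch of how those parser annotations drive HBase output. The `@Table`/`@Tables` annotations are re-declared here as stand-ins so the example compiles on its own; the real ones live in the Chukwa source tree, and `MyParser` is a hypothetical parser, not actual Chukwa code:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.ArrayList;
import java.util.List;

// Stand-ins mirroring the shape of Chukwa 0.5's demux parser annotations.
@Retention(RetentionPolicy.RUNTIME)
@interface Table { String name(); String columnFamily(); }

@Retention(RetentionPolicy.RUNTIME)
@interface Tables { Table[] annotations(); }

// A parser annotated the way the SystemMetrics demux parser is: the HBase
// writer inspects these annotations to decide which table and column
// family each parsed record lands in.
@Tables(annotations = {
    @Table(name = "SystemMetrics", columnFamily = "cpu"),
    @Table(name = "SystemMetrics", columnFamily = "memory")
})
class MyParser { /* parse(...) would emit records here */ }

public class AnnotationDemo {
    public static void main(String[] args) {
        List<String> targets = new ArrayList<>();
        Tables tables = MyParser.class.getAnnotation(Tables.class);
        for (Table t : tables.annotations()) {
            targets.add(t.name() + ":" + t.columnFamily());
        }
        // prints [SystemMetrics:cpu, SystemMetrics:memory]
        System.out.println(targets);
    }
}
```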
In the default chukwa-collector-conf.xml.template, there is a section
for HBase configuration.  Uncomment it, comment out the default
SeqFileWriter, and restart the collector; data should then appear in
HBase.

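For reference, the HBase section looks roughly like the following. This is a sketch from memory of the 0.5 template; verify the exact property and class names against your own chukwa-collector-conf.xml.template before relying on them:

```xml
<!-- Route collector output through the pipeline writer,
     with HBaseWriter as the stage that lands data in HBase. -->
<property>
  <name>chukwaCollector.writerClass</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
</property>
<property>
  <name>chukwaCollector.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
</property>
```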

On Sat, Oct 23, 2010 at 12:59 PM, Bill Graham <> wrote:
> Eric, I'm also curious about how the HBase integration works. Do you
> have time to write something up on it? I'm interested in the
> possibility of extending what's there to write my own custom data into
> HBase from a collector, while said data also continues through to HDFS
> as it does currently.
> On Fri, Oct 22, 2010 at 5:21 PM, Corbin Hoenes <> wrote:
>> Eric, in Chukwa 0.5 is HBase the final store instead of HDFS?  What format
>> will the HBase data be in? (e.g. a ChukwaRecord object? Something user
>> configurable?)
>> Sent from my iPhone
>> On Oct 22, 2010, at 8:48 AM, Eric Yang <> wrote:
>>> Hi Matt,
>>> This is expected in Chukwa archives.  When an agent is unable to post to
>>> a collector, it retries posting the same data to another collector, or
>>> retries the same collector when no other is available.  Under high load,
>>> a collector may also write data without the acknowledgement making it
>>> back to the agent.  The Chukwa philosophy is to retry until an
>>> acknowledgement is received; duplicated data is filtered after it has
>>> been received.
>>> The duplicate filtering in Chukwa 0.3.0 depends on loading the data into
>>> MySQL: rows with the same primary key update the same row, which removes
>>> duplicates.  It would be possible to build a duplicate-detection process
>>> prior to demux that filters data based on sequence id + data type +
>>> csource (host), but this hasn't been implemented because the primary-key
>>> update method works well for my use case.
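The pre-demux duplicate filter Eric describes could be sketched as follows; `Chunk` here is a hypothetical stand-in for Chukwa's chunk metadata, not the real class:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupFilter {
    // Hypothetical stand-in for a Chukwa chunk's metadata.
    record Chunk(long seqId, String dataType, String csource, String payload) {}

    // Drop a chunk if its (sequence id, data type, source host) key
    // has already been seen in this batch.
    static List<Chunk> filterDuplicates(List<Chunk> in) {
        Set<String> seen = new HashSet<>();
        List<Chunk> out = new ArrayList<>();
        for (Chunk c : in) {
            String key = c.seqId() + "/" + c.dataType() + "/" + c.csource();
            if (seen.add(key)) {   // add() returns false for duplicates
                out.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(
            new Chunk(100, "SysLog", "host1", "line A"),
            new Chunk(100, "SysLog", "host1", "line A"),  // retried duplicate
            new Chunk(200, "SysLog", "host1", "line B"));
        System.out.println(filterDuplicates(chunks).size());  // prints 2
    }
}
```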
>>> In Chukwa 0.5, we treat duplication the same way as in Chukwa 0.3:
>>> any duplicated row in HBase is replaced, based on timestamp + HBase
>>> row key.
>>> regards,
>>> Eric
>>> On Thu, Oct 21, 2010 at 8:22 PM, Matt Davies <> wrote:
>>>> Hey everyone,
>>>> I have a situation where I'm seeing duplicated data downstream, before the
>>>> demux process. It appears to happen during high system loads, and we are
>>>> still using the 0.3.0 series.
>>>> We have validated that there is a single, unique entry in our source
>>>> file, which then shows up a random number of times before we see it in demux.
>>>> So it appears that duplication is happening somewhere between the
>>>> agent and collector.
>>>> Has anyone else seen this? Any ideas as to why we see this during
>>>> high system loads, but not during lower loads?
>>>> TIA,
>>>> Matt
