incubator-chukwa-user mailing list archives

From Bill Graham <>
Subject Re: Seeing duplicate entries
Date Mon, 25 Oct 2010 23:03:31 GMT
Thanks Eric, this is helpful. I dug around in the following files and
I think I have a handle on what's happening, but I could use some
clarification.

To make sure I'm clear, let me know if this is accurate:

1. SyslogAdaptor sends syslog message byte arrays as the chunk body
bound to the dataType for that facility.

2. In the collector configs, this config says to write data to HBase only:

If I also wanted to write data to HDFS, would I just need to add
",org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter" as a
third item in the chain?
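
For reference, here's roughly what I'd expect that collector property to
look like with SeqFileWriter appended (I'm guessing at the exact property
name and at the HBaseWriter package, so please correct me if I'm off):

```xml
<!-- chukwa-collector-conf.xml: sketch only; property/class names are my guess -->
<property>
  <name>chukwa.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
</property>
```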

3. In the collector configs, all packages beneath the package
configured in hbase.demux.package would be checked for the annotated
classes (it would be useful to have this also take a comma-separated
list at some point for extensibility). What about the data being sent
indicates that the SysLog processor should be used?
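
To make sure I understand the scanning mechanism, here's a toy model of what
I imagine the collector does at startup: walk the classes under the configured
package, keep the ones carrying the table annotation, and index them by a name
that a chunk's dataType can match against. The annotation and class names here
are hypothetical stand-ins, not the actual Chukwa API:

```java
import java.lang.annotation.*;
import java.util.*;

public class DemuxScanSketch {
    // Hypothetical stand-in for Chukwa's table annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Table { String name(); String columnFamily(); }

    // Toy "processors"; in Chukwa these would live under hbase.demux.package.
    @Table(name = "SystemMetrics", columnFamily = "SysLog")
    static class SysLog {}
    static class NotAProcessor {}

    // Index annotated classes by simple name, which I assume is what lets a
    // chunk's dataType ("SysLog") select the matching processor.
    static Map<String, Class<?>> scan(List<Class<?>> candidates) {
        Map<String, Class<?>> byDataType = new HashMap<>();
        for (Class<?> c : candidates) {
            if (c.isAnnotationPresent(Table.class)) {
                byDataType.put(c.getSimpleName(), c);
            }
        }
        return byDataType;
    }

    public static void main(String[] args) {
        Map<String, Class<?>> m = scan(Arrays.asList(SysLog.class, NotAProcessor.class));
        Table t = m.get("SysLog").getAnnotation(Table.class);
        System.out.println(t.name() + "/" + t.columnFamily()); // SystemMetrics/SysLog
    }
}
```

If that's roughly right, then the answer to my question would be that the
dataType on the chunk is what selects the processor, and the annotation only
controls where its output lands.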

4. The collector via HBaseWriter writes the data to the
SystemMetrics/SysLog table/family in HBase per the annotations.
Looking at OutputCollector it appears the following data is set:

 - key is taken as the '[source]-[ts]' from the ChukwaRecordKey
 - column family seems to be taken as the reduceType (i.e. dataType),
but I thought that was set by the annotation in SysLog. Which is it?
 - column name/value is every field name and value in the ChukwaRecord.

This last part is throwing me off though, since I can't see where
field names and values are set on your ChukwaRecord. Can you clarify?
It seems like the record was just the entire byte array payload of the
syslog message.
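
To pin down my reading of item 4, here's the mapping I think is happening,
written out as a toy sketch (pure guesswork on my part, using plain maps in
place of the HBase client; the "body" field name is a made-up example):

```java
import java.util.*;

public class HBaseWriteSketch {
    // rowKey -> (family:qualifier -> value); a stand-in for HBase Puts.
    static Map<String, Map<String, String>> writeRecord(
            String source, long ts, String columnFamily, Map<String, String> fields) {
        String rowKey = source + "-" + ts;  // '[source]-[ts]' from the ChukwaRecordKey
        Map<String, String> cells = new TreeMap<>();
        // Every field name/value in the ChukwaRecord becomes one column.
        for (Map.Entry<String, String> f : fields.entrySet()) {
            cells.put(columnFamily + ":" + f.getKey(), f.getValue());
        }
        return Map.of(rowKey, cells);
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("body", "<raw syslog line>");  // hypothetical single-field record
        System.out.println(writeRecord("host1", 1288047811L, "SysLog", fields));
    }
}
```

The open question from above is whether the record carries many named fields
or just one field holding the raw payload.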

Btw, the documentation is a big help, thanks, but one bit of feedback
is that the "Configure Log4j syslog appender" section is confusing
w.r.t. which nodes you're speaking of. I assume you're talking about the
Hadoop nodes being monitored, but is there anything about this
approach that limits it to monitoring Hadoop nodes only? Either way,
it should be clarified which nodes are being discussed and which
Hadoop cluster needs to be restarted.


On Sat, Oct 23, 2010 at 8:34 PM, Eric Yang <> wrote:
> Yes, you are right.  It should work automatically after the annotation is
> added to his demux parser.
> regards,
> Eric
> On Sat, Oct 23, 2010 at 1:27 PM, Corbin Hoenes <> wrote:
>> +1
>> I imagine it is just another pipelinable class loaded into the collector?  If
>> so, Bill's scenario would work.
>> Sent from my iPhone
>> On Oct 23, 2010, at 12:59 PM, Bill Graham <> wrote:
>>> Eric, I'm also curious about how the HBase integration works. Do you
>>> have time to write something up on it? I'm interested in the
>>> possibility of extending what's there to write my own custom data into
>>> HBase from a collector, while said data also continues through to HDFS
>>> as it does currently.
>>> On Fri, Oct 22, 2010 at 5:21 PM, Corbin Hoenes <>
>>> wrote:
>>>> Eric in chukwa 0.5 is hbase the final store instead of hdfs?  What format
>>>> will the hbase data be in (e.g. A chukwarecord object ? Something user
>>>> configurable? )
>>>> Sent from my iPhone
>>>> On Oct 22, 2010, at 8:48 AM, Eric Yang <> wrote:
>>>>> Hi Matt,
>>>>> This is expected in Chukwa archives.  When an agent is unable to post to
>>>>> the collector, it will retry posting the same data to another
>>>>> collector, or retries with the same collector when no other collector is
>>>>> available.  A collector may have data written without a proper
>>>>> acknowledgement back to the agent in high-load situations.  The Chukwa
>>>>> philosophy is to retry until an acknowledgement is received.  Duplicate
>>>>> data is filtered after it has been received.
>>>>> The duplication filtering in Chukwa 0.3.0 depends on data being loaded
>>>>> into MySQL.  The same primary key will update the same row, which removes
>>>>> duplicates.  It is possible to build a duplication-detection process
>>>>> prior to demux which filters data based on sequence id + data type +
>>>>> csource (host), but this hasn't been implemented because the primary-key
>>>>> update method works well for my use case.
>>>>> In Chukwa 0.5, we treat duplication the same as in Chukwa 0.3:
>>>>> any duplicated row in HBase is replaced, based on Timestamp +
>>>>> HBase row key.
>>>>> regards,
>>>>> Eric
>>>>> On Thu, Oct 21, 2010 at 8:22 PM, Matt Davies <>
>>>>> wrote:
>>>>>> Hey everyone,
>>>>>> I have a situation where I'm seeing duplicated data downstream before
>>>>>> the
>>>>>> demux process. It appears this happens during high system loads and
>>>>>> are
>>>>>> still using the 0.3.0 series.
>>>>>> So, we have validated that there is a single, unique entry in our
>>>>>> source
>>>>>> file which then shows up a random amount of times before we see it
>>>>>> demux.
>>>>>> So, it appears that there is duplication happening somewhere between
>>>>>> the
>>>>>> agent and collector.
>>>>>> Has anyone else seen this? Any ideas as to why we are seeing this
>>>>>> during
>>>>>> high system loads, but not during lower loads.
>>>>>> TIA,
>>>>>> Matt
