incubator-chukwa-user mailing list archives

From Eric Yang <>
Subject Re: Seeing duplicate entries
Date Tue, 26 Oct 2010 00:28:15 GMT
Hi Bill,

On 10/25/10 4:03 PM, "Bill Graham" <> wrote:

> Thanks Eric, this is helpful. I dug around in the following files and
> I think I have a handle on what's happening but I can use some
> clarifications:
> oahc.datacollection.adaptor.SyslogAdaptor
> oahc.extraction.demux.processor.mapper.SysLog
> oahc.datacollection.writer.hbase.OutputCollector
> conf/hbase.schema
> conf/chukwa-collector-conf.xml.template
> To make sure I'm clear, let me know if this is accurate:
> 1. SyslogAdaptor sends syslog message byte arrays as the chunk body
> bound to the dataType for that facility.

Yes.  A syslog message looks like this:

<142>This is a log entry

The number in angle brackets is the priority value, which encodes both the
facility and the severity: priority = facility number * 8 + severity.  Here
142 = 17 * 8 + 6, i.e. facility LOCAL1 with severity 6.  Syslog defines 24
facilities, and SyslogAdaptor maps them to data types that make sense to
Chukwa.  For example, when a message with facility LOCAL1 arrives,
SyslogAdaptor looks up the mapping for the SyslogAdaptor running on port
9095; if facility LOCAL1 maps to HADOOP, the chunk data is stamped as HADOOP
for demux.  This mapping is added in chukwa-agent-conf.xml.
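
The priority arithmetic can be sketched in a few lines of Python.  This is
not the agent configuration and not SyslogAdaptor's actual code, only an
illustration of the standard syslog rule priority = facility * 8 + severity.
The facility labels, the FACILITY_TO_DATA_TYPE table and the decode()
function are names made up for this sketch; the single LOCAL1 -> HADOOP entry
mirrors the example in this message.

```python
import re

# Short labels for the 24 syslog facilities, in numeric order (0..23).
# The labels are chosen for this sketch; only their order matters.
FACILITIES = [
    "KERN", "USER", "MAIL", "DAEMON", "AUTH", "SYSLOG", "LPR", "NEWS",
    "UUCP", "CRON", "AUTHPRIV", "FTP", "NTP", "AUDIT", "ALERT", "CLOCK",
    "LOCAL0", "LOCAL1", "LOCAL2", "LOCAL3",
    "LOCAL4", "LOCAL5", "LOCAL6", "LOCAL7",
]

# Hypothetical facility -> Chukwa data type table for one adaptor port.
FACILITY_TO_DATA_TYPE = {"LOCAL1": "HADOOP"}

PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def decode(message):
    """Split a syslog line into (facility, severity, data_type, body).

    data_type is None when the facility has no entry in the table.
    """
    match = PRI_PATTERN.match(message)
    if match is None:
        raise ValueError("message has no <priority> prefix")
    priority = int(match.group(1))
    # priority = facility * 8 + severity, so divmod recovers both parts.
    facility_number, severity = divmod(priority, 8)
    if facility_number >= len(FACILITIES):
        raise ValueError("priority out of range: %d" % priority)
    facility = FACILITIES[facility_number]
    data_type = FACILITY_TO_DATA_TYPE.get(facility)
    return facility, severity, data_type, match.group(2)


print(decode("<142>This is a log entry"))
```

For the sample line, 142 splits into facility number 17 (LOCAL1) and
severity 6, so the chunk would be stamped with the HADOOP data type.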


> 2. In the collector configs, this config says to write data to HBase only:
> <property>
> <name>chukwaCollector.pipeline</name>
> <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
> </property>
> If I also wanted to write data to HDFS, would I just need to add
> ",org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter" as a
> third item in the chain?

Yes.  Make sure writerClass is configured to use PipelineStageWriter.


> 3. In the collector configs, all packages beneath the package
> configured in hbase.demux.package would be checked for the annotated
> classes (it would be useful to have this also take a comma-separated
> list at some point for extensibility). What about the data being sent
> indicates that the SysLog processor should be used?

HBaseWriter reads chukwa-demux-conf.xml if it is available in the collector's
conf directory.  Hence, the mapping of data types to parsers is the same as
for demux on HDFS.

> 4. The collector via HBaseWriter writes the data to the
> SystemMetrics/SysLog table/family in HBase per the annotations.
> Looking at OutputCollector it appears the following data is set:
>  - key is taken as the '[source]-[ts]' from the ChukwaRecordKey
>  - column family seems to be taken as the reduceType (i.e. dataType),
> but I thought that was set by the annotation in SysLog. Which is it?
>  - column name/value is every field name and value in the ChukwaRecord.
> This last part is throwing me off though, since I can't see where
> field names and values are set on your ChukwaRecord. Can you clarify?
> It seems like the record was just the entire byte array payload of the
> syslog message.

The column family is currently set to the reduceType.  The column annotation
does nothing at the moment.  In the future, it would be nice to have the
reduce type map to the annotation, but that would mean more ORM entity bean
code in the demux process, and I am not sure that is something we want
Chukwa to do.  It would be nicer to have Apache Gora handle ORM for HBase, so
that Chukwa doesn't detour from its original objective.

SystemMetrics writes to the SystemMetrics table.  Hadoop logs that are
streamed through SyslogAdaptor are mapped to HADOOP.  I have not tested the
HADOOP parser to see whether Hadoop log processing works; this is on my TODO
list.  In theory, it should work.  ;)

The annotation in SyslogAdaptor only defines which data type the data is; it
does not define which parser processes the data.  That is done by the demux
configuration.  I think the default behavior for mapping data types to demux
parsers probably threw you off and led you to assume that data is processed
by oahc.extraction.demux.processor.mapper.SysLog.  Instead, you need to make
sure the agent configuration maps the facility name to the data type of your
choice, and configure demux to invoke the proper parser.  Say you are
sending /var/log/messages with SyslogAdaptor, the facility name is mapped to
SysLog, and the demux configuration maps SysLog to the SysLog parser.  Logs
will then appear in the HBase table SystemMetrics, in the SysLog column
family, with a column called "body" that contains all your log entries.
buildGenericRecord creates a default record with a body field.
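
As a rough illustration of the layout described above (not the actual
OutputCollector code), the sketch below builds the row for one generic
record.  The '[source]-[ts]' row key follows Bill's reading of
OutputCollector quoted above; to_hbase_row(), the host name and the
timestamp are made up for the example.

```python
def to_hbase_row(table, source, timestamp_ms, reduce_type, fields):
    """Lay out one demux record as a single HBase row (illustrative only).

    Assumptions for this sketch: the row key is '<source>-<timestamp>',
    the column family is the reduce type, and every record field becomes
    one column in that family.
    """
    return {
        "table": table,
        "row_key": "%s-%d" % (source, timestamp_ms),
        "family": reduce_type,
        "columns": {
            "%s:%s" % (reduce_type, name): value
            for name, value in fields.items()
        },
    }


# A generic record carries the whole log line in a single "body" field.
row = to_hbase_row(
    table="SystemMetrics",
    source="host1.example.com",      # hypothetical host
    timestamp_ms=1288052895000,      # hypothetical timestamp
    reduce_type="SysLog",
    fields={"body": "<142>This is a log entry"},
)
print(row)
```

With a parser that extracts individual fields instead of using the generic
record, the same layout would simply yield more columns in the family.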

There is some cleanup work to do to decouple the entity bean from our
parser; then demux will look nice and neat.  We should change the
serialization of ChukwaRecord to Avro; then it will make a lot more sense
and be easier to annotate columns.  For now, I have only got the bare
minimum working.

> Btw, the documentation is a big help, thanks, but one bit of feedback
> is that the "Configure Log4j syslog appender" section is confusing
> w.r.t. what nodes you're speaking of. I assume you're talking about the
> Hadoop nodes being monitored, but is there anything about this
> approach that limits this to monitoring Hadoop nodes only? Either way,
> which nodes are being discussed and which Hadoop cluster needs to be
> rebooted should be clarified.

Any log file written by SyslogAppender can be streamed over to
SyslogAdaptor.  The only two required pieces are to write a demux parser
that can process your log file, and to map the facility name to that demux
parser.  For Hadoop, the modification should be applied to all nodes
(namenode, jobtracker, datanode, tasktracker, secondary namenode), so that
all logs can be streamed over and processed.  However, that is a lot of
data, and the current Chukwa parsers are not written to pick up all the
details.  When the configuration is changed, you will need to restart the
cluster for the changes to take effect.  Hope this helps.


> thanks,
> Bill
> On Sat, Oct 23, 2010 at 8:34 PM, Eric Yang <> wrote:
>> Yes, you are right.  It should work automatically after annotation is
>> added to his demux parser.
>> regards,
>> Eric
>> On Sat, Oct 23, 2010 at 1:27 PM, Corbin Hoenes <>
>> wrote:
>>> +1
>>> I imagine it is just another pipelinable class loaded into the collector?  If
>>> so, Bill's scenario would work.
>>> Sent from my iPhone
>>> On Oct 23, 2010, at 12:59 PM, Bill Graham <> wrote:
>>>> Eric, I'm also curious about how the HBase integration works. Do you
>>>> have time to write something up on it? I'm interested in the
>>>> possibility of extending what's there to write my own custom data into
>>>> HBase from a collector, while said data also continues through to HDFS
>>>> as it does currently.
>>>> On Fri, Oct 22, 2010 at 5:21 PM, Corbin Hoenes <>
>>>> wrote:
>>>>> Eric, in Chukwa 0.5 is HBase the final store instead of HDFS?  What format
>>>>> will the HBase data be in (e.g. a ChukwaRecord object? Something user
>>>>> configurable?)
>>>>> Sent from my iPhone
>>>>> On Oct 22, 2010, at 8:48 AM, Eric Yang <> wrote:
>>>>>> Hi Matt,
>>>>>> This is expected in Chukwa archives.  When an agent is unable to post
>>>>>> to the collector, it retries posting the same data to another
>>>>>> collector, or retries with the same collector when no other collector
>>>>>> is available.  Under high load, a collector may have written the data
>>>>>> without sending a proper acknowledgement back to the agent.  The
>>>>>> Chukwa philosophy is to retry until an acknowledgement is received.
>>>>>> Duplicate data is filtered after the data has been received.
>>>>>> Duplicate filtering in Chukwa 0.3.0 depends on loading the data into
>>>>>> MySQL.  Rows with the same primary key update the same row, which
>>>>>> removes duplicates.  It is possible to build a duplicate detection
>>>>>> process prior to demux that filters data based on sequence id + data
>>>>>> type + csource (host), but this hasn't been implemented because the
>>>>>> primary key update method works well for my use case.
>>>>>> In Chukwa 0.5, we treat duplicates the same way: any duplicated row
>>>>>> in HBase is replaced, based on the timestamp and HBase row key.
>>>>>> regards,
>>>>>> Eric
>>>>>> On Thu, Oct 21, 2010 at 8:22 PM, Matt Davies <>
>>>>>> wrote:
>>>>>>> Hey everyone,
>>>>>>> I have a situation where I'm seeing duplicated data downstream the
>>>>>>> demux process.  It appears this happens during high system loads, and
>>>>>>> we are still using the 0.3.0 series.
>>>>>>> So, we have validated that there is a single, unique entry in the
>>>>>>> source file, which then shows up a random number of times before we
>>>>>>> see it in demux.
>>>>>>> So, it appears that there is duplication happening somewhere between
>>>>>>> the agent and collector.
>>>>>>> Has anyone else seen this?  Any ideas as to why we are seeing this
>>>>>>> during high system loads, but not during lower loads?
>>>>>>> TIA,
>>>>>>> Matt
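
The retry-until-acknowledged behaviour discussed in the quoted thread, and
why a keyed store hides the resulting duplicates, can be sketched as below.
This is a toy model, not Chukwa code: send_until_acked(), the in-memory
archive list and the table dict are all assumptions made for the
illustration.  The key of sequence id + data type + source follows the
fields named in the thread.

```python
def send_until_acked(chunk, collector_write, acks):
    """Resend a chunk until an acknowledgement arrives.

    acks yields one boolean per attempt: True means the agent received
    the acknowledgement for that attempt.  The collector writes the chunk
    on every attempt, even when the acknowledgement is lost.
    """
    attempts = 0
    for acked in acks:
        collector_write(chunk)
        attempts += 1
        if acked:
            return attempts
    raise RuntimeError("no acknowledgement received")


archive = []  # append-only sink: every write adds an entry
table = {}    # keyed sink: a repeated key overwrites the same row


def write_both(chunk):
    archive.append(chunk)
    key = (chunk["seq_id"], chunk["data_type"], chunk["source"])
    table[key] = chunk


chunk = {
    "seq_id": 1024,
    "data_type": "SysLog",
    "source": "host1.example.com",  # hypothetical host
    "body": "<142>This is a log entry",
}

# Two acknowledgements are lost under load; the third attempt succeeds.
attempts = send_until_acked(chunk, write_both, acks=[False, False, True])
print(attempts, len(archive), len(table))
```

The append-only sink ends up with one copy per attempt, while the keyed sink
holds a single row, which is the effect the primary key and row key updates
rely on.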
