chukwa-user mailing list archives

From Eric Yang <eric...@gmail.com>
Subject Re: piping data into Cassandra
Date Tue, 01 Nov 2011 16:46:20 GMT
Hi AD,

Glad it works for you.  If you are interested in contributing what you currently have for the
Cassandra writer, feel free to post your code as a patch in a JIRA (http://issues.apache.org/jira/browse/CHUKWA).
With a license grant of your work, it will be easier for the open source community to enhance
it. :)

regards,
Eric

On Oct 31, 2011, at 8:42 PM, AD wrote:

> Eric,
> 
> Just wanted to say thank you for all your help.  Everything is now working perfectly
> with a custom demux parser for my Apache log format and a custom writer (copied your HBase
> writer as a starting point) into Cassandra.
> 
> To the broader community: if anyone is interested in helping refactor the Cassandra
> writer into a real, working add-in for Chukwa, I am happy to share what I have so far.  I am
> not a Java developer by trade, so I am sure some tweaks are needed, but I am happy to give
> back to the Chukwa community to help advance its capabilities.
> 
> Cheers,
> AD
> 
> On Sun, Oct 30, 2011 at 4:18 PM, Eric Yang <eric818@gmail.com> wrote:
> Sounds about right.  For mapping multiple lines to one chunk, you can modify Apache rotatelogs
> to add a Control-A character to the end of each log entry, and use the UTF8 File Tailing Adaptor
> to read the file.  That way, each log entry lands in a single chunk rather than being split
> across multiple chunks.  Hope this helps.
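> 
> Off the top of my head, the agent-side registration in initial_adaptors would look something
> like the line below (double-check the adaptor class name and field order against the r0.4.0
> agent docs; the dataType field is what demux matches on later, and the path is just a placeholder):
> 
>   add filetailer.CharFileTailingAdaptorUTF8 TsProcessor /var/log/httpd/access_log 0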
> 
> regards,
> Eric
> 
> On Oct 29, 2011, at 7:42 PM, AD wrote:
> 
> > Ok, that's what I thought.  I have been trying to backtrack through the code.  The
> > one thing I can't figure out: currently a chunk can be multiple lines in a logfile (for
> > example), and I can't see how to get the parser to return multiple "rows" for a single
> > chunk to be inserted back into HBase.
> >
> > From backtracking, it looks like the plan would be (for parsing Apache logs):
> >
> > 1 - Set collector.pipeline to include org.apache.hadoop.chukwa.datacollection.writer.cassandra.CassandraWriter.
> > It needs two methods, add and init.
> > 2 - Write a new parser (ApacheLogProcessor) in chukwa.extraction.demux.processor.mapper
> > that extends AbstractProcessor and parses the Chunk (logfile entries).
> > 3 - The "add" method of CassandraWriter calls processor.process on the ApacheLogProcessor
> > with a single chunk from an array of chunks.
> > 4 - The ApacheLogProcessor parse method (called from process) does a record.add("logfield","logvalue")
> > for each field in the Apache log, in addition to the buildGenericRecord call (a standalone
> > parsing sketch follows this list).
> > 5 - ApacheLogProcessor calls output.collect (may need to write a custom OutputCollector)
> > for a single row in Cassandra, setting up the "put".
> > 6 - The "add" method of CassandraWriter does a put of the new rowkey.
> > 7 - Loop to the next chunk.
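> >
> > For step 4, here is a standalone sketch of the field extraction I have in mind.  The field
> > names are hypothetical, and a plain HashMap stands in for the Chukwa record (record.add
> > would receive the same field/value pairs):
> >
> > import java.util.HashMap;
> > import java.util.Map;
> > import java.util.regex.Matcher;
> > import java.util.regex.Pattern;
> >
> > // Standalone sketch: split one Apache combined-log entry into the named
> > // fields that record.add("logfield", "logvalue") would receive.
> > public class ApacheLogParseSketch {
> >   // combined format: host ident user [time] "request" status bytes "referer" "agent"
> >   private static final Pattern COMBINED = Pattern.compile(
> >       "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");
> >   private static final String[] FIELDS = { "remote_host", "ident", "user",
> >       "time", "request", "status", "bytes", "referer", "agent" };
> >
> >   public static Map<String, String> parse(String line) {
> >     Matcher m = COMBINED.matcher(line);
> >     if (!m.matches()) {
> >       return null; // not a combined-format entry
> >     }
> >     Map<String, String> record = new HashMap<String, String>();
> >     for (int i = 0; i < FIELDS.length; i++) {
> >       record.put(FIELDS[i], m.group(i + 1)); // stand-in for record.add(...)
> >     }
> >     return record;
> >   }
> >
> >   public static void main(String[] args) {
> >     System.out.println(parse("127.0.0.1 - frank [10/Oct/2011:13:55:36 -0700] "
> >         + "\"GET /index.html HTTP/1.0\" 200 2326 \"http://example.com/\" \"Mozilla/4.08\""));
> >   }
> > }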
> >
> > Look about right?  If so, the only open question is how to deal with chunks that
> > span multiple lines in a logfile and map them to a single row in Cassandra.
> >
> > On Sat, Oct 29, 2011 at 9:44 PM, Eric Yang <eric818@gmail.com> wrote:
> > The demux parsers can work with either MapReduce or HBaseWriter.  If you want the
> > ability to write the parser once and have it operate with multiple data sinks, it would be
> > good to implement a version of HBaseWriter for Cassandra, keeping the parsing logic separate
> > from the loading logic.
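> >
> > As a tiny illustrative sketch (hypothetical types, not real Chukwa classes), the separation
> > would look like this:
> >
> > import java.util.Map;
> >
> > // The parser emits records through a narrow sink interface, so an
> > // HBase-backed sink and a Cassandra-backed sink are interchangeable.
> > interface RecordSink {
> >   void put(String rowKey, Map<String, String> fields);
> > }
> >
> > class ApacheLogEmitter {
> >   private final RecordSink sink;
> >   ApacheLogEmitter(RecordSink sink) { this.sink = sink; }
> >
> >   void emit(String rowKey, Map<String, String> parsedFields) {
> >     sink.put(rowKey, parsedFields); // parsing code never sees the storage client
> >   }
> > }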
> >
> > For performance reasons, it is entirely possible to implement a Cassandra loader with
> > the parsing logic built in, but that would put you on a different course from what is
> > planned for Chukwa.
> >
> > regards,
> > Eric
> >
> > On Oct 29, 2011, at 8:39 AM, AD wrote:
> >
> > > With the new imminent trunk (0.5) getting wired into HBase, does it make sense
> > > for me to keep the Demux parser as the place to put this logic for writing to Cassandra?
> > > Or does it make sense to implement a version of src/java/org/apache/hadoop/chukwa/datacollection/writer/hbase/HbaseWriter.java
> > > for Cassandra, so that the collector pushes data straight in?
> > >
> > > If I want to use both HDFS and Cassandra, it seems the current pipeline config
> > > would support this by doing something like:
> > >
> > > <property>
> > >   <name>chukwaCollector.pipeline</name>
> > >   <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.cassandra.CassandraWriter</value>
> > > </property>
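> > >
> > > (and, if I am reading the collector config right, also pointing chukwaCollector.writerClass
> > > at the pipeline writer so the pipeline actually gets used:)
> > >
> > > <property>
> > >   <name>chukwaCollector.writerClass</name>
> > >   <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
> > > </property>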
> > >
> > > Thoughts?
> > >
> > >
> > > On Wed, Oct 26, 2011 at 10:16 PM, AD <straightflush@gmail.com> wrote:
> > > Yep, that did it; I just updated my initial_adaptors to use dataType TsProcessor
> > > and saw demux kick in.
> > >
> > > Thanks for the help.
> > >
> > >
> > >
> > > On Wed, Oct 26, 2011 at 9:22 PM, Eric Yang <eric818@gmail.com> wrote:
> > > See: http://incubator.apache.org/chukwa/docs/r0.4.0/agent.html and http://incubator.apache.org/chukwa/docs/r0.4.0/programming.html
> > >
> > > The configuration is the same for collector-based demux.  Hope this helps.
> > >
> > > regards,
> > > Eric
> > >
> > > On Oct 26, 2011, at 4:20 PM, AD wrote:
> > >
> > > > Thanks.  Sorry for being dense here, but where does the data type get
> > > > mapped from the agent to the collector when passing data, so that demux will match?
> > > >
> > > > On Wed, Oct 26, 2011 at 12:34 PM, Eric Yang <eric818@gmail.com> wrote:
> > > > "dp" serves as two functions, first it loads data to mysql, second, it
runs SQL for aggregated views.  demuxOutputDir_* is created if the demux mapreduce produces
data.  Hence, make sure that there is a demux processor mapped to your data type for the extracting
process in chukwa-demux-conf.xml.
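> > > >
> > > > If I recall the 0.4 layout correctly, that mapping is just a property in
> > > > chukwa-demux-conf.xml whose name is the data type and whose value is the processor
> > > > class, along these lines:
> > > >
> > > > <property>
> > > >   <name>TsProcessor</name>
> > > >   <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.TsProcessor</value>
> > > > </property>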
> > > >
> > > > regards,
> > > > Eric
> > > >
> > > > On Oct 26, 2011, at 5:15 AM, AD wrote:
> > > >
> > > > > Hmm, I am running bin/chukwa demux and I don't have anything past
> > > > > dataSinkArchives; there is no directory named demuxOutputDir_*.
> > > > >
> > > > > Also, isn't dp an aggregate view?  I need to parse the Apache logs
> > > > > to do custom reports on things like remote_host, query strings, etc., so I was hoping
> > > > > to parse the raw record, load it into Cassandra, and run M/R there to do the aggregate
> > > > > views.  I thought a new version of TsProcessor was the right place for this, but I
> > > > > could be wrong.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > >
> > > > >
> > > > > If not, how do you write a custom postProcessor?
> > > > >
> > > > > On Wed, Oct 26, 2011 at 12:57 AM, Eric Yang <eric818@gmail.com> wrote:
> > > > > Hi AD,
> > > > >
> > > > > Data is stored in demuxOutputDir_* by demux, and there is a
> > > > > PostProcessorManager (bin/chukwa dp) which monitors the postProcess
> > > > > directory and loads data into MySQL.  For your use case, you will need
> > > > > to modify PostProcessorManager.java.  Hope this helps.
> > > > >
> > > > > regards,
> > > > > Eric
> > > > >
> > > > > On Tue, Oct 25, 2011 at 6:34 PM, AD <straightflush@gmail.com> wrote:
> > > > > > Hello,
> > > > > >
> > > > > > I currently push Apache logs into Chukwa.  I am trying to figure out how to
> > > > > > get all those logs into Cassandra and run mapreduce there.  Is the best
> > > > > > place to do this in Demux (write my own version of TsProcessor)?
> > > > > >
> > > > > > Also, the data flow seems to miss a step.  The
> > > > > > page http://incubator.apache.org/chukwa/docs/r0.4.0/dataflow.html says in
> > > > > > 3.3 that:
> > > > > > - demux moves complete files to: dataSinkArchives/[yyyyMMdd]/*/*.done
> > > > > > - the next step is to move files
> > > > > > from postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
> > > > > >
> > > > > > How do they get from dataSinkArchives to postProcess?  Does this run
> > > > > > inside of DemuxManager or a separate process (bin/chukwa demux)?
> > > > > >
> > > > > > Thanks,
> > > > > > AD
> > > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> 
> 

