Thanks for answering.
I already saw that, but it only does a raw store of the logs. I would like to store them in a "smart way": I mean I'd like to store the logs so that I can use the information contained in them.
If I have rows like (date action/event/id_ad/id_transac):
1 - 2012-02-17 18:22:09 track/display/4/70
2 - 2012-02-17 18:22:09 track/display/2/70
3 - 2012-02-17 18:22:09 track/display/3/70
4 - 2012-02-17 18:22:29 track/start/3/70
5 - 2012-02-17 18:22:39 track/firstQuartile/3/70
6 - 2012-02-17 18:22:46 track/midpoint/3/70
7 - 2012-02-17 18:22:53 track/complete/3/70
8 - 2012-02-17 18:23:02 track/click/3/70
I would like to process these logs and store the following in Cassandra:
1 - increment the display counter for ad 4, look up the transac with id "70" in my database to get its id_product (let's say it's 19), and then increment the display counter for product 19. I would also store the raw event, e.g. event1: (event => display, ad => 4, transac => 70 ...)
2 - ...
7 - ...
8 - increment the click counter for ad 3, look up the transac with id "70" in my database to get its id_product (let's say it's 19), and then increment the click counter for product 19. I would also store the raw event, e.g. event8: (event => click, ad => 3, transac => 70 ...), and update the status of the transaction to a "finish" state.
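To make the processing I have in mind more concrete, here is a minimal sketch in plain Python. It does not use Flume or Cassandra at all: the counters, the raw event list and the transac-to-product lookup (`TRANSAC_TO_PRODUCT`) are hypothetical in-memory stand-ins for what would live in the database, and the function names are only for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical stand-in for the lookup "transac id -> id_product" in the database.
TRANSAC_TO_PRODUCT = {"70": "19"}


def parse_line(line):
    """Parse 'YYYY-MM-DD HH:MM:SS action/event/id_ad/id_transac' into a dict."""
    date_part, time_part, path = line.split()
    _action, event, id_ad, id_transac = path.split("/")
    return {
        "ts": datetime.strptime(date_part + " " + time_part, "%Y-%m-%d %H:%M:%S"),
        "event": event,
        "ad": id_ad,
        "transac": id_transac,
    }


def process(lines):
    """Return (ad counters, product counters, raw events, finished transacs)."""
    ad_counters = Counter()       # keyed by (id_ad, event)
    product_counters = Counter()  # keyed by (id_product, event)
    raw_events = []
    finished = set()
    for line in lines:
        ev = parse_line(line)
        product = TRANSAC_TO_PRODUCT[ev["transac"]]
        ad_counters[(ev["ad"], ev["event"])] += 1
        product_counters[(product, ev["event"])] += 1
        raw_events.append(ev)
        # As in row 8 above: a click moves the transaction to "finish".
        if ev["event"] == "click":
            finished.add(ev["transac"])
    return ad_counters, product_counters, raw_events, finished
```

In a real sink, each counter update would become a counter increment in Cassandra and each raw event an insert; the sketch only shows which updates are derived from each log row.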
I want really custom behaviour, so I guess I'll have to build a specific Flume sink (or does a generic, configurable sink already exist somewhere?).
Or should I use the flume-cassandra-plugin and process the data once it has been stored raw? In that case, how can I be sure that I have processed all the data, and that it happens in real time or near real time? Would this perform well?
I hope you can understand what I just wrote; it's not very simple, and I'm not fluent in English. Don't hesitate to ask for more explanation.
The final goal of all this is to have statistics in near real time, on the same cluster as the OLTP workload, which is critical to us. During rush hours the real-time statistics have to be slowed down (becoming near real-time stats) so that the business side stays fully performant.
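As a rough illustration of what "slowing the statistics" could mean, here is a small Python sketch. Everything in it is an assumption made for the example: the function name, the idea of using an observed write latency as the load signal, and the threshold values. It is not something Flume or Cassandra provides.

```python
def adjust_delay(delay_s, latency_ms, threshold_ms=20.0, min_s=1.0, max_s=300.0):
    """Return the next delay between statistics writes.

    latency_ms is a hypothetical load signal, e.g. the latency observed
    on recent business (OLTP) writes.
    """
    if latency_ms > threshold_ms:
        # Cluster looks busy: back off, up to max_s.
        return min(max_s, delay_s * 2)
    # Cluster looks idle: speed up again, down to min_s.
    return max(min_s, delay_s / 2)
```

With these example values, the delay doubles while the cluster is under pressure and halves again once it recovers, so the statistics drift from real time to near real time only during rush hours.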
2012/2/10 aaron morton <firstname.lastname@example.org>
How can I do it? Do I need to build a custom plugin/sink, or can I configure an existing sink to write data in a custom way?
This is a good starting point https://github.com/thobbs/flume-cassandra-plugin
2 - My business process also uses my Cassandra DB (without Flume, directly via Thrift). How can I ensure that log writing won't overload my database and introduce latency in my business process?
Any time you have a data stream you don't control, it's a good idea to put some sort of buffer between the outside world and the database. Flume has a buffered sink; I think you can subclass it and aggregate the counters for a minute or two: http://archive.cloudera.com/cdh/3/flume/UserGuide/#_buffered_sink_and_decorator_semantics
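A minimal sketch of that aggregation idea, in plain Python rather than as a real Flume sink subclass. The class name and the `flush` callback are hypothetical; in practice the callback would issue one counter increment per key against Cassandra.

```python
import time
from collections import Counter


class BufferedCounterSink:
    """Aggregate increments in memory and write them out in one batch."""

    def __init__(self, flush, interval_s=60.0, clock=time.monotonic):
        self._flush = flush            # callable receiving a dict {key: delta}
        self._interval_s = interval_s  # how long to aggregate before writing
        self._clock = clock            # injectable so the sketch can be tested
        self._pending = Counter()
        self._last_flush = clock()

    def increment(self, key, delta=1):
        self._pending[key] += delta
        if self._clock() - self._last_flush >= self._interval_s:
            self.flush()

    def flush(self):
        # Write only when there is something to write, then restart the timer.
        if self._pending:
            self._flush(dict(self._pending))
            self._pending.clear()
        self._last_flush = self._clock()
```

The point of the design is that many events for the same ad or product within the interval collapse into a single write, so the load on the database depends on the number of distinct keys rather than on the raw event rate.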
Hope that helps.
On 10/02/2012, at 4:27 AM, Alain RODRIGUEZ wrote:
1 - I would like to generate some statistics and store some raw events from log files tailed with Flume. I saw some plugins providing Cassandra sinks, but I would like to store data in a custom way: storing raw data but also incrementing counters to get near real-time statistics. How can I do it? Do I need to build a custom plugin/sink, or can I configure an existing sink to write data in a custom way?
2 - My business process also uses my Cassandra DB (without Flume, directly via Thrift). How can I ensure that log writing won't overload my database and introduce latency in my business process? I mean, is there a way to manage the throughput sent by Flume's tails and slow them down when my Cassandra cluster is overloaded? I would like to avoid building two separate clusters.