incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Flume and Cassandra
Date Wed, 22 Feb 2012 18:55:55 GMT
Maybe Storm is what you are looking for (as well as Flume to get the messages off the network):
http://www.datastax.com/events/cassandranyc2011/presentations/marz

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 22/02/2012, at 2:23 AM, Alain RODRIGUEZ wrote:

> Thanks for answering.
> 
> "This is a good starting point https://github.com/thobbs/flume-cassandra-plugin "
> 
> I already saw that, but it only does a raw store of the logs. I would like to store them in a "smart way": I mean I'd like to store the logs so that I can use the information contained in them.
> 
> If I have rows like this (date action/event/id_ad/id_transac):
> 
> 1 - 2012-02-17 18:22:09 track/display/4/70
> 2 - 2012-02-17 18:22:09 track/display/2/70
> 3 - 2012-02-17 18:22:09 track/display/3/70
> 4 - 2012-02-17 18:22:29 track/start/3/70
> 5 - 2012-02-17 18:22:39 track/firstQuartile/3/70
> 6 - 2012-02-17 18:22:46 track/midpoint/3/70
> 7 - 2012-02-17 18:22:53 track/complete/3/70
> 8 - 2012-02-17 18:23:02 track/click/3/70
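[A minimal Java sketch of parsing one of those lines into its fields; the class and field names are assumptions for illustration, not from any existing plugin:]

    // Splits a log line of the form "2012-02-17 18:22:09 track/display/4/70"
    // into (timestamp, action, event, adId, transacId).
    public final class TrackingEvent {
        public final String timestamp; // "2012-02-17 18:22:09"
        public final String action;    // "track"
        public final String event;     // "display", "start", "click", ...
        public final String adId;      // "4"
        public final String transacId; // "70"

        private TrackingEvent(String timestamp, String action, String event,
                              String adId, String transacId) {
            this.timestamp = timestamp;
            this.action = action;
            this.event = event;
            this.adId = adId;
            this.transacId = transacId;
        }

        public static TrackingEvent parse(String line) {
            int lastSpace = line.lastIndexOf(' '); // date and time contain the first space
            String timestamp = line.substring(0, lastSpace);
            String[] path = line.substring(lastSpace + 1).split("/");
            return new TrackingEvent(timestamp, path[0], path[1], path[2], path[3]);
        }
    }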
> 
> I would like to process these logs and store the results in Cassandra:
> 
> 1 - increment the display counter for ad 4, find the transaction with id "70" in my database to get the id_product (let's say it's 19), and then increment the display counter for product 19. I would also store raw data like event1: (event => display, ad => 4, transac => 70 ...)
> 
> 2 - ...
> ...
> 
> 7 - ...
> 
> 8 - increment the click counter for ad 3, find the transaction with id "70" in my database to get the id_product (let's say it's 19), and then increment the click counter for product 19. I would also store raw data like event8: (event => click, ad => 3, transac => 70 ...) and update the status of the transaction to a "finished" state.
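[A rough sketch of that per-event processing, using the Hector client's counter support. The column families (ad_counters, product_counters, raw_events), the key layout, and lookupProductId are all hypothetical, and the exact Hector signatures should be treated as assumptions:]

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class EventProcessor {
        private static final StringSerializer SS = StringSerializer.get();
        private final Keyspace keyspace;

        public EventProcessor(Keyspace keyspace) {
            this.keyspace = keyspace;
        }

        public void process(TrackingEvent e) {
            // 1. Increment the per-ad counter for this event type.
            Mutator<String> counters = HFactory.createMutator(keyspace, SS);
            counters.incrementCounter("ad:" + e.adId, "ad_counters", e.event, 1L);

            // 2. Find the transaction to get the product id, then increment
            //    the per-product counter (lookupProductId stands in for your
            //    own read against the transactions data).
            String productId = lookupProductId(e.transacId);
            counters.incrementCounter("product:" + productId, "product_counters", e.event, 1L);
            counters.execute();

            // 3. Store the raw event as a regular (non-counter) row,
            //    in a separate mutation from the counter increments.
            Mutator<String> raw = HFactory.createMutator(keyspace, SS);
            String rowKey = "event:" + e.timestamp + ":" + e.transacId;
            raw.addInsertion(rowKey, "raw_events", HFactory.createStringColumn("event", e.event));
            raw.addInsertion(rowKey, "raw_events", HFactory.createStringColumn("ad", e.adId));
            raw.addInsertion(rowKey, "raw_events", HFactory.createStringColumn("transac", e.transacId));
            raw.execute();
        }

        private String lookupProductId(String transacId) {
            // Placeholder: read your transactions data for this id.
            return "19";
        }
    }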
> 
> I want a really custom behaviour, so I guess I'll have to build a specific Flume sink (or is there an existing generic, configurable sink somewhere?).
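[For the "specific flume sink" route, a bare skeleton against the old (pre-NG) Flume EventSink API, the same base class the flume-cassandra-plugin extends; the exact signatures are from memory and the plugin-registration plumbing (SinkBuilder, flume.plugin.classes) is omitted:]

    import java.io.IOException;
    import com.cloudera.flume.core.Event;
    import com.cloudera.flume.core.EventSink;

    public class SmartCassandraSink extends EventSink.Base {
        private EventProcessor processor; // from the sketch above

        @Override
        public void open() throws IOException {
            // Connect to the Cassandra cluster and create the EventProcessor here.
        }

        @Override
        public void append(Event e) throws IOException {
            // Each Flume event body is one tailed log line.
            String line = new String(e.getBody());
            processor.process(TrackingEvent.parse(line));
        }

        @Override
        public void close() throws IOException {
            // Release the Cassandra connections here.
        }
    }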
> 
> Or maybe I should use the flume-cassandra-plugin and process the data after it has been stored raw? In that case, how can I be sure that I have processed all the data, and that I am doing it in real time or near real time? Is this approach performant?
> 
> I hope you'll understand what I just wrote; it's not very simple, and I'm not fluent in English. Don't hesitate to ask for more explanation.
> 
> The final goal of all this is to have statistics in near real time, on the same cluster as the OLTP workload, which is critical to us. The real-time statistics have to be throttled (becoming near-real-time stats) during rush hours so that the business part stays fully performant.
> 
> Alain
> 
> 2012/2/10 aaron morton <aaron@thelastpickle.com>
>> How do I do it? Do I need to build a custom plugin/sink, or can I configure an existing sink to write data in a custom way?
> This is a good starting point https://github.com/thobbs/flume-cassandra-plugin
> 
>> 2 - My business process also uses my Cassandra DB (without Flume, directly via Thrift); how do I ensure that log writing won't overload my database and introduce latency into my business process?
> Any time you have a data stream you don't control, it's a good idea to put some sort of buffer between the outside world and the database. Flume has a buffered sink; I think you can subclass it and aggregate the counters for a minute or two: http://archive.cloudera.com/cdh/3/flume/UserGuide/#_buffered_sink_and_decorator_semantics
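[A minimal sketch of that aggregation idea, independent of any particular Flume class: buffer counter deltas in memory and flush them on a timer, so a burst of identical events costs one counter write instead of one per event. writeCounter is a placeholder for the actual Cassandra increment:]

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class CounterBuffer {
        private final ConcurrentHashMap<String, AtomicLong> deltas = new ConcurrentHashMap<>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public CounterBuffer(long flushPeriodSeconds) {
            scheduler.scheduleAtFixedRate(this::flush, flushPeriodSeconds,
                    flushPeriodSeconds, TimeUnit.SECONDS);
        }

        // Called once per event: a cheap in-memory increment.
        public void increment(String counterKey) {
            deltas.computeIfAbsent(counterKey, k -> new AtomicLong()).incrementAndGet();
        }

        // Called by the scheduler: one Cassandra write per distinct key per period.
        private void flush() {
            for (Map.Entry<String, AtomicLong> e : deltas.entrySet()) {
                long delta = e.getValue().getAndSet(0L);
                if (delta > 0) {
                    writeCounter(e.getKey(), delta);
                }
            }
        }

        private void writeCounter(String key, long delta) {
            // e.g. mutator.incrementCounter(key, "ad_counters", "display", delta);
        }
    }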
> 
> Hope that helps. 
> A
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 10/02/2012, at 4:27 AM, Alain RODRIGUEZ wrote:
> 
>> Hi,
>> 
>> 1 - I would like to generate some statistics and store some raw events from log files tailed with Flume. I saw some plugins providing Cassandra sinks, but I would like to store the data in a custom way: storing raw data but also incrementing counters to get near-real-time statistics. How do I do it? Do I need to build a custom plugin/sink, or can I configure an existing sink to write data in a custom way?
>> 
>> 2 - My business process also uses my Cassandra DB (without Flume, directly via Thrift); how do I ensure that log writing won't overload my database and introduce latency into my business process? I mean, is there a way to manage the throughput sent by Flume's tails and slow them down when my Cassandra cluster is overloaded? I would like to avoid building 2 separate clusters.
>> 
>> Thank you,
>> 
>> Alain
>> 
> 
> 

