incubator-cassandra-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Flume and Cassandra
Date Thu, 23 Feb 2012 03:22:29 GMT
I have been working on IronCount
(https://github.com/edwardcapriolo/IronCount/) which is designed to do
what you are talking about. Kafka takes care of the distributed
producer/consumer message queues and IronCount sets up custom
consumers to process those messages.

It might be what you are looking for. It is not as fancy as
S4/Storm/Flume, but that is supposed to be the charm of it.

On Wed, Feb 22, 2012 at 1:55 PM, aaron morton <aaron@thelastpickle.com> wrote:
> Maybe Storm is what you are looking for (as well as Flume to get the
> messages from the network):
> http://www.datastax.com/events/cassandranyc2011/presentations/marz
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 22/02/2012, at 2:23 AM, Alain RODRIGUEZ wrote:
>
> Thanks for answering.
>
> "This is a good starting point
> https://github.com/thobbs/flume-cassandra-plugin "
>
> I already saw that, but it only does a raw store of the logs. I would like
> to store them in a "smart" way, meaning I'd like to store the logs so that
> I can use the information they contain.
>
> If I have rows like these (format: date action/event/id_ad/id_transac):
>
> 1 - 2012-02-17 18:22:09 track/display/4/70
> 2 - 2012-02-17 18:22:09 track/display/2/70
> 3 - 2012-02-17 18:22:09 track/display/3/70
> 4 - 2012-02-17 18:22:29 track/start/3/70
> 5 - 2012-02-17 18:22:39 track/firstQuartile/3/70
> 6 - 2012-02-17 18:22:46 track/midpoint/3/70
> 7 - 2012-02-17 18:22:53 track/complete/3/70
> 8 - 2012-02-17 18:23:02 track/click/3/70
>
> I would like to process these logs and store the results in Cassandra:
>
> 1 - increment the display counter for ad 4, find the transac with id
> "70" in my database to get the id_product (let's say it's 19) and then
> increment the display counter for product 19. I would also store a raw
> event like event1: (event => display, ad => 4, transac => 70 ...)
>
> 2 - ...
> ...
>
> 7 - ...
>
> 8 - increment the click counter for ad 3, find the transac with id "70"
> in my database to get the id_product (let's say it's 19) and then increment
> the click counter for product 19. I would also store a raw event like
> event8: (event => click, ad => 3, transac => 70 ...) and update the status
> of the transaction to a "finished" state (see the sketch below).
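>
> To make the idea concrete, here is a minimal sketch of that per-event
> processing in Java. CounterStore and TransactionLookup are hypothetical
> interfaces standing in for whatever Cassandra client we end up using
> (Hector, raw Thrift, ...); they are not part of any existing library.
>
>     import java.util.regex.Matcher;
>     import java.util.regex.Pattern;
>
>     public class EventProcessor {
>
>         /** Hypothetical abstraction over Cassandra counter columns. */
>         interface CounterStore {
>             void increment(String counterFamily, String row, String col);
>         }
>
>         /** Hypothetical lookup of id_product from id_transac. */
>         interface TransactionLookup {
>             String productFor(String transacId);
>         }
>
>         // Matches lines such as "2012-02-17 18:22:09 track/display/4/70".
>         private static final Pattern LINE =
>             Pattern.compile("\\S+ \\S+ track/(\\w+)/(\\d+)/(\\d+)");
>
>         private final CounterStore counters;
>         private final TransactionLookup transactions;
>
>         EventProcessor(CounterStore c, TransactionLookup t) {
>             this.counters = c;
>             this.transactions = t;
>         }
>
>         void process(String logLine) {
>             Matcher m = LINE.matcher(logLine);
>             if (!m.matches()) {
>                 return; // skip unparseable lines
>             }
>             String event = m.group(1);     // display, start, click, ...
>             String adId = m.group(2);
>             String transacId = m.group(3);
>
>             // Increment the per-ad counter for this event type.
>             counters.increment("ad_counters", adId, event);
>
>             // Resolve the transaction to a product and count it there too.
>             String productId = transactions.productFor(transacId);
>             counters.increment("product_counters", productId, event);
>
>             // The raw event itself would also be inserted here (not shown).
>         }
>     }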
>
> I want really custom behaviour, so I guess I'll have to build a specific
> Flume sink (or does a generic, configurable sink already exist somewhere?).
>
> Or maybe I should use the flume-cassandra-plugin and process the data after
> it has been stored raw? In that case, how can I be sure that I have
> processed all the data, and how can I keep the processing real-time or near
> real-time? Would that perform well?
>
> I hope you'll understand what I just wrote; it's not very simple, and I'm
> not fluent in English. Don't hesitate to ask for more explanation.
>
> The final goal of all this is to have statistics in near real-time, on the
> same cluster as the OLTP workload, which is critical to us. The real-time
> statistics have to be slowed down (becoming near real-time stats) during
> rush hours so that the business side keeps full performance.
>
> Alain
>
> 2012/2/10 aaron morton <aaron@thelastpickle.com>
>>
>> How should I do it? Do I need to build a custom plugin/sink, or can I
>> configure an existing sink to write data in a custom way?
>>
>> This is a good starting
>> point https://github.com/thobbs/flume-cassandra-plugin
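>>
>> If you do end up writing your own sink, the skeleton is small. A rough
>> sketch, assuming the Flume 0.9.x EventSink API that the plugin above
>> builds on (check the exact signatures against your Flume version):
>>
>>     import java.io.IOException;
>>
>>     import com.cloudera.flume.core.Event;
>>     import com.cloudera.flume.core.EventSink;
>>
>>     public class CustomCassandraSink extends EventSink.Base {
>>
>>         @Override
>>         public void open() throws IOException {
>>             // Set up the Cassandra client (Hector / raw Thrift) here.
>>         }
>>
>>         @Override
>>         public void append(Event e) throws IOException,
>>                 InterruptedException {
>>             String line = new String(e.getBody(), "UTF-8");
>>             // Parse the line and do the custom writes: the raw event
>>             // insert plus the counter increments.
>>         }
>>
>>         @Override
>>         public void close() throws IOException {
>>             // Tear down the Cassandra client.
>>         }
>>     }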
>>
>> 2 - My business process also uses my Cassandra DB (without Flume, directly
>> via Thrift); how can I ensure that log writing won't overload my database
>> and introduce latency into my business process?
>>
>> Any time you have a data stream you don't control, it's a good idea to put
>> some sort of buffer between the outside world and the database. Flume has
>> a buffered sink; I think you can subclass it and aggregate the counters
>> for a minute or two:
>> http://archive.cloudera.com/cdh/3/flume/UserGuide/#_buffered_sink_and_decorator_semantics
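>>
>> The aggregation itself can be as simple as a map of pending increments
>> flushed on a timer. A minimal sketch; flushToCassandra() is a placeholder
>> for the real batched counter mutations:
>>
>>     import java.util.HashMap;
>>     import java.util.Map;
>>     import java.util.concurrent.Executors;
>>     import java.util.concurrent.ScheduledExecutorService;
>>     import java.util.concurrent.TimeUnit;
>>
>>     public class CounterBuffer {
>>
>>         private final Object lock = new Object();
>>         private Map<String, Long> pending = new HashMap<String, Long>();
>>
>>         /** Record an increment locally instead of writing it now. */
>>         public void increment(String counterKey) {
>>             synchronized (lock) {
>>                 Long n = pending.get(counterKey);
>>                 pending.put(counterKey, n == null ? 1L : n + 1L);
>>             }
>>         }
>>
>>         /** Flush the aggregated totals once a minute. */
>>         public void start() {
>>             ScheduledExecutorService scheduler =
>>                 Executors.newSingleThreadScheduledExecutor();
>>             scheduler.scheduleAtFixedRate(new Runnable() {
>>                 public void run() {
>>                     Map<String, Long> batch;
>>                     synchronized (lock) {
>>                         batch = pending;
>>                         pending = new HashMap<String, Long>();
>>                     }
>>                     // One counter add per key, not one per raw event.
>>                     flushToCassandra(batch);
>>                 }
>>             }, 60, 60, TimeUnit.SECONDS);
>>         }
>>
>>         /** Placeholder for the real batched counter writes. */
>>         void flushToCassandra(Map<String, Long> batch) { }
>>     }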
>>
>> Hope that helps.
>> A
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 10/02/2012, at 4:27 AM, Alain RODRIGUEZ wrote:
>>
>> Hi,
>>
>> 1 - I would like to generate some statistics and store some raw events
>> from log files tailed with Flume. I saw some plugins providing Cassandra
>> sinks, but I would like to store data in a custom way: storing raw data
>> but also incrementing counters to get near real-time statistics. How
>> should I do it? Do I need to build a custom plugin/sink, or can I
>> configure an existing sink to write data in a custom way?
>>
>> 2 - My business process also uses my Cassandra DB (without Flume, directly
>> via Thrift); how can I ensure that log writing won't overload my database
>> and introduce latency into my business process? I mean, is there a way to
>> manage the throughput sent by Flume's tails and slow them down when my
>> Cassandra cluster is overloaded? I would like to avoid building 2 separate
>> clusters.
>>
>> Thank you,
>>
>> Alain
>>
>>
>
>
