Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
From: Ipremyadav <ipremyadav@gmail.com>
Content-Type: multipart/alternative;
	boundary=Apple-Mail-8184EDC7-D611-46D8-8365-D32E65A9B2D2
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (1.0)
Subject: Re: Cassandra - Spark - Flume: best architecture for log analytics.
Message-Id: <2ED3F552-F20F-4B7A-AD26-1E4AF4828FAE@gmail.com>
Date: Thu, 23 Jul 2015 08:51:25 +0100
References: <55B02FA8.4090906@gmail.com>
 <CA+PAuzcpDdDe_JQcEGP2o3c7a-Su1Bvfi3z=a=PYKZsmY7YONw@mail.gmail.com>
In-Reply-To: 
 <CA+PAuzcpDdDe_JQcEGP2o3c7a-Su1Bvfi3z=a=PYKZsmY7YONw@mail.gmail.com>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>


--Apple-Mail-8184EDC7-D611-46D8-8365-D32E65A9B2D2
Content-Type: text/plain;
	charset=us-ascii
Content-Transfer-Encoding: quoted-printable

Though DSE Cassandra comes with Hadoop integration, this is clearly a use c=
ase for Hadoop.=20
Any reason why Cassandra is your first choice?


> On 23 Jul 2015, at 6:12 a.m., Pierre Devops <pierredevops@gmail.com> wrote=
:
>=20
> Cassandra is not very good at massive or bulk reads when you need to ret=
rieve and compute over a large amount of data on multiple machines using s=
omething like Spark or Hadoop (or you'll need to process the SSTables dire=
ctly, which is not natively supported, so you'll have to hack your way).
>=20
> However, it's very good at storing and retrieving data once it has been =
processed and sorted. That's why I would opt for solution 2), or for anoth=
er solution that processes data before inserting it into Cassandra and doe=
sn't use Cassandra as a temporary store.
>=20
> 2015-07-23 2:04 GMT+02:00 Renato Perini <renato.perini@gmail.com>:
>> Problem: Log analytics.
>>=20
>> Solutions:
>>        1) Aggregating logs using Flume and storing the aggregations into C=
assandra. Spark reads data from Cassandra, makes some computations
>> and writes the results in distinct tables, still in Cassandra.
>>        2) Aggregating logs using Flume to a sink, streaming data directly=
 into Spark. Spark makes some computations and stores the results in Cassa=
ndra.
>>        3) *** your solution ***
>>=20
>> Which is the best workflow for this task?
>> I would like to set up something flexible enough to allow me to use =
batch processing and realtime streaming without major fuss.
>>=20
>> Thank you in advance.
>=20

--Apple-Mail-8184EDC7-D611-46D8-8365-D32E65A9B2D2
Content-Type: text/html;
	charset=utf-8
Content-Transfer-Encoding: quoted-printable

<html><head><meta http-equiv=3D"content-type" content=3D"text/html; charset=3D=
utf-8"></head><body dir=3D"auto"><div>Though DSE Cassandra comes with Hadoop=
 integration, this is clearly a use case for Hadoop.&nbsp;</div><div>Any re=
ason why Cassandra is your first choice?<br><br><br></div><div><br>On 23 Jul=
 2015, at 6:12 a.m., Pierre Devops &lt;<a href=3D"mailto:pierredevops@gmail.=
com">pierredevops@gmail.com</a>&gt; wrote:<br><br></div><blockquote type=3D"=
cite"><div><div dir=3D"ltr">Cassandra is not very good at massive or bulk =
reads when you need to retrieve and compute over a large amount of data on=
 multiple machines using something like Spark or Hadoop (or you'll need to=
 process the SSTables directly, which is not natively supported, so you'll=
 have to hack your way).<div><br></div><div>However, it's very good at sto=
ring and retrieving data once it has been processed and sorted. That's why=
 I would opt for solution 2), or for another solution that processes data =
before inserting it into Cassandra and doesn't use Cassandra as a temporar=
y store.<=
br><div class=3D"gmail_extra"><br><div class=3D"gmail_quote">2015-07-23 2:04=
 GMT+02:00 Renato Perini <span dir=3D"ltr">&lt;<a href=3D"mailto:renato.peri=
ni@gmail.com" target=3D"_blank">renato.perini@gmail.com</a>&gt;</span>:<br><=
blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #=
ccc solid;padding-left:1ex">Problem: Log analytics.<br>
<br>
Solutions:<br>
&nbsp; &nbsp; &nbsp; &nbsp;1) Aggregating logs using Flume and storing the a=
ggregations into Cassandra. Spark reads data from Cassandra, makes some co=
mputations<br>
and writes the results in distinct tables, still in Cassandra.<br>
&nbsp; &nbsp; &nbsp; &nbsp;2) Aggregating logs using Flume to a sink, stream=
ing data directly into Spark. Spark makes some computations and stores the=
 results in Cassandra.<br>
&nbsp; &nbsp; &nbsp; &nbsp;3) *** your solution ***<br>
<br>
Which is the best workflow for this task?<br>
I would like to set up something flexible enough to allow me to use batch =
processing and realtime streaming without major fuss.<br>
<br>
Thank you in advance.<br>
<br>
<br>
<br>
</blockquote></div><br></div></div></div>
</div></blockquote></body></html>=

--Apple-Mail-8184EDC7-D611-46D8-8365-D32E65A9B2D2--