From: Ipremyadav
To: "user@cassandra.apache.org"
Subject: Re: Cassandra - Spark - Flume: best architecture for log analytics.
Date: Thu, 23 Jul 2015 08:51:25 +0100
Message-Id: <2ED3F552-F20F-4B7A-AD26-1E4AF4828FAE@gmail.com>

Though DSE Cassandra comes with Hadoop integration, this is clearly a use case for Hadoop.
Any reason why Cassandra is your first choice?

> On 23 Jul 2015, at 6:12 a.m., Pierre Devops wrote:
>
> Cassandra is not very good at massive/bulk reads, i.e. when you need to retrieve and compute over a large amount of data on multiple machines using something like Spark or Hadoop (unless you process the SSTables directly, which is not "natively" supported, so you'll have to hack your way).
>
> However, it's very good at storing and retrieving data once it has been processed and sorted. That's why I would opt for solution 2), or for another solution which processes data before inserting it into Cassandra and doesn't use Cassandra as a temporary store.
>
> 2015-07-23 2:04 GMT+02:00 Renato Perini:
>> Problem: Log analytics.
>>
>> Solutions:
>>        1) Aggregating logs using Flume and storing the aggregations in Cassandra. Spark reads data from Cassandra, makes some computations
>> and writes the results to distinct tables, still in Cassandra.
>>        2) Aggregating logs using Flume to a sink, streaming data directly into Spark. Spark makes some computations and stores the results in Cassandra.
>>        3) *** your solution ***
>>
>> Which is the best workflow for this task?
>> I would like to set up something flexible enough to allow me to use batch processing and realtime streaming without major fuss.
>>
>> Thank you in advance.
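
For illustration, the "process the data before inserting it" idea behind solution 2) could look roughly like the sketch below. It is plain Python using only the standard library, not Spark or Cassandra code. The log line format ("<ISO timestamp> <LEVEL> <message>"), the function name aggregate_logs and the sample lines are all assumptions made up for this example. The point is only that raw lines are reduced to a few sorted (minute, level, count) rows before anything is written to the store.

```python
from collections import Counter
from datetime import datetime


def aggregate_logs(lines):
    """Count log events per (minute, level).

    Hypothetical helper: assumes each line looks like
    "<YYYY-MM-DDTHH:MM:SS> <LEVEL> <message>".
    """
    counts = Counter()
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # too few fields, skip malformed line
        timestamp, level, _message = parts
        try:
            ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # first field is not a timestamp, skip line
        minute = ts.replace(second=0).isoformat()
        counts[(minute, level)] += 1
    # Sorted, compact rows: this is what would be written to the store,
    # instead of the raw log lines.
    return sorted((minute, level, n) for (minute, level), n in counts.items())


# Made-up sample input for the sketch.
SAMPLE = [
    "2015-07-23T07:51:40 INFO request served",
    "2015-07-23T07:51:42 ERROR upstream timeout",
    "2015-07-23T07:51:56 INFO request served",
    "2015-07-23T07:52:01 INFO request served",
    "not a log line",
]

rows = aggregate_logs(SAMPLE)
for row in rows:
    print(row)
```

In a real pipeline the same reduction would run inside the stream processor, and only the resulting rows would be inserted, so the store never serves as a temporary holding area for raw logs.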