Subject: Re: Running Cassandra + Spark on AWS - architecture questions
From: DuyHai Doan
To: user@cassandra.apache.org
Date: Fri, 20 Feb 2015 22:43:07 +0100

"Cassandra would take care of keeping the data synced between the two sets of five nodes. Is that correct?"

Correct.

"But doing so means that we need 2x as many nodes as we need for the real-time cluster alone"

Not necessarily. With multi-DC you can configure the replication factor per DC, meaning that you can have RF=3 for the real-time DC and RF=1 or RF=2 for the analytics DC. Thus the number of nodes can be different for each DC.

In addition, you can also tune the hardware. If the real-time DC mostly takes writes of incoming data and serves reads from aggregated tables, it is less IO-intensive than the analytics DC, which does a lot of reads and writes to compute aggregations.

On Fri, Feb 20, 2015 at 10:17 PM, Clint Kelly <clint.kelly@gmail.com> wrote:

> Hi all,
>
> I read the DSE 4.6 documentation and I'm still not 100% sure what a mixed
> workload Cassandra + Spark installation would look like, especially on
> AWS. What I gather is that you use OpsCenter to set up the following:
>
>   - One "virtual data center" for real-time processing (e.g., ingestion
>     of time-series data, replying to requests for an interactive application)
>   - Another "virtual data center" for batch analytics (Spark, possibly
>     for machine learning)
>
> If I understand this correctly, if I estimate that I need a five-node
> cluster to handle all of my data, under the system described above, I would
> have five nodes serving real-time traffic and all of the data replicated in
> another five nodes that I use for batch processing. Cassandra would take
> care of keeping the data synced between the two sets of five nodes. Is
> that correct?
>
> I assume the motivation for such a dual-virtual-data-center architecture
> is to prevent the Spark jobs (which are going to do lots of scans from
> Cassandra, and maybe run computation on the same machines hosting
> Cassandra) from disrupting the real-time performance. But doing so means
> that we need 2x as many nodes as we need for the real-time cluster alone.
>
> *Could someone confirm that my interpretation above of what I read about
> in the DSE documentation is correct?*
>
> If my application needs to run analytics on Spark only a few hours a day,
> might we be better off spending our money to get a bigger Cassandra cluster
> and then just spin up Spark jobs on EMR for a few hours at night? (I know
> this is a hard question to answer, since it all depends on the
> application---just curious if anyone else here has had to make similar
> tradeoffs.) e.g., maybe instead of having a five-node real-time cluster,
> we would have an eight-node real-time cluster, and use our remaining budget
> on EMR jobs.
>
> I am curious if anyone has any thoughts / experience about this.
>
> Best regards,
> Clint
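
The per-DC replication point in the reply can be made concrete with a small sketch. It builds a CREATE KEYSPACE statement using NetworkTopologyStrategy with one replication factor per data center, then estimates how much of the dataset each node holds on average. The keyspace name (sensor_data), the DC names (realtime, analytics), and the node counts are assumptions chosen for illustration; they do not come from the thread, and real DC names have to match what the cluster's snitch reports. The estimate also assumes data is spread evenly across the nodes of each DC.

```python
def keyspace_cql(keyspace, rf_per_dc):
    """Build a CREATE KEYSPACE statement with one replication factor per DC."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {rf}" for dc, rf in rf_per_dc.items()]
    return (f"CREATE KEYSPACE {keyspace} WITH replication = "
            "{" + ", ".join(options) + "};")


def copies_per_node(rf, nodes):
    """Average fraction of the full dataset stored on each node of a DC,
    assuming an even token distribution."""
    return rf / nodes


# Hypothetical DC names and node counts, for illustration only.
rf_per_dc = {"realtime": 3, "analytics": 1}
nodes_per_dc = {"realtime": 5, "analytics": 3}

print(keyspace_cql("sensor_data", rf_per_dc))
for dc, rf in rf_per_dc.items():
    share = copies_per_node(rf, nodes_per_dc[dc])
    print(f"{dc}: RF={rf}, {nodes_per_dc[dc]} nodes, "
          f"each node stores about {share:.2f} of the dataset")
```

Under these assumed numbers, an analytics DC with RF=1 can hold a full copy of the data on fewer nodes than the real-time DC, which is why the second DC does not have to double the node count. Whether those fewer nodes can also carry the Spark scan load is a separate sizing question that this arithmetic does not answer.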