From: "Stefania (JIRA)"
To: commits@cassandra.apache.org
Date: Mon, 2 May 2016 08:53:12 +0000 (UTC)
Subject: [jira] [Commented] (CASSANDRA-11542) Create a benchmark to compare HDFS and Cassandra bulk read times

    [ https://issues.apache.org/jira/browse/CASSANDRA-11542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266239#comment-15266239 ]

Stefania commented on CASSANDRA-11542:
--------------------------------------

bq. Saw that we are always doing the conversion to CassandraRow with RDDs; dataframes go directly to the internal SQL type.

In the dataframe tests we also only retrieve the two columns needed for the calculation rather than all columns. I described this above; sorry if it wasn't clear.

bq. The code you presented looks good to me. There is the potential issue of blocking on result sets that take a long time to complete while other result sets are already on the driver, but I'm not sure if this is a big deal. Do you have any idea of the parallelization in these tests? How many partitions are the different runs generating?

Thanks for checking the code. The result set futures should not block because the driver completes them as soon as they are transferred to the iterator's thread. I'm actually using futures as a lazy way to also transfer error conditions rather than just results.
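The pattern is roughly the following (a minimal sketch, not the actual benchmark code — it assumes the Java driver's {{executeAsync}}/{{ResultSetFuture}} API, and the method name and per-range query strings are hypothetical):

{code:scala}
import scala.collection.JavaConverters._
import com.datastax.driver.core.{ResultSetFuture, Row, Session}

// Fire one async query per token range; the driver completes each future on
// its own I/O threads, so collecting the futures here never blocks. The
// iterator's thread then drains them in order: a failed range query surfaces
// as an exception from getUninterruptibly(), so the futures carry error
// conditions to the consuming thread as well as results.
def tokenRangeRows(session: Session, rangeQueries: Seq[String]): Iterator[Row] = {
  val futures: Seq[ResultSetFuture] = rangeQueries.map(q => session.executeAsync(q))
  futures.iterator.flatMap(f => f.getUninterruptibly().iterator().asScala)
}
{code}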
In terms of parallelism, each C* node receives 256 token range queries per RDD iteration. This should be fine since each node has 256 tokens. I've also checked the Spark tasks by connecting to the web UI on port 4040: initially I could see 10 tasks per Cassandra RDD operation, which increased to 20 when I raised the number of executor cores to 4. I have 5 nodes with 2 executors each, so the initial number of 10 makes sense because by default there is one core per executor; however, I don't understand why I ended up with 20 rather than 40 when I increased the number of cores to 4. {{spark-env.sh}} is [here|https://github.com/stef1927/spark-load-perf/blob/master/bin/install_spark.sh#L34] if you want to check it out, but there's not much to it other than the number of executor cores. I also note that the CSV and Parquet RDD operations have as many tasks as there are HDFS partitions, so 1000 tasks. This would give them a big advantage if we have idle cores, but I don't know how to reliably increase the number of tasks for C* RDDs.

I've collected JFR files for both Cassandra and the Spark executors: [^jfr_recordings.zip]. I still need to analyze them, but from a quick look there are at least two interesting things client side (plus maybe a third): we seem to spend a lot of time in {{CassandraRow._indexOfOrThrow()}} and in selecting the codecs in the driver. As for the C* JFR recording, we spend 80% of the time in the new bulk read code, but we also still spend 15% in {{ReadCommandVerbHandler}}, which I don't understand.

I will post another update when I have more details from the JFR analysis and any optimizations that might follow.


> Create a benchmark to compare HDFS and Cassandra bulk read times
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-11542
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11542
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Testing
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 3.x
>
>         Attachments: jfr_recordings.zip, spark-load-perf-results-001.zip, spark-load-perf-results-002.zip
>
>
> I propose creating a benchmark for comparing Cassandra and HDFS bulk reading performance. Simple Spark queries will be performed on data stored in HDFS or Cassandra, and the entire duration will be measured. An example query would be the max or min of a column or a count\(*\).
> This benchmark should allow determining the impact of:
> * partition size
> * number of clustering columns
> * number of value columns (cells)
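For reference, the kind of query the description mentions could be sketched as follows with the Spark 1.6 DataFrame API (assuming the spark-cassandra-connector data source; the keyspace, table, and column names are hypothetical placeholders, not the actual benchmark schema):

{code:scala}
// Assumes the Spark 1.6 shell, where sc and sqlContext are predefined, and
// the spark-cassandra-connector on the classpath. "ks", "t" and "value1"
// are hypothetical names.
import org.apache.spark.sql.functions.max

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "t"))
  .load()
  .select("value1")          // retrieve only the column the calculation needs

// The timed operation: a simple aggregate over the whole table.
df.agg(max("value1")).show()
{code}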