From: "Stefania (JIRA)"
To: commits@cassandra.apache.org
Date: Mon, 2 May 2016 08:53:12 +0000 (UTC)
Subject: [jira] [Commented] (CASSANDRA-11542) Create a benchmark to compare HDFS and Cassandra bulk read times

    [ https://issues.apache.org/jira/browse/CASSANDRA-11542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266239#comment-15266239 ]

Stefania commented on CASSANDRA-11542:
--------------------------------------

bq. Saw that we are always doing the conversion to CassandraRow with RDDs; dataframes go directly to the internal SQL type.

In the dataframe tests we also only retrieve the two columns needed for the calculation rather than all columns. I described this above; sorry if it wasn't clear.

bq. The code you presented looks good to me. There is the potential issue of blocking on result sets that take a long time to complete while other result sets are already on the driver, but I'm not sure if this is a big deal. Do you have any idea of the parallelization in these tests? How many partitions are the different runs generating?

Thanks for checking the code. The result set futures should not block because the driver completes them as soon as they are transferred to the iterator's thread. I'm actually using futures as a lazy way to also transfer error conditions rather than just results.
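The pattern is roughly the following (a minimal sketch, not the actual benchmark code — it assumes the Java driver's {{executeAsync}}/{{ResultSetFuture}} API, and the method name and per-range query strings are hypothetical):

{code:scala}
import scala.collection.JavaConverters._
import com.datastax.driver.core.{ResultSetFuture, Row, Session}

// Fire one async query per token range; the driver completes each future on
// its own I/O threads, so collecting the futures here never blocks. The
// iterator's thread then drains them in order: a failed range query surfaces
// as an exception from getUninterruptibly(), so the futures carry error
// conditions to the consuming thread as well as results.
def tokenRangeRows(session: Session, rangeQueries: Seq[String]): Iterator[Row] = {
  val futures: Seq[ResultSetFuture] = rangeQueries.map(q => session.executeAsync(q))
  futures.iterator.flatMap(f => f.getUninterruptibly().iterator().asScala)
}
{code}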
In terms of parallelism, each C* node receives 256 token range queries per RDD iteration. This should be fine since each node has 256 tokens. I've also checked the Spark tasks by connecting to the web UI on port 4040: initially I could see 10 tasks per Cassandra RDD operation, which increased to 20 when I raised the number of executor cores to 4. I have 5 nodes with 2 executors each, so the initial number of 10 makes sense because by default there is one core per executor; however, I don't understand why I ended up with 20 rather than 40 when I increased the number of cores to 4. {{spark-env.sh}} is [here|https://github.com/stef1927/spark-load-perf/blob/master/bin/install_spark.sh#L34] if you want to check it out, but there's not much to it other than the number of executor cores. I also note that the CSV and Parquet RDD operations have as many tasks as there are HDFS partitions, so 1000 tasks. This would give them a big advantage if we have idle cores, but I don't know how to reliably increase the number of tasks for C* RDDs.

I've collected JFR files for both Cassandra and the Spark executors: [^jfr_recordings.zip]. I still need to analyze them, but from a quick look there are at least two interesting things client side (plus maybe a third): we seem to spend a lot of time in {{CassandraRow._indexOfOrThrow()}} and in selecting the codecs in the driver. As for the C* JFR recording, we spend 80% of the time in the new bulk read code, but we also still spend 15% in {{ReadCommandVerbHandler}}, which I don't understand.

I will post another update when I have more details from the JFR analysis and any optimizations that might follow.


> Create a benchmark to compare HDFS and Cassandra bulk read times
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-11542
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11542
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Testing
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 3.x
>
>         Attachments: jfr_recordings.zip, spark-load-perf-results-001.zip, spark-load-perf-results-002.zip
>
>
> I propose creating a benchmark for comparing Cassandra and HDFS bulk reading performance. Simple Spark queries will be performed on data stored in HDFS or Cassandra, and the entire duration will be measured. An example query would be the max or min of a column or a count\(*\).
> This benchmark should allow determining the impact of:
> * partition size
> * number of clustering columns
> * number of value columns (cells)
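For reference, the kind of query the description mentions could be sketched as follows with the Spark 1.6 DataFrame API (assuming the spark-cassandra-connector data source; the keyspace, table, and column names are hypothetical placeholders, not the actual benchmark schema):

{code:scala}
// Assumes the Spark 1.6 shell, where sc and sqlContext are predefined, and
// the spark-cassandra-connector on the classpath. "ks", "t" and "value1"
// are hypothetical names.
import org.apache.spark.sql.functions.max

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "t"))
  .load()
  .select("value1")          // retrieve only the column the calculation needs

// The timed operation: a simple aggregate over the whole table.
df.agg(max("value1")).show()
{code}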