cassandra-commits mailing list archives

From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11542) Create a benchmark to compare HDFS and Cassandra bulk read times
Date Mon, 02 May 2016 08:53:12 GMT


Stefania commented on CASSANDRA-11542:

bq. Saw that we are always doing the conversion to CassandraRow with RDDs, dataframes go directly
to the internal SQL Type. 

In the dataframe tests we also retrieve only the two columns needed for the calculation rather
than all columns. I described this above, sorry if it wasn't clear.

bq. The code you presented looks good to me. There is the potential issue of blocking on result sets
that take a long time to complete while other result sets are already on the driver, but I'm
not sure if this is a big deal. Do you have any idea of the parallelization in these tests?
How many partitions are the different runs generating?

Thanks for checking the code. The result set futures should not block because the driver completes
them as soon as they are transferred to the iterator's thread. I'm actually using futures
as a lazy way to also transfer error conditions rather than just results.
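
To illustrate the idea, here is a minimal sketch, not the actual connector or driver code. All names ({{fetch_pages}}, {{iterate_rows}}) are hypothetical, and Python's {{concurrent.futures.Future}} stands in for the driver's result set future. The producer hands over futures that are already completed, so {{result()}} on the iterator's thread returns at once or re-raises the producer's error. Any waiting happens on the queue, not on the future.

```python
import queue
from concurrent.futures import Future


def fetch_pages(pages, out):
    """Producer: wrap each page, or the failure, in an already-completed future."""
    try:
        for page in pages:
            fut = Future()
            fut.set_result(page)      # completed before the consumer ever sees it
            out.put(fut)
    except Exception as exc:          # hypothetical error path, e.g. a read timeout
        fut = Future()
        fut.set_exception(exc)        # the error travels through the same channel
        out.put(fut)
    finally:
        out.put(None)                 # sentinel: no more pages


def iterate_rows(out):
    """Consumer: runs on the iterator's thread and yields rows page by page."""
    while True:
        fut = out.get()               # the only place where the consumer may wait
        if fut is None:
            return
        for row in fut.result():      # never blocks; re-raises a producer error
            yield row
```

Running {{fetch_pages}} on a background thread and consuming {{iterate_rows}} on the caller's thread yields the rows in order, and a failure in the producer surfaces as an exception in the consumer.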

In terms of parallelism, each C* node receives 256 token range queries per RDD iteration.
This should be fine since each node has 256 tokens. I've also checked the Spark tasks by connecting
to the web UI on port 4040: initially I could see 10 tasks per Cassandra RDD operation, and this
increased to 20 when I raised the number of executor cores to 4. I have 5 nodes with 2 executors
each, so the initial 10 makes sense because by default there is one core per executor. However,
I don't understand why I ended up with 20 rather than 40 with 4 cores per
executor. {{}} is [here|]
if you want to check it out, but there's not much to it other than the number of executor cores.
I also note that the CSV and Parquet RDD operations have as many tasks as there are HDFS partitions,
so 1000 tasks. This would give them a big advantage if we have idle cores, but I don't know
how to reliably increase the number of tasks for C* RDDs.
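
The arithmetic behind those numbers can be sketched as follows. This is only a back-of-the-envelope model with hypothetical helper names, and it assumes each task occupies one core (Spark's default of {{spark.task.cpus=1}}). It does not explain why only 20 tasks were scheduled; it only shows how many cores would sit idle in that case.

```python
def task_slots(nodes, executors_per_node, cores_per_executor):
    """Number of tasks that can run at once, assuming one core per task."""
    return nodes * executors_per_node * cores_per_executor


def concurrent_tasks(num_tasks, slots):
    """Tasks actually running at once: limited by both the tasks and the slots."""
    return min(num_tasks, slots)


slots_1_core = task_slots(5, 2, 1)    # 10 slots with the default single core
slots_4_cores = task_slots(5, 2, 4)   # 40 slots with 4 cores per executor

# Observed: 20 tasks per Cassandra RDD operation, so half of the 40 slots are idle.
idle_cassandra = slots_4_cores - concurrent_tasks(20, slots_4_cores)

# CSV and Parquet: 1000 tasks (one per HDFS partition), so every slot stays busy.
idle_hdfs = slots_4_cores - concurrent_tasks(1000, slots_4_cores)

print(slots_1_core, slots_4_cores, idle_cassandra, idle_hdfs)
```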

I've collected JFR files for both Cassandra and the Spark executors: [^].
I still need to analyze them but from a quick look there are at least two interesting things
client side (plus maybe a third one): we seem to spend a lot of time in {{CassandraRow._indexOfOrThrow()}}
and in selecting the codecs in the driver. As for the C* JFR recording, we spend 80% of the time
in the new bulk read code, but we also still spend 15% of the time in {{ReadCommandVerbHandler}},
which I don't understand.

I will post another update when I have more details on the JFR analysis and any optimizations
that might follow.

> Create a benchmark to compare HDFS and Cassandra bulk read times
> ----------------------------------------------------------------
>                 Key: CASSANDRA-11542
>                 URL:
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Testing
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 3.x
>         Attachments:,,
> I propose creating a benchmark for comparing Cassandra and HDFS bulk reading performance.
> Simple Spark queries will be performed on data stored in HDFS or Cassandra, and the entire
> duration will be measured. An example query would be the max or min of a column or a count(*).
> This benchmark should allow determining the impact of:
> * partition size
> * number of clustering columns
> * number of value columns (cells)
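
As a rough illustration of how such a benchmark could be organized, the sketch below enumerates the combinations of the three factors and times one query end to end. Every name and every parameter value here is hypothetical, and the stand-in query is plain Python rather than a Spark job. A real run would replace it with the Spark max, min or count query against HDFS or Cassandra.

```python
import itertools
import time


def benchmark_matrix(partition_sizes, clustering_columns, value_columns):
    """One configuration per combination of the three factors under test."""
    return [
        {"partition_size": p, "clustering_columns": c, "value_columns": v}
        for p, c, v in itertools.product(
            partition_sizes, clustering_columns, value_columns
        )
    ]


def timed(query):
    """Run a zero-argument query and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = query()
    return result, time.perf_counter() - start


# Hypothetical values, chosen only to show the shape of the matrix.
matrix = benchmark_matrix([100, 10_000], [0, 1, 2], [1, 10])

for config in matrix:
    # Stand-in for "max of a column"; a real benchmark would run the Spark query.
    result, elapsed = timed(lambda: max(range(config["partition_size"])))
    print(config, result, f"{elapsed:.6f}s")
```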

