Date: Fri, 29 Apr 2016 03:55:12 +0000 (UTC)
From: "Stefania (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-11542) Create a benchmark to compare HDFS and Cassandra bulk read times

    [ https://issues.apache.org/jira/browse/CASSANDRA-11542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15263494#comment-15263494 ]

Stefania commented on CASSANDRA-11542:
--------------------------------------

These are the results with the Spark Connector modified to support the streaming proof of concept. Results are in seconds and represent the average of 5 different runs, see [^spark-load-perf-results-002.zip] for the raw data.
The improvement is approximately 30% for the RDD tests and 60% for the DF tests. Further, the data for schema 3 does not match the data observed in the previous run and the high variance continues to be observed.

|| ||SCHEMA 1|| ||SCHEMA 2|| ||SCHEMA 3|| ||SCHEMA 4|| ||
||Test||Time||Std. Dev||Time||Std. Dev||Time||Std. Dev||Time||Std. Dev||
|parquet_rdd|2.73|0.23|2.90|0.30|6.09|0.21|6.33|0.21|
|parquet_df|2.87|0.72|2.68|0.62|4.65|0.78|4.40|0.32|
|csv_rdd|5.31|0.21|5.18|0.24|6.58|0.11|6.50|0.12|
|csv_df|12.26|1.00|12.31|0.28|13.03|0.25|13.04|0.19|
|cassandra_rdd|49.72|2.80|46.57|2.77|19.75|0.58|39.83|18.72|
|cassandra_rdd_stream|35.20|3.61|32.45|1.13|15.47|1.32|27.78|8.20|
|cassandra_df|33.32|5.40|35.75|1.90|19.84|8.43|35.82|17.67|
|cassandra_df_stream|20.76|2.91|21.06|0.72|12.80|0.47|22.70|9.00|

I think there may be another dominating factor that explains these results, aside from the time it takes to receive data from Cassandra. The fact that the streaming improvement is more noticeable for DF rather than RDD tests, and significantly less than that noticed for [cassandra-stress benchmarks|https://issues.apache.org/jira/browse/CASSANDRA-9259?focusedCommentId=15228054&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15228054], may indicate that data decoding client-side plays a bigger role than streaming in the final performance results.

I am going to attach flight recorder to a Spark worker to see if this assumption is correct. I still think we need CASSANDRA-11520 and CASSANDRA-11521, but I just want to make sure we tackle the bigger "bang for the buck" first.
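The streaming improvement can be recomputed from the mean times in the table. The following is only a minimal sketch in Python: the function and variable names are illustrative and not part of the benchmark code, and the numbers are the Cassandra rows copied from the table. It reports both the reduction in elapsed time and the equivalent speedup, since "improvement" can be read either way.

```python
# Mean times in seconds for schemas 1-4, copied from the table above.
baseline = {
    "cassandra_rdd": [49.72, 46.57, 19.75, 39.83],
    "cassandra_df": [33.32, 35.75, 19.84, 35.82],
}
streaming = {
    "cassandra_rdd": [35.20, 32.45, 15.47, 27.78],
    "cassandra_df": [20.76, 21.06, 12.80, 22.70],
}


def time_reduction_pct(before, after):
    """Percentage of elapsed time saved by going from `before` to `after`."""
    return 100.0 * (before - after) / before


def speedup_pct(before, after):
    """Percentage increase in throughput (same work, less time)."""
    return 100.0 * (before / after - 1.0)


def summarize(test):
    """Return (schema, time reduction %, speedup %) for each schema of a test."""
    pairs = zip(baseline[test], streaming[test])
    return [
        (schema, time_reduction_pct(b, a), speedup_pct(b, a))
        for schema, (b, a) in enumerate(pairs, start=1)
    ]


if __name__ == "__main__":
    for test in baseline:
        for schema, reduction, speedup in summarize(test):
            print(f"{test} schema {schema}: "
                  f"{reduction:.1f}% less time, {speedup:.1f}% speedup")
```

This only uses the averages; it ignores the standard deviations, which are large for schema 4 and should be kept in mind when comparing individual cells.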
> Create a benchmark to compare HDFS and Cassandra bulk read times
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-11542
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11542
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Testing
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 3.x
>
>         Attachments: spark-load-perf-results-001.zip, spark-load-perf-results-002.zip
>
>
> I propose creating a benchmark for comparing Cassandra and HDFS bulk reading performance. Simple Spark queries will be performed on data stored in HDFS or Cassandra, and the entire duration will be measured. An example query would be the max or min of a column or a count(*).
> This benchmark should allow determining the impact of:
> * partition size
> * number of clustering columns
> * number of value columns (cells)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)