cassandra-commits mailing list archives

From "Russell Alexander Spitzer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11542) Create a benchmark to compare HDFS and Cassandra bulk read times
Date Fri, 29 Apr 2016 05:10:13 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15263553#comment-15263553 ]

Russell Alexander Spitzer commented on CASSANDRA-11542:
-------------------------------------------------------

You may also want to run tests that read into case classes rather than CassandraRows:

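{code}
import com.datastax.spark.connector._
case class RowName(col: Type, col2: Type, ....)
sc.cassandraTable[RowName]("keyspace", "table")  // placeholder keyspace and table names
{code}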

This may explain some of the difference between RDD and DataFrame read times: by default,
DataFrames read into a different row format (SqlRows) than RDDs do (CassandraRows), and case
classes should be much more efficient than the map-based CassandraRows. In addition, I think
the Parquet versions are able to skip full counts (because of the metadata), which may give
them the advantage over CSV, but I'm not really sure about that. Again, I'm not sure; it could
just be the compression of repeated values.
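
A minimal sketch of such a comparison follows. It assumes a hypothetical keyspace "ks" and table "tbl" with int columns "id" and "value", and a Cassandra node reachable at 127.0.0.1; none of these names come from the benchmark itself. The time helper is illustrative only and measures wall-clock time for a single run.

{code}
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical schema: ks.tbl (id int PRIMARY KEY, value int)
case class RowName(id: Int, value: Int)

object ReadComparison {
  // Illustrative helper: runs the body once and prints the elapsed wall-clock time.
  def time[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("read-comparison")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed address
    val sc = new SparkContext(conf)

    // Default read: each row is a generic CassandraRow, columns looked up by name.
    time("CassandraRow") {
      sc.cassandraTable("ks", "tbl").map(_.getInt("value")).max()
    }

    // Typed read: each row is mapped directly into the case class.
    time("case class") {
      sc.cassandraTable[RowName]("ks", "tbl").map(_.value).max()
    }

    sc.stop()
  }
}
{code}

A single timed run like this is noisy, so repeating each read several times and discarding the first run would likely give a fairer picture.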

> Create a benchmark to compare HDFS and Cassandra bulk read times
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-11542
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11542
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Testing
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 3.x
>
>         Attachments: spark-load-perf-results-001.zip, spark-load-perf-results-002.zip
>
>
> I propose creating a benchmark for comparing Cassandra and HDFS bulk reading performance.
> Simple Spark queries will be performed on data stored in HDFS or Cassandra, and the entire
> duration will be measured. An example query would be the max or min of a column or a count(*).
> This benchmark should allow determining the impact of:
> * partition size
> * number of clustering columns
> * number of value columns (cells)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
