cassandra-commits mailing list archives

From Piotr Kołaczkowski (JIRA) <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-7296) Add CL.COORDINATOR_ONLY
Date Wed, 17 Dec 2014 09:39:15 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249655#comment-14249655
] 

Piotr Kołaczkowski edited comment on CASSANDRA-7296 at 12/17/14 9:39 AM:
-------------------------------------------------------------------------

Honestly, I don't like this idea for Spark, for the following reasons:

# It seems to add quite a lot of complexity to handle the following cases:
  ** If RF > 1, what do we do to avoid fetching duplicate rows?
  ** If we decide on the primary token range only, what do we do if one of the nodes fails and
some primary token ranges have no node to query from?
  ** What if the amount of data is large enough that we'd actually like to split token ranges
so that they are smaller and there are more Spark tasks? This is important for bigger jobs,
to protect against sudden failures and to avoid recomputing too much in case of a lost Spark
partition.
  ** How do we fetch data from the same node in parallel? Currently it is perfectly fine to
have one Spark node use multiple cores (mappers) that fetch data from the same coordinator
node separately.
# It is trying to solve a theoretical problem which hasn't been shown to exist in practice yet.
  ** Russell Spitzer benchmarked vnodes on small/medium/large data sets. There was no significant
difference on larger data sets, and only a tiny difference on really small sets (where the
constant cost of the query is higher than the cost of fetching the data).
  ** No customers have reported vnodes to be a problem for them.
  ** Theoretical reason: if the data is too large to fit in the page cache (hundreds of GBs
on a single node), 256 additional random seeks are not going to cause a huge penalty, because:
  *** some of them can be hidden by splitting those queries between separate Spark threads,
so they are submitted and executed in parallel;
  *** each token range will be *hundreds* of MBs in size, which is large enough to hide one
or two seeks.
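To make the range-assignment concerns above concrete, here is a toy sketch in plain Python (all data structures and names are hypothetical, not the connector's actual API): it picks one replica per token range so each row is read only once under RF > 1, falls back to another replica when the primary is down, and splits large ranges into smaller pieces so bigger jobs get more Spark tasks.

```python
# Hypothetical mapping: (range_start, range_end) -> ordered replica list,
# with the primary replica first (RF = 3).
token_ranges = {
    (0, 100):   ["node1", "node2", "node3"],
    (100, 200): ["node2", "node3", "node1"],
    (200, 300): ["node3", "node1", "node2"],
}

def assign_ranges(token_ranges, live_nodes):
    """Pick one live replica per range (primary first) so rows aren't fetched twice."""
    assignment = {}
    for rng, replicas in token_ranges.items():
        owner = next((r for r in replicas if r in live_nodes), None)
        if owner is None:
            raise RuntimeError("no live replica for range %s" % (rng,))
        assignment[rng] = owner
    return assignment

def split_range(rng, max_size):
    """Split a token range into sub-ranges of at most max_size tokens each."""
    start, end = rng
    out = []
    while start < end:
        nxt = min(start + max_size, end)
        out.append((start, nxt))
        start = nxt
    return out

# node2 is down: its primary range falls back to the next replica in the list.
assignment = assign_ranges(token_ranges, live_nodes={"node1", "node3"})
print(assignment[(100, 200)])       # served by node3 instead of node2
print(split_range((0, 100), 30))    # [(0, 30), (30, 60), (60, 90), (90, 100)]
```

Even this toy version has to deal with failed replicas and range splitting explicitly, which is the complexity a coordinator-only consistency level would push onto every client.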

Some *real* performance problems we (and users) observed:
 * Cassandra burns plenty of CPU when doing sequential scans. It is not possible to saturate
the bandwidth of even a single laptop spinning HDD, because all cores of an i7 CPU @ 2.4 GHz
are 100% busy processing those small CQL cells: merging rows from different SSTables, ordering
cells, filtering out tombstones, serializing, etc. The problem doesn't go away after doing a
full compaction or disabling vnodes. This is a serious problem, because running exactly the
same query on a plain text file stored in CFS (still C*, but with data stored as 2 MB blobs)
gives a 3-30x performance boost (depending on who did the benchmark). We need to close this
gap. See: https://datastax.jira.com/browse/DSP-3670
 * We need to improve the backpressure mechanism, at least so that the driver or the Spark
connector knows to start throttling writes when the cluster doesn't keep up. Currently
Cassandra just times the writes out, but once that happens, the driver has no clue how long
to wait before it is ok to resubmit the update. It would actually be good to know well before
timing out, so we could slow down and avoid wasteful retrying altogether. Currently it is not
possible to predict cluster load by e.g. observing write latency, because the latency is
extremely good until it is suddenly terrible (timeout). This is also important for other,
non-Spark use cases. See https://issues.apache.org/jira/browse/CASSANDRA-7937.
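A minimal sketch of what such client-side throttling could look like, assuming a hypothetical AIMD-style window (this is not an existing driver API): halve the number of allowed in-flight writes when a timeout occurs, and grow it back slowly on success, instead of blindly retrying at full speed.

```python
# Hypothetical client-side write throttle (AIMD: additive increase,
# multiplicative decrease), reacting to the only overload signal the
# cluster currently gives us -- the write timeout.
class WriteThrottle:
    def __init__(self, max_inflight=1024, min_inflight=8):
        self.max_inflight = max_inflight
        self.min_inflight = min_inflight
        self.limit = max_inflight  # current cap on concurrent writes

    def on_timeout(self):
        # Cluster can't keep up: halve the window (multiplicative decrease).
        self.limit = max(self.min_inflight, self.limit // 2)

    def on_success(self):
        # Cluster is healthy: recover one slot at a time (additive increase).
        self.limit = min(self.max_inflight, self.limit + 1)

t = WriteThrottle()
t.on_timeout()   # 1024 -> 512
t.on_timeout()   # 512 -> 256
t.on_success()   # 256 -> 257
print(t.limit)   # 257
```

The point of CASSANDRA-7937 is that reacting only after a timeout, as above, is already too late; an explicit overload signal from the server would let the client slow down before writes start failing.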





> Add CL.COORDINATOR_ONLY
> -----------------------
>
>                 Key: CASSANDRA-7296
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7296
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Tupshin Harper
>
> For reasons such as CASSANDRA-6340 and similar, it would be nice to have a read that
never gets distributed, and only works if the coordinator you are talking to is an owner of
the row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
