cassandra-user mailing list archives

From: Jeremy Jongsma <>
Subject: Re: Large number of row keys in query kills cluster
Date: Wed, 11 Jun 2014 14:33:27 GMT
I'm using Astyanax with a query like this:

  .getKeySlice(new String[] {
    // 20,000 keys here...
  })
  .execute();
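A chunked version of the same read, as a rough sketch (the CF_INSTRUMENTS
handle, serializers, and batch size are illustrative, not from this thread):

    import java.util.ArrayList;
    import java.util.List;

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.model.Row;
    import com.netflix.astyanax.model.Rows;
    import com.netflix.astyanax.serializers.LongSerializer;
    import com.netflix.astyanax.serializers.StringSerializer;

    public class BatchedKeySlice {
        // Hypothetical handle for the instruments table (bigint keys, text column names).
        static final ColumnFamily<Long, String> CF_INSTRUMENTS =
                new ColumnFamily<Long, String>("instruments",
                        LongSerializer.get(), StringSerializer.get());

        // Fetch the given keys a few hundred at a time so no single request
        // asks one node to buffer thousands of rows at once.
        static List<Row<Long, String>> fetchByKey(Keyspace keyspace,
                List<Long> keys, int batchSize) throws ConnectionException {
            List<Row<Long, String>> out = new ArrayList<Row<Long, String>>();
            for (int i = 0; i < keys.size(); i += batchSize) {
                List<Long> batch =
                        keys.subList(i, Math.min(i + batchSize, keys.size()));
                Rows<Long, String> rows = keyspace.prepareQuery(CF_INSTRUMENTS)
                        .getKeySlice(batch)  // bounded key slice per request
                        .execute()
                        .getResult();
                for (Row<Long, String> row : rows) {
                    out.add(row);
                }
            }
            return out;
        }
    }

Capping each request this way bounds what any one node has to materialize,
instead of handing it all 20,000 keys in a single call.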

The first time this query executes (making the cluster unresponsive), there
are zero rows in the column family. The schema is below, pretty standard:

CREATE KEYSPACE instruments WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'aws-us-east-1': '2'
};

CREATE TABLE instruments (
  key bigint PRIMARY KEY,
  definition blob,
  id bigint,
  name text,
  symbol text,
  updated bigint
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};
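
For comparison with Michael's IN-clause question below, the rough CQL3
equivalent of that key slice is a single statement carrying all 20,000 keys
in one IN list (the key values here are placeholders):

  SELECT key, definition, id, name, symbol, updated
  FROM instruments
  WHERE key IN (1, 2, 3 /* ... 20,000 keys ... */);

Either form makes one coordinator responsible for fanning out and buffering
every row in the request.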

On Tue, Jun 10, 2014 at 6:35 PM, Laing, Michael <> wrote:

> Perhaps if you described both the schema and the query in more detail, we
> could help... e.g. did the query have an IN clause with 20000 keys? Or is
> the key compound? More detail will help.
> On Tue, Jun 10, 2014 at 7:15 PM, Jeremy Jongsma <>
> wrote:
>> I didn't explain clearly - I'm not requesting 20000 unknown keys
>> (resulting in a full scan), I'm requesting 20000 specific rows by key.
>> On Jun 10, 2014 6:02 PM, "DuyHai Doan" <> wrote:
>>> Hello Jeremy
>>> Basically, what you are doing is asking Cassandra to do a distributed
>>> full scan of all the partitions across the cluster; it's normal that the
>>> nodes are somewhat... stressed.
>>> How did you make the query? Are you using the Thrift or CQL3 API?
>>> Please note that there is another way to get all partition keys: SELECT
>>> DISTINCT <partition_key> FROM ..., more details here:
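>>> For example, with the instruments table above, that would be:
>>>   SELECT DISTINCT key FROM instruments;
>>> This returns each partition key once without pulling the row contents.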
>>> I ran an application today that attempted to fetch 20,000+ unique row
>>> keys in one query against a set of completely empty column families. On a
>>> 4-node cluster (EC2 m1.large instances) with the recommended memory
>>> settings (2 GB heap), every single node immediately ran out of memory and
>>> became unresponsive, to the point where I had to kill -9 the cassandra
>>> processes.
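>>> (For reference, heap size is set in conf/cassandra-env.sh, e.g.:
>>>   MAX_HEAP_SIZE="2G"
>>>   HEAP_NEWSIZE="400M"
>>> though the exact new-gen size here is illustrative.)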
>>> Now clearly this query is not the best idea in the world, but the
>>> effects of it are a bit disturbing. What could be going on here? Are there
>>> any other query pitfalls I should be aware of that have the potential to
>>> explode the entire cluster?
>>> -j
