cassandra-user mailing list archives

From Diane Griffith <dfgriff...@gmail.com>
Subject Re: horizontal query scaling issues follow on
Date Thu, 17 Jul 2014 22:21:49 GMT
So do partitions equate to tokens/vnodes?

If so, we configured all cluster nodes/VMs with num_tokens: 256 rather than
setting initial_token and assigning ranges explicitly.  I am still not
getting why, in Cassandra 2.0, I would assign my own ranges via
initial_token; the documentation, and even this blog item
<http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2>, made it
seem right for us to always configure our cluster VMs with num_tokens: 256
in the cassandra.yaml file.
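For reference, the relevant fragment of our cassandra.yaml looks like this (a sketch; every node in the cluster uses the same setting, and all other settings are elided):

```yaml
# cassandra.yaml vnode configuration (same on every node in the cluster).
# With num_tokens set, each node picks 256 random token ranges at bootstrap.
num_tokens: 256

# initial_token is left unset; it is only needed for the one-token-per-node
# layout where ranges are computed and assigned by hand.
# initial_token:
```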

Also, in all testing the VMs were equally sized, so no node was more
powerful than another.

I didn't think I was hitting an I/O wall on the client VM (a separate VM
from the cluster) where we scripted our query calls to the Cassandra
cluster from the command line.  I can break the client load across
multiple VMs, which I tried early on; happy to verify that again, though.

Given that, I was assuming the data was spread over enough partitions that
this wasn't a problem.  Is that an incorrect assumption and something to
dig into more?
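For my own sanity check, here is a rough sketch of how I understand partition keys map onto vnode token ranges: each key hashes to a token, and each vnode owns a slice of the token space, so every vnode holds many partitions rather than partitions equating to vnodes. (MD5 stands in for Cassandra's actual Murmur3 partitioner here, and the node count, seed, and key names are made up for illustration.)

```python
# Sketch: partition keys hash to tokens; vnodes own token ranges.
# MD5 is a stand-in for Cassandra's Murmur3 partitioner -- the mapping
# logic, not the hash function, is the point.
import bisect
import hashlib
import random

NUM_NODES = 4
TOKENS_PER_NODE = 256          # num_tokens: 256 in cassandra.yaml
TOKEN_SPACE = 2 ** 64          # stand-in token space

random.seed(42)

# Each node picks TOKENS_PER_NODE random tokens; each token marks the
# upper end of one vnode's range.
ring = sorted(
    (random.randrange(TOKEN_SPACE), node)
    for node in range(NUM_NODES)
    for _ in range(TOKENS_PER_NODE)
)
tokens = [t for t, _ in ring]

def owning_node(partition_key: str) -> int:
    """Hash the partition key to a token and find the vnode owning it."""
    token = int.from_bytes(
        hashlib.md5(partition_key.encode()).digest()[:8], "big")
    idx = bisect.bisect_left(tokens, token) % len(ring)
    return ring[idx][1]

# Unique random row keys mean one partition per row, spread across
# NUM_NODES * TOKENS_PER_NODE = 1024 vnode ranges -- thousands of
# partitions per vnode, not one partition per vnode.
counts = [0] * NUM_NODES
for i in range(100_000):       # small sample standing in for the 18M keys
    counts[owning_node(f"rowkey-{i}")] += 1
print(counts)                  # roughly even split across the 4 nodes
```

If this model is right, our 18 million unique row keys mean 18 million partitions spread over all vnodes, which is why I assumed partitioning wasn't the bottleneck.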

Thanks,
Diane


On Thu, Jul 17, 2014 at 3:01 PM, Jack Krupansky <jack@basetechnology.com>
wrote:

>   How many partitions are you spreading those 18 million rows over? That
> many rows in a single partition will not be a sweet spot for Cassandra.
> It’s not exceeding any hard limit (2 billion), but some internal operations
> may cache the partition rather than the logical row.
>
> And all those rows in a single partition would certainly not be a test of
> “horizontal scaling” (adding nodes to handle more data – more token values
> or partitions.)
>
> -- Jack Krupansky
>
>  *From:* Diane Griffith <dfgriffith@gmail.com>
> *Sent:* Thursday, July 17, 2014 1:33 PM
> *To:* user <user@cassandra.apache.org>
> *Subject:* horizontal query scaling issues follow on
>
>
> This is a follow-on re-post to clarify what we are trying to do,
> providing information that was missing or unclear in the original.
>
>
>
> Goal:  Verify horizontal scaling for random, non-duplicating key reads
> using the simplest (or most minimal) configuration possible.
>
>
>
> Background:
>
> A couple years ago we did similar performance testing with Cassandra for
> both read and write performance and found excellent (essentially linear)
> horizontal scalability.  That project got put on hold.  We are now moving
> forward with an operational system and are having scaling problems.
>
>
>
> During the prior testing (3 years ago) we were using a much older version
> of Cassandra (0.8 or older), the THRIFT API, and Amazon AWS rather than
> OpenStack VMs.  We are now using the latest Cassandra and the CQL
> interface.  We did try moving from OpenStack to AWS/EC2 but that did not
> materially change our (poor) results.
>
>
>
> Test Procedure:
>
>    - Inserted 54 million cells in 18 million rows (3 cells per row),
>    using randomly generated row keys. That was to be the data control for
>    the test.
>    - Spawned a client on a different VM to query 100k rows, repeated for
>    100 reps.  Each row key queried is drawn randomly from the set of
>    existing row keys and then not re-used, so all 10 million row queries
>    use a different (valid) row key.  This test is a specific use case of
>    our system that we are trying to show will scale.
>
> Result:
>
>    - 2 nodes performed better than 1 node, but 4 nodes showed decreased
>    performance relative to 2 nodes, so the test did not demonstrate
>    horizontal scaling.
>
>
>
> Notes:
>
>    - We have the replication factor set to 1, to keep the control test
>    simple while proving out horizontal scaling.
>    - When we tried adding threading to see if it would help, it showed
>    interesting side behavior that still did not demonstrate horizontal
>    scaling.
>    - We are using CQL rather than the Thrift API, on Cassandra 2.0.6.
>
>
>
>
>
> Does anyone have feedback on whether threading or a higher replication
> factor is necessary to demonstrate horizontal scaling in Cassandra,
> versus the minimal approach of simply continuing to add nodes to increase
> throughput?
>
>
>
> Any suggestions for the minimal configuration needed to show scaling for
> our query use case: 100k requests for random, non-repeating keys arriving
> continuously over a period of time?
>
>
> Thanks,
>
> Diane
>
