cassandra-user mailing list archives

From Diane Griffith <dfgriff...@gmail.com>
Subject Re: horizontal query scaling issues follow on
Date Fri, 18 Jul 2014 18:43:49 GMT
Okay here are the data samples.

Column Family Schema again:
CREATE TABLE IF NOT EXISTS foo (key text, col_name text, col_value text,
PRIMARY KEY(key, col_name))

CQL Write:

INSERT INTO foo (key, col_name, col_value) VALUES
('Type1:1109dccb-169b-40ef-b7f8-d072f04d8139',
'HISTORY:2011-04-20T09:19:13.072-0400',
'{"key":"1109dccb-169b-40ef-b7f8-d072f04d8139","keyType":"Type1","state":"state1","timestamp":1303305553072,"eventId":40902,"executionId":31082}')



CQL Read:



SELECT col_value FROM foo WHERE
key='Type1:1109dccb-169b-40ef-b7f8-d072f04d8139' AND col_name='LATEST'



Read result from above query:



{"key":"1109dccb-169b-40ef-b7f8-d072f04d8139","keyType":"Type1","state":"state3","timestamp":1303446284614,"eventId":7688,"executionId":40847}





CQL snippet example of select * from foo limit 8:



 key                                        | col_name                             | col_value
--------------------------------------------+--------------------------------------+-----------
 Type1:1109dccb-169b-40ef-b7f8-d072f04d8139 | HISTORY:2011-04-20T09:19:13.072-0400 | {"key":"1109dccb-169b-40ef-b7f8-d072f04d8139","keyType":"Type1","state":"state1","timestamp":1303305553072,"eventId":40902,"executionId":31082}
 Type1:1109dccb-169b-40ef-b7f8-d072f04d8139 | HISTORY:2011-04-20T13:47:33.512-0400 | {"key":"1109dccb-169b-40ef-b7f8-d072f04d8139","keyType":"Type1","state":"state2","timestamp":1303321653512,"eventId":32660,"executionId":33510}
 Type1:1109dccb-169b-40ef-b7f8-d072f04d8139 | HISTORY:2011-04-22T00:24:44.614-0400 | {"key":"1109dccb-169b-40ef-b7f8-d072f04d8139","keyType":"Type1","state":"state3","timestamp":1303446284614,"eventId":7688,"executionId":40847}
 Type1:1109dccb-169b-40ef-b7f8-d072f04d8139 | LATEST                               | {"key":"1109dccb-169b-40ef-b7f8-d072f04d8139","keyType":"Type1","state":"state3","timestamp":1303446284614,"eventId":7688,"executionId":40847}
 Type2:e876d44d-246f-40c5-b5a3-4d0eb31db00d | HISTORY:2010-08-26T03:45:43.366-0400 | {"key":"e876d44d-246f-40c5-b5a3-4d0eb31db00d","keyType":"Type2","state":"state1","timestamp":1282808743366,"eventId":33332,"executionId":6214}
 Type2:e876d44d-246f-40c5-b5a3-4d0eb31db00d | HISTORY:2010-08-26T04:58:46.810-0400 | {"key":"e876d44d-246f-40c5-b5a3-4d0eb31db00d","keyType":"Type2","state":"state2","timestamp":1282813126810,"eventId":48575,"executionId":22318}
 Type2:e876d44d-246f-40c5-b5a3-4d0eb31db00d | HISTORY:2010-08-27T22:39:51.036-0400 | {"key":"e876d44d-246f-40c5-b5a3-4d0eb31db00d","keyType":"Type2","state":"state2","timestamp":1282963191036,"eventId":21960,"executionId":5067}
 Type2:e876d44d-246f-40c5-b5a3-4d0eb31db00d | LATEST                               | {"key":"e876d44d-246f-40c5-b5a3-4d0eb31db00d","keyType":"Type2","state":"state2","timestamp":1282963191036,"eventId":21960,"executionId":5067}


For the select * example above, given how I have defined the primary key so
that the schema supports dynamic wide rows, it was my understanding that the
output really equates to 2 physical rows (partitions), each with 4 cells.  So
I should have 18 million physical rows, but with 4 entries inserted for each
key, a select count(*) from foo reports 72 million rows if I add a limit
clause large enough to let it scan all rows.
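
As a rough illustration of that counting, here is a minimal Python sketch that groups the (key, col_name) pairs from the select * sample by partition key. It is not part of the test harness; the function name cells_per_partition is made up for this example, and the scale-up simply assumes every one of the 18 million keys has 4 entries like the two keys shown.

```python
from collections import defaultdict

# The eight (key, col_name) pairs from the select * sample.
# col_value is left out because only the primary key matters for counting.
sample_rows = [
    ("Type1:1109dccb-169b-40ef-b7f8-d072f04d8139", "HISTORY:2011-04-20T09:19:13.072-0400"),
    ("Type1:1109dccb-169b-40ef-b7f8-d072f04d8139", "HISTORY:2011-04-20T13:47:33.512-0400"),
    ("Type1:1109dccb-169b-40ef-b7f8-d072f04d8139", "HISTORY:2011-04-22T00:24:44.614-0400"),
    ("Type1:1109dccb-169b-40ef-b7f8-d072f04d8139", "LATEST"),
    ("Type2:e876d44d-246f-40c5-b5a3-4d0eb31db00d", "HISTORY:2010-08-26T03:45:43.366-0400"),
    ("Type2:e876d44d-246f-40c5-b5a3-4d0eb31db00d", "HISTORY:2010-08-26T04:58:46.810-0400"),
    ("Type2:e876d44d-246f-40c5-b5a3-4d0eb31db00d", "HISTORY:2010-08-27T22:39:51.036-0400"),
    ("Type2:e876d44d-246f-40c5-b5a3-4d0eb31db00d", "LATEST"),
]


def cells_per_partition(rows):
    """Group CQL rows by partition key and count the clustering entries in each."""
    counts = defaultdict(int)
    for key, _col_name in rows:
        counts[key] += 1
    return dict(counts)


counts = cells_per_partition(sample_rows)
partitions = len(counts)        # distinct partition keys ("physical rows")
cql_rows = sum(counts.values())  # what select count(*) would count for the sample

# Assumption: all 18 million keys carry 4 entries, as the two sampled keys do.
total_cql_rows = 18_000_000 * 4

print(partitions, cql_rows, total_cql_rows)
```

With PRIMARY KEY(key, col_name), key is the partition key and col_name is the clustering column, so each distinct key is one partition and each (key, col_name) pair is one CQL row.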

Does anything seem like it is hurting our chances to horizontally scale
with the data/schema?

Thanks,
Diane


On Fri, Jul 18, 2014 at 6:46 AM, Benedict Elliott Smith <
belliottsmith@datastax.com> wrote:

> How many columns are you inserting/querying per key? Could we see some
> example CQL statements for the insert/read workload?
>
> If you are maxing out at 10 clients, something fishy is going on. In
> general, though, if you find that adding nodes causes performance to
> degrade I would suspect that you are querying data in one CQL statement
> that is spread over multiple partitions, and so extra work needs to be done
> cross-cluster to service your requests as more nodes are added.
>
> I would also consider what effect the file cache may be having on your
> workload, as it sounds small enough to fit in memory, so is likely a major
> determining factor for performance of your benchmark. As you try different
> client levels for the smaller cluster you may see improved performance as
> the data is pulled into file cache across test runs, and then when you
> build your larger cluster this is lost so performance appears to degrade
> (for instance).
>
>
> On Fri, Jul 18, 2014 at 12:25 PM, Diane Griffith <dfgriffith@gmail.com>
> wrote:
>
>> The column family schema is:
>>
>> CREATE TABLE IF NOT EXISTS foo (key text, col_name text, col_value text,
>> PRIMARY KEY(key, col_name))
>>
>> where the key is a generated uuid and all keys were inserted in random
>> order but in the end we were compacting down to one sstable per node.
>>
>> So we were doing it this way to achieve dynamic columns.
>>
>> Thanks,
>> Diane
>>
>> On Fri, Jul 18, 2014 at 12:19 AM, Jack Krupansky <jack@basetechnology.com
>> > wrote:
>>
>>>   Sorry I may have confused the discussion by mentioning tokens – I
>>> wasn’t intending to refer to vnodes or the num_tokens property, but merely
>>> referring to the token range of a node and that the partition key hashes to
>>> a token value.
>>>
>>> The main question is what you use for your primary key and whether you
>>> are using a small number of partition keys and a large number of clustering
>>> columns, or does each row have a unique partition key and no clustering
>>> columns.
>>>
>>> -- Jack Krupansky
>>>
>>>  *From:* Diane Griffith <dfgriffith@gmail.com>
>>> *Sent:* Thursday, July 17, 2014 6:21 PM
>>> *To:* user <user@cassandra.apache.org>
>>> *Subject:* Re: horizontal query scaling issues follow on
>>>
>>>  So do partitions equate to tokens/vnodes?
>>>
>>> If so we had configured all cluster nodes/vms with num_tokens: 256
>>> instead of setting initial_token and assigning ranges.  I still don't see
>>> why, in Cassandra 2.0, I would assign my own ranges via initial_token; our
>>> choice was based on the documentation and even this blog item
>>> <http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2>, which
>>> made it seem right for us to always configure our cluster vms with
>>> num_tokens: 256 in the cassandra.yaml file.
>>>
>>> Also in all testing, all vms were of equal sizing so one was not more
>>> powerful than another.
>>>
>>> I didn't think I was hitting an i/o wall on the client vm (separate vm)
>>> where we command line scripted our query call to the cassandra cluster.
>>> I can break the client call load across vms which I tried early on.  Happy
>>> to verify that again though.
>>>
>>> So given that I was assuming the partitions were such that it wasn't a
>>> problem.  Is that an incorrect assumption and something to dig into more?
>>>
>>> Thanks,
>>> Diane
>>>
>>>
>>> On Thu, Jul 17, 2014 at 3:01 PM, Jack Krupansky <jack@basetechnology.com
>>> > wrote:
>>>
>>>>   How many partitions are you spreading those 18 million rows over?
>>>> That many rows in a single partition will not be a sweet spot for
>>>> Cassandra. It’s not exceeding any hard limit (2 billion), but some internal
>>>> operations may cache the partition rather than the logical row.
>>>>
>>>> And all those rows in a single partition would certainly not be a test
>>>> of “horizontal scaling” (adding nodes to handle more data – more token
>>>> values or partitions.)
>>>>
>>>> -- Jack Krupansky
>>>>
>>>>  *From:* Diane Griffith <dfgriffith@gmail.com>
>>>> *Sent:* Thursday, July 17, 2014 1:33 PM
>>>> *To:* user <user@cassandra.apache.org>
>>>> *Subject:* horizontal query scaling issues follow on
>>>>
>>>>
>>>> This is a follow on re-post to clarify what we are trying to do,
>>>> providing information that was missing or not clear.
>>>>
>>>>
>>>>
>>>> Goal:  Verify horizontal scaling for random non duplicating key reads
>>>> using the simplest configuration (or minimal configuration) possible.
>>>>
>>>>
>>>>
>>>> Background:
>>>>
>>>> A couple years ago we did similar performance testing with Cassandra
>>>> for both read and write performance and found excellent (essentially
>>>> linear) horizontal scalability.  That project got put on hold.  We are now
>>>> moving forward with an operational system and are having scaling problems.
>>>>
>>>>
>>>>
>>>> During the prior testing (3 years ago) we were using a much older
>>>> version of Cassandra (0.8 or older), the THRIFT API, and Amazon AWS rather
>>>> than OpenStack VMs.  We are now using the latest Cassandra and the CQL
>>>> interface.  We did try moving from OpenStack to AWS/EC2 but that did not
>>>> materially change our (poor) results.
>>>>
>>>>
>>>>
>>>> Test Procedure:
>>>>
>>>>    - Inserted 54 million cells in 18 million rows (so 3 cells per
>>>>    row), using randomly generated row keys. That was to be our data control
>>>>    for the test.
>>>>    - Spawn a client on a different VM to query 100k rows and do that
>>>>    for 100 reps.  Each row key queried is drawn randomly from the set of
>>>>    existing row keys, and then not re-used, so all 10 million row queries
>>>>    use a different (valid) row key.  This test is a specific use case of our
>>>>    system we are trying to show will scale
>>>>
>>>> Result:
>>>>
>>>>    - 2 nodes performed better than 1 node test but 4 nodes showed
>>>>    decreased performance over 2 nodes.  So that did not show horizontal scaling
>>>>
>>>>
>>>>
>>>> Notes:
>>>>
>>>>    - We have replication factor set to 1 as we were trying to keep the
>>>>    control test simple to prove out horizontal scaling.
>>>>    - When we tried to add threading to see if it would help it had
>>>>    interesting side behavior which did not prove out horizontal scaling.
>>>>    - We are using CQL versus THRIFT API for Cassandra 2.0.6
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Does anyone have any feedback on whether either threading or replication
>>>> factor is necessary to show horizontal scaling of Cassandra, versus the
>>>> minimal approach of just continuing to add nodes to help throughput?
>>>>
>>>>
>>>>
>>>> Any suggestions for the minimal configuration necessary to show scaling of
>>>> our query use case: 100k requests for random non-repeating keys constantly
>>>> coming in over a period of time?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Diane
>>>>
>>>
>>>
>>
>>
>
