From: Simon Chemouil <schemouil@gmail.com>
Date: Mon, 02 Jun 2014 16:25:25 +0200
To: user@cassandra.apache.org
Subject: Re: Performance migrating from MySQL to C*

Thanks for your reply.

About that bit:

> "If you get to this situation, the solution is not to monitor how
> strangled Cassandra is - the solution is to come up with a data model
> that avoids the strangulation. CQL is a nice syntactic layer, but, at
> the end of the day, to avoid performance black holes, you have to
> understand how the data is going to be stored by Cassandra in rows
> and columns."

I interpret this as: minimize the number of queries I send, and find
the cutoff point between querying individual sensorIds directly (when
their number is reasonable, to stay under a few hundred concurrent
queries for a single "business" query) and retrieving all the data and
filtering afterwards (client-side, or server-side if that ever becomes
possible, to avoid the IO and (de)serialization overhead).

I already tried this, but my data model still consumed most of our
available CPUs and left very little room for other concurrent queries
(whereas our existing MySQL solution seems to handle the same query on
a single CPU in the same amount of time). I'll experiment further with
your advice and see where it leads.
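Concretely, the client-side throttling I have in mind looks something
like this: a fixed number of permits so we never have more than N
queries in flight, with aggregation done as results come back. This is
a rough, untested sketch against the driver 2.0 async API (and the
Guava futures it ships with); the 100-permit cap and the class itself
are mine and would need tuning, and the table is the second data model
from my original mail below.

import java.util.Date;
import java.util.List;
import java.util.concurrent.Semaphore;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

public class ThrottledReads {
    // Hard cap on in-flight queries; the value is a guess to be tuned.
    private static final int MAX_IN_FLIGHT = 100;
    private final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);
    private final Session session;
    private final PreparedStatement select;

    public ThrottledReads(Session session) {
        this.session = session;
        // Second data model from my original mail:
        // one partition per (dataName, sensorId).
        this.select = session.prepare(
            "SELECT time, dataValue FROM sensorData"
            + " WHERE dataName = ? AND sensorId = ? AND time >= ? AND time <= ?");
    }

    /** One query per sensorId, never more than MAX_IN_FLIGHT on the wire. */
    public void queryAll(String dataName, List<Long> sensorIds, Date from, Date to)
            throws InterruptedException {
        for (long sensorId : sensorIds) {
            permits.acquire(); // blocks the producer instead of flooding the cluster
            ResultSetFuture f =
                session.executeAsync(select.bind(dataName, sensorId, from, to));
            Futures.addCallback(f, new FutureCallback<ResultSet>() {
                @Override public void onSuccess(ResultSet rs) {
                    try {
                        for (Row row : rs) {
                            // ... feed (time, dataValue) into the running aggregate ...
                        }
                    } finally {
                        permits.release();
                    }
                }
                @Override public void onFailure(Throwable t) {
                    permits.release();
                    // ... log / retry ...
                }
            });
        }
    }
}

Blocking the producer thread on acquire() is deliberate: it is exactly
the "keep track of the number of running queries" bookkeeping, just
hidden inside a semaphore.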
Thanks again,
Simon

On 02/06/2014 14:03, moshe.kranc@barclays.com wrote:
> From your email, I understand your use case a bit better - I now see
> that you want to query not just by dataName, but also by sensorId.
>
> Still, it seems like the major filter for the query is the dataName
> (you search for a few dozen at a time). Within that, you want to
> filter on some (potentially large) number of sensorIds.
> If this is true, the data model I proposed is optimal: you will need
> to read all the rows for the given dataNames being queried - this
> should be a relatively small number of rows. Then you will have to
> filter the desired sensorIds out of the result rows, via software
> that you write, running in your application, that processes the
> returned rows.
>
> If I still misunderstand you, and the major filter for your queries
> is sometimes a set of dataNames and sometimes a set of sensorIds,
> then you should consider denormalizing, i.e., saving the data in two
> tables, one of which uses dataName as the row key (to be queried when
> dataName is the major filter), and one of which uses sensorId as the
> row key (to be queried when sensorId is the major filter).
>
> In either case, you will end up doing additional filtering on the
> returned rows in your software, in order to apply the minor filter
> key. IMHO, asking Cassandra to do this for you means (potentially)
> asking Cassandra to do full table scans, which is performance
> suicide. If you get to this situation, the solution is not to monitor
> how strangled Cassandra is - the solution is to come up with a data
> model that avoids the strangulation. CQL is a nice syntactic layer,
> but, at the end of the day, to avoid performance black holes, you
> have to understand how the data is going to be stored by Cassandra in
> rows and columns.
>
> -----Original Message-----
> From: Simon Chemouil [mailto:schemouil@gmail.com]
> Sent: Monday, June 02, 2014 10:56 AM
> To: user@cassandra.apache.org
> Subject: Re: Performance migrating from MySQL to C*
>
> Hi Moshe,
>
> Thanks for your answer and the link on time series.
> We'd like to query on more than one dataName, but also on a time
> range and on an arbitrary number of sensorIds, which we can't seem to
> do in CQL because we can't have multiple IN clauses, or IN clauses on
> the primary key (hopefully this restriction will be lifted soon).
>
> It seems to me that your solution won't let us filter on both time
> and sensorId (because of current CQL limitations), and we'll have to
> issue a query for each sensorId *and* discriminator *and* dayRange
> *and* dataName. We have hundreds of thousands of sensorIds, we want
> to be able to query over one, two, or six months, ..., and we have
> maybe a dozen dataNames to query on (out of about ten thousand).
>
> I'm afraid this implies many, many queries, so how can we send
> Cassandra that many queries without making it unresponsive, as
> happens now? Are we supposed to queue our queries client-side with
> the Java driver and keep track of the number of running queries,
> compared to the expected capacity of our cluster?
>
> Thanks again!
>
> Cheers,
> Simon
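Inline note: to make the query-count worry above concrete, here is
roughly the fan-out a single "business" query produces under the
dayRange/discriminator scheme from Moshe's earlier mail (quoted below).
This is only a sketch: X=5 and the 4 discriminators (one per node in
our 4-node cluster) are assumptions, and it presumes a statement
prepared as "SELECT ... WHERE dataName = ? AND dayRange = ? AND
discriminator = ?", with the time-slice bounds elided for brevity.

import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;

public class PartitionFanOut {
    static final int DAY_RANGE_WIDTH = 5; // Moshe's suggested X=5
    static final int DISCRIMINATORS = 4;  // one per node in a 4-node cluster

    // One statement per (dataName, dayRange, discriminator) partition
    // covering [fromDay, toDay] (days since the epoch). A dozen dataNames
    // over 6 months is already 12 * (180/5) * 4 = 1728 statements, before
    // sensorIds even enter the picture.
    static List<BoundStatement> statementsFor(PreparedStatement select,
                                              List<String> dataNames,
                                              int fromDay, int toDay) {
        List<BoundStatement> out = new ArrayList<BoundStatement>();
        for (String dataName : dataNames) {
            for (int day = fromDay - (fromDay % DAY_RANGE_WIDTH);
                 day <= toDay; day += DAY_RANGE_WIDTH) {
                for (int disc = 0; disc < DISCRIMINATORS; disc++) {
                    out.add(select.bind(dataName, day, disc));
                }
            }
        }
        return out;
    }
}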
> On 28/05/2014 18:23, moshe.kranc@barclays.com wrote:
>> Just looking at the data modeling issue:
>>
>> Your queries seem to always be for a single dataName. So, that
>> should be the main part of the row key.
>>
>> Within that, it seems you need to be able to select a range based on
>> time. So, time should be the primary sort key for the column name.
>>
>> Based on those requirements, I'd suggest you define the table as:
>>
>>     row key: dataName, dayRange, discriminator
>>     column name: time, sensorId
>>     column value: dataValue
>>
>> As you can see, I've added a couple of fields to the row key:
>>
>> * dayRange: to prevent all the values for a dataName from forming
>>   one monstrous row, break it up into chunks of X days. Set X too
>>   small and you'll have to perform a lot of row queries to answer
>>   queries about months or years. Set X too large and you'll have to
>>   do too much I/O for queries that require only a day of info. I
>>   suggest X=5.
>>
>> * discriminator: to prevent hot spots. If all the writes for a given
>>   dataName over a 5-day period go to the same C* node, you have a
>>   hot spot. To prevent this, add this discriminator field and
>>   increment it on every write, modulo the number of C* nodes in your
>>   cluster. (See
>>   http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
>>   for a much better explanation of this.)
>>
>> In CQL terms, I believe it would look like this:
>>
>> CREATE TABLE sensorData (
>>     dataName TEXT,
>>     dayRange int,
>>     discriminator int,
>>     time TIMESTAMP,
>>     sensorId bigint,
>>     dataValue DOUBLE,
>>     PRIMARY KEY ((dataName, dayRange, discriminator), time, sensorId)
>> );
>>
>> Hope this helps.
>>
>> From: Simon Chemouil [mailto:schemouil@gmail.com]
>> Sent: Wednesday, May 28, 2014 6:26 PM
>> To: user@cassandra.apache.org
>> Subject: Performance migrating from MySQL to C*
>>
>> Hi,
>>
>> First, sorry for the length of this mail. TL;DR: data-modeling time
>> series with an extra dimension; C* not handling stress well; MySQL
>> doesn't scale as well but handles the queries much better on similar
>> hardware.
>>
>> ==============
>> Context:
>>
>> We've been evaluating Cassandra for about a month now as a
>> replacement for our current MySQL-based solution. While we're very
>> interested in Cassandra's scalability promises, the results we've
>> had so far are not as good as we expected.
>>
>> Our system provides near real-time analytics on our (quite large,
>> but definitely not 'Big Data') dataset, and we are beginning to
>> wonder whether Cassandra is the right tool or we're simply doing
>> something wrong. We've spent a lot of effort internally trying to
>> validate our usage of C*, so I would appreciate any pointers or
>> ideas.
>>
>> I have read that Cassandra is not so good when it comes to reads, or
>> that it is better suited to returning smaller result sets, but I've
>> also noticed it being more and more used and advertised as a Big
>> Data solution (e.g. the recent partnership between Databricks and
>> DataStax).
>>
>> The problem we are modeling is this: we have sensors (millions of
>> them) of different types (thousands of them), each reporting many
>> pieces of data (typed double) every 5 minutes (00:00, 00:05, 00:10,
>> ..., 23:50, 23:55). That's 288 timestamped values per day, or about
>> 105K per year, per piece of data per sensor.
>>
>> We want to be able, for instance, to compute the average value of a
>> given piece of data over a given set of sensors for a month, as fast
>> as possible.
>>
>> ==============
>> Setup:
>>
>> Cassandra 2.0.7, on 32-core Linux x86_64 machines with 128 GB of
>> RAM, using XFS on 4 TB SSDs. DataStax Java Driver 2.0.2 with
>> -Xmx16G. All queries use PreparedStatements.
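Inline note: for reference, the client side boils down to the
following; the contact point and keyspace name are placeholders, and
the statement shown is just one of ours.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class ClientSetup {
    public static void main(String[] args) {
        // Placeholder contact point; the driver discovers the other nodes.
        Cluster cluster = Cluster.builder()
                .addContactPoint("172.17.11.101")
                .build();
        Session session = cluster.connect("sensors"); // placeholder keyspace

        // Everything goes through PreparedStatements, so after the first
        // round-trip only the bound values travel over the wire.
        PreparedStatement select = session.prepare(
                "SELECT time, dataValue FROM sensorData"
                + " WHERE dataName = ? AND sensorId = ?"
                + " AND time >= ? AND time <= ?");
        // ... bind and execute ...
    }
}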
>> ==============
>> Data Model:
>>
>> We've tried several data models for this:
>>
>> CREATE TABLE sensorData (
>>     day timestamp,
>>     sensorId bigint,
>>     time timestamp,
>>     values map<text, double>,
>>     PRIMARY KEY ((day, sensorId), time)
>> );
>>
>> In this model we cram all the data gathered by a single sensor into
>> a map, so that we can perform computations on the fly when we fetch
>> it. The problem is that we sometimes have up to 10K values stored
>> where we'd like to retrieve only 10, and Cassandra not only cannot
>> select just the map keys we're interested in, it also cannot
>> partially read the cell... which makes these queries slow.
>>
>> Instead, we've moved towards storing each value as its own row, with
>> this model:
>>
>> CREATE TABLE sensorData (
>>     sensorId bigint,
>>     time TIMESTAMP,
>>     dataName TEXT,
>>     dataValue DOUBLE,
>>     PRIMARY KEY ((dataName, sensorId), time)
>> );
>>
>> Now we have to synchronize the time field client-side, which is a
>> bit costly, but at least we only fetch the data we need. We removed
>> the day component (which was used to further partition the data) and
>> put dataName in instead.
>>
>> We've also tried changing the compaction strategy (to
>> LeveledCompactionStrategy), disabling compression, and generally
>> tweaking our tables, without any noticeable gain.
>>
>> Do these models seem OK for our purpose? They work fine with a few
>> hundred sensors, but how can we query 300K sensorIds without killing
>> Cassandra?
>>
>> I tried adding a secondary index on an extra field (sensorTypeId) to
>> fetch everything and filter client-side, but then we lose the
>> ability to slice on time. I tried encoding extra information in the
>> table name itself (e.g. sensorData_<day>), but not only is it ugly,
>> it also multiplies the number of queries we send by the number of
>> days we query, and the number of queries we send already seems too
>> high for Cassandra.
>>
>> ==============
>> Query volume:
>>
>> We want our queries to span from a few sensorIds to hundreds of
>> thousands of them. We issue queries such as:
>>
>> SELECT * FROM sensorData
>> WHERE dataName = 'yyy'
>>   AND sensorId IN (<sensorIds>)
>>   AND time >= <start> AND time <= <end>;
>>
>> We are extremely limited in the size of the IN list. I read that IN
>> queries are not meant for large sets, but if we issue several
>> queries with smaller sets we often end up in the same situation:
>> timeout exceptions in the Java driver and, quite often, dead
>> Cassandra nodes.
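Inline note: the splitting we do looks roughly like this (a sketch;
the chunk size of 100 is the kind of empirical cap I mean). Combined
with the semaphore throttle from my note at the top it bounds what is
in flight, yet the cluster still ends up doing the same total work,
and we still hit the timeouts quoted below.

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

import static com.datastax.driver.core.querybuilder.QueryBuilder.*;

public class ChunkedQueries {
    static final int CHUNK = 100; // empirical cap per IN list

    static List<ResultSetFuture> queryInChunks(Session session, String dataName,
                                               List<Long> sensorIds,
                                               Date from, Date to) {
        List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
        for (int i = 0; i < sensorIds.size(); i += CHUNK) {
            List<Long> chunk =
                sensorIds.subList(i, Math.min(i + CHUNK, sensorIds.size()));
            // One bounded IN list per statement instead of a single huge
            // one; sensorId is the last component of the partition key,
            // so IN is allowed here.
            Statement stmt = select().all().from("sensorData")
                    .where(eq("dataName", dataName))
                    .and(in("sensorId", chunk.toArray()))
                    .and(gte("time", from))
                    .and(lte("time", to));
            futures.add(session.executeAsync(stmt));
        }
        return futures;
    }
}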
>> These are the kind of exceptions we often get:
>>
>> Exception in thread "Thread-4029" Exception in thread "Thread-3972"
>> com.datastax.driver.core.exceptions.NoHostAvailableException:
>> All host(s) tried for query failed (tried: [/172.17.11.104:9042,
>> /172.17.11.103:9042, /172.17.11.102:9042, /172.17.11.101:9042]
>> - use getErrors() for details)
>>   at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65)
>>   at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:256)
>>   at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:172)
>>   at com.datastax.driver.core.SessionManager.execute(SessionManager.java:92)
>>   at com.davfx.cassandra.TestClient$1.run(TestClient.java:140)
>>   at java.lang.Thread.run(Thread.java:745)
>> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
>> All host(s) tried for query failed (tried: /172.17.11.104:9042
>> (com.datastax.driver.core.exceptions.DriverException: Timeout during read),
>> /172.17.11.103:9042
>> (com.datastax.driver.core.exceptions.DriverException: Timeout during read))
>>   at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:103)
>>   at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:175)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>   ... 1 more
>>
>> (Before starting the query, Cassandra runs fine and shows up as
>> healthy in OpsCenter; smaller queries run fine. getErrors() doesn't
>> provide much more information.)
>>
>> I get that these are timeouts, but is changing the timeout settings
>> in cassandra.yaml the right solution? It feels hackish. At other
>> times, Cassandra nodes just 'die', and we get the sensation that
>> Cassandra is a bit 'fragile' (we broke our cluster several times
>> during our tests, by sending too many queries concurrently or simply
>> by creating a table just after dropping it). Is there a recommended
>> way to avoid stressing Cassandra too much, without manually keeping
>> track of ongoing queries client-side?
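Inline note: for what it's worth, the read timeout can also be raised
driver-side, without touching cassandra.yaml (a sketch; the 30 s value
is arbitrary). It only hides the symptom, of course - the server still
does the same amount of work.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

public class TimeoutTuning {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("172.17.11.101") // placeholder
                // Per-request client-side timeout; this does not change
                // the server's read_request_timeout_in_ms.
                .withSocketOptions(
                        new SocketOptions().setReadTimeoutMillis(30000))
                .build();
        // ...
    }
}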
>> I noticed several JIRA issues that would probably help us reduce
>> this overhead (either useless data transferred over the wire, or
>> being forced to flood Cassandra with SELECT queries) by allowing
>> multiple IN clauses (CASSANDRA-4762) or server-side filtering
>> (CASSANDRA-4914, CASSANDRA-6377), but they haven't been updated in a
>> while.
>>
>> ==============
>> Performance:
>>
>> Finally, when we avoid stressing our cluster too much, we do manage
>> to run our queries ('our' queries translate into several Cassandra
>> queries). However, where the same query takes MySQL a single CPU
>> core at 100%, Cassandra drives all 32 of our cores to 100% and
>> doesn't reply any faster. In practice we've found MySQL able to run
>> several queries concurrently, also with a performance loss, but not
>> to the extent of Cassandra.
>>
>> We're looking at Cassandra today because we know its scaling
>> capability is far superior to MySQL's. We know that adding servers
>> will help us increase throughput dramatically, but we must also be
>> able to keep decent performance on a setup similar to what we're
>> currently running. We are thus facing several problems:
>>
>> * It is a very hard sell if the performance is too far from MySQL's
>> (comparing both latency and the ability to handle several concurrent
>> queries). We're not comparing on a single server, because we know
>> MySQL has been around longer and is very optimized at what it does,
>> but we'd at least expect Cassandra to do as well with 3 servers as
>> MySQL does with 2. We've been unable to demonstrate that so far :(.
>>
>> * When we stress Cassandra, we get timeouts, very high load, and
>> sometimes an unresponsive process (it doesn't necessarily
>> 'crash')... But since we are limited in the queries we can express,
>> we have no choice but to split them into many smaller queries (where
>> a single SQL query would do), which seems to be a significant
>> overhead. This is probably partly misuse on our side (even though
>> we're simply using the DataStax Java driver and sending a few
>> queries with probably too many elements in the IN relation on the
>> last component of the partition key). Is there any recommended
>> (preferably built-in) way to let Cassandra breathe while we send our
>> queries, so that we don't crash it?
>>
>> * Our problem is two-dimensional: we query on a time range, but also
>> on a set of sensorIds (up to 500K). This is a hard problem in
>> general, and we pre-compute as much as we can to denormalize (e.g.
>> giving an identifier to a frequently used sensor set; see the sketch
>> at the end of this mail), but our queries are very dynamic and we
>> can only do so much. While most NoSQL datastores don't seem to have
>> a smarter solution for this, we've found that MySQL copes quite well
>> (by using different indices or by filtering server-side). How can we
>> best model this in Cassandra while keeping its strengths? Can we
>> expect improvements in C* to help with this kind of query?
>>
>> Which finally brings us to the most important question: do you feel
>> Cassandra is fit for our use case? I've seen Cassandra advertised as
>> a 'Big Data' solution a lot (and we're working with smaller
>> datasets), so I'd expect it to be more resilient to stressful usage
>> and more feature-complete when it comes to reading large datasets...
>> Maybe I have overlooked some pieces of documentation. We would be OK
>> with adjusting Cassandra to fit our needs and contributing to the
>> project, but we have to make sure the project is going in a similar
>> direction (big data, etc.).
>>
>> Thanks for your help; comments are greatly appreciated.
>>
>> Simon
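PS: regarding the pre-computed sensor sets mentioned in the last
bullet above, the kind of denormalized table I have in mind looks like
this (a sketch; the names and the dayRange bucketing are illustrative,
not something we run today):

import com.datastax.driver.core.Session;

public class SensorSetSchema {
    static void createTable(Session session) {
        // A frequently-used set of sensors gets a setId. Writes are
        // fanned out to every set a sensor belongs to, so reads become
        // one partition per (setId, dataName, dayRange) instead of a
        // huge IN list of sensorIds.
        session.execute(
            "CREATE TABLE sensorSetData ("
            + "  setId bigint,"
            + "  dataName text,"
            + "  dayRange int,"
            + "  time timestamp,"
            + "  sensorId bigint,"
            + "  dataValue double,"
            + "  PRIMARY KEY ((setId, dataName, dayRange), time, sensorId)"
            + ")");
    }
}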