Hi Sylvain and Aaron,

Thanks for the comment Sylvain, what you say makes sense. I have microsecond-precision timestamps, and looking at some row printouts I see that everything happens at a distinct timestamp, which means the comparator almost never has to compare the second, ~100-byte component.
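If it helps, the effect can be modelled with plain Python tuples (a toy sketch of how byte-wise composite comparison short-circuits, not Cassandra's actual comparator code):

```python
# Toy model (not Cassandra code): composite column names compare
# component by component, like Python tuples, and each component
# compares byte by byte.
ts = 1333190000000000                  # microsecond timestamp
id_a = "x" * 98 + "01"                 # ~100-char IDs sharing a
id_b = "x" * 98 + "02"                 # long common prefix

# "Timestamp:ID": unique timestamps decide every comparison on the
# 8-byte first component; the ~100-byte ID is never examined.
assert (ts, id_a) < (ts + 1, id_b)

# "ID:Timestamp": each comparison has to walk the ID bytes (here a
# 98-char shared prefix) before it can decide the order.
assert (id_a, ts + 1) < (id_b, ts)
```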

As for the methodology, it's not very thorough. I used Cassandra 0.8.5.

What I did: I had acquired a large data set, about 300 hrs' worth, in Schema 1 (details below). Some queries were easily hitting thousands of rows, giving me very poor performance. I converted this data to Schema 2 (details below), grouping the data together in the same row and increasing the row's time bucket, with two versions of the column names: "Timestamp:ID" and "ID:Timestamp". The result is a CF with 66 rows: for each column-name version, 11 rows for each of the 3 types of data source that dominate the rates of info I get (each row is a 24 hr time bucket).

These are the results I got using the CompositeQueryIterator (with a modified maximum of 100,000 columns returned per slice) from the composite query tutorial at http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1 (code at https://github.com/zznate/cassandra-tutorial). I used null for both the slice start and end, so each query reads an entire row at a time, and I timed my code. The actual values are doubles for all 3 types. The size column is the file size after dumping the results to a text file.
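For reference, the read loop is essentially slice pagination: fetch up to a maximum number of columns, then restart the next slice from the last column seen. A self-contained sketch over an in-memory sorted row (fetch_slice here is a stand-in for a Cassandra get_slice call, not a Hector API):

```python
import bisect

def fetch_slice(row, start, max_cols):
    """Stand-in for a Cassandra get_slice: return up to max_cols
    column names >= start from a sorted row."""
    i = bisect.bisect_left(row, start) if start is not None else 0
    return row[i:i + max_cols]

def read_entire_row(row, max_cols=100_000):
    """Read a whole row slice by slice, restarting each slice at the
    last column of the previous one (what a full-row iteration with
    null start/end boils down to). max_cols must be > 1."""
    out, start = [], None
    while True:
        cols = fetch_slice(row, start, max_cols)
        if start is not None:
            cols = cols[1:]        # drop the overlapping restart column
        if not cols:
            return out
        out.extend(cols)
        start = out[-1]            # next slice starts at the last column

row = [(t, "src") for t in range(250)]   # toy row of composite names
assert read_entire_row(row, max_cols=100) == row
```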

Ok, in my previous email I only looked at the rows with the maximum size, which gave me a 20% difference. In truth it's less.


Type 1

No. cols    File     ExecTime (sec)   ExecTime (sec)   ExecTime
returned    size     ID:Timestamp     Timestamp:ID     diff %
 387174     25M        12.59             8.60           31.68
1005113     66M        31.83            21.84           31.38
 579633     38M        18.07            12.46           31.03
1217634     81M        33.77            24.65           26.99
 376303     24M        12.32            10.36           15.94
2493007    169M        68.68            59.93           12.74
6298275    428M       183.28           147.57           19.48
2777962    189M        83.16            73.30           11.86
6138047    416M       170.88           155.83            8.81
3193450    216M        93.26            82.84           11.18
2302928    155M        69.91            61.62           11.85

Avg = 19.3 %


Type 2

No. cols    File     ExecTime (sec)   ExecTime (sec)   ExecTime
returned    size     ID:Timestamp     Timestamp:ID     diff %
 350468     40M        12.92            13.12           -1.59
1303797    148M        43.33            38.98           10.04
 697763     79M        26.78            22.05           17.66
 825414     94M        33.50            26.69           20.31
  55075    6.2M         2.97             2.13           28.15
1873775    213M        72.37            51.12           29.37
3982433    453M       147.04           110.71           24.71
1546491    176M        54.86            42.13           23.21
4117491    468M       143.10           114.62           19.90
1747506    199M        63.23            63.05            0.28
2720160    308M        96.06            82.47           14.14

Avg = 16.9 %


Type 3

No. cols    File     ExecTime (sec)   ExecTime (sec)   ExecTime
returned    size     ID:Timestamp     Timestamp:ID     diff %
192667     7.2M         5.88             6.50          -10.49
210593     7.9M         6.33             5.57           12.06
144677     5.4M         3.78             3.74            1.22
207706     7.7M         6.33             5.74            9.28
235937     8.7M         6.34             6.11            3.64
159985     6.0M         4.23             3.93            7.07
134859     5.5M         3.91             3.38           13.46
 70545     2.9M         2.96             2.08           29.84
 98487     3.9M         4.04             2.62           35.22
205979     8.2M         7.35             5.67           22.87
166045     6.2M         5.12             3.99           22.10

Avg = 13.3 %
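For anyone checking my arithmetic, the three averages are just the mean of each table's diff column (reproduced here to within rounding):

```python
# Per-row "ExecTime diff %" columns, copied from the tables above.
type1 = [31.68, 31.38, 31.03, 26.99, 15.94, 12.74, 19.48, 11.86, 8.81, 11.18, 11.85]
type2 = [-1.59, 10.04, 17.66, 20.31, 28.15, 29.37, 24.71, 23.21, 19.9, 0.28, 14.14]
type3 = [-10.49, 12.06, 1.22, 9.28, 3.64, 7.07, 13.46, 29.84, 35.22, 22.87, 22.1]

def avg(xs):
    return sum(xs) / len(xs)

# Means agree with the stated 19.3 / 16.9 / 13.3 to within rounding.
for diffs, stated in [(type1, 19.3), (type2, 16.9), (type3, 13.3)]:
    assert abs(avg(diffs) - stated) < 0.1
```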


Some background on why I did the tests.

Data set:
I have ~300,000 data sources. Each data source outputs values for several variables, ~12 variables per data source. This gives ~4 million independent time series (let's call them streams) that need to go into Cassandra. The streams give me (timestamp, value) pairs at highly different rates, depending on the data source they come from and the operating conditions. This translates into very different row lengths if a single time-bucket size is used across all streams.

The data sources can be further grouped into types (several data sources can share the same type). There are ~100 types.

Use case:
The system
- will serve a web dashboard
- should allow queries at the highest granularity over short periods of time (up to 4-8 hrs) on any individual stream or grouping of streams
- should allow on-demand (offline) analytics over long periods of time (up to 1 year), with (real-time) querying on the analytics data afterwards

Cassandra schemas used so far:
Schema 1: 1 row for each of the 3 million streams. Each row is a 4 hr time bucket.
Schema 2: 1 row for each of the ~100 types. Each row is a 24 hr time bucket.

Now I'm planning to use only Schema 2, with an 8 hr time bucket, to better reconcile the rows that get very long with the ones that don't.
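The bucketing itself is just integer division of the timestamp. A sketch of the row key under the planned 8 hr variant of Schema 2 (make_row_key is illustrative, not code from my system):

```python
BUCKET_US = 8 * 3600 * 1_000_000   # 8 hr bucket, in microseconds

def make_row_key(type_id: str, ts_us: int) -> str:
    """Row key = type plus the 8 hr bucket the timestamp falls in, so
    all columns for one type and one 8 hr window land in one row."""
    return f"{type_id}:{ts_us // BUCKET_US}"

# Two timestamps 1 hr apart share a row; 9 hrs apart they do not.
t0 = 1_333_190_000_000_000
assert make_row_key("type42", t0) == make_row_key("type42", t0 + 3600 * 1_000_000)
assert make_row_key("type42", t0) != make_row_key("type42", t0 + 9 * 3600 * 1_000_000)
```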

Cheers,
Alex


On Sat, Mar 31, 2012 at 9:35 PM, aaron morton <aaron@thelastpickle.com> wrote:
Can you post the details of the queries you are running, including the methodology of the tests ? 

(Here is the methodology I used to time queries previously http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/)

Cheers



-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 31/03/2012, at 1:29 AM, Alexandru Sicoe wrote:

Hi guys,
 I am consistently seeing a 20% improvement in query retrieval times if I use the composite comparator "Timestamp:ID" instead of "ID:Timestamp", where Timestamp is a Long and ID is a ~100-character string. I am retrieving all columns (~1 million) from a single row. Why is this happening?

Cheers,
Alex