incubator-cassandra-user mailing list archives

From "Pushkar Prasad"
Subject Overheads in fetching many (500K) rows for a partitionID
Date Wed, 20 Mar 2013 06:11:52 GMT
With the following schema:


- TimeStamp
- Device ID
- Device Name
- Device Owner
- Device Color

PKEY (TimeStamp, DeviceID)

Each record is 40 bytes.


I'm trying to fetch all the rows for a particular TimeStamp (partitionID). 


Select * from schema where TimeStamp = '.'


There are 500K such rows per timestamp. I have found that paginating gives
much better throughput than trying to fetch everything in one shot. Fetching
500K rows (40 MB) with a page size of 1000 or 10000 took around 25-30
seconds. I have the following questions:


(A) Will all the data that I'm querying be stored sequentially on disk for a
particular TimeStamp (and yes, I've run the compact command)?

(B) If the answer to the first question is yes, why am I not able to get
throughput equal to the disk's (40 MB/s)? Please note that I'm able to
retrieve 40 MB worth of data in 25-30 seconds, which translates to barely
1.5 MB/s.

(C) If the answer to the first question is yes, could I speed up the
response any further?

(D) Is serialization / deserialization the culprit for the low throughput?
If so, can something be done to avoid it altogether?
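
For reference, the arithmetic behind the figures above can be sketched in a few lines of Python. All inputs are taken as stated in this message; the variable and function names are only illustrative. The raw payload (row count times record size) is computed separately, since how the 40 MB figure was derived is not stated here.

```python
# Figures as stated in the message; names are illustrative only.
ROWS = 500_000
RECORD_BYTES = 40
TRANSFERRED_MB = 40.0          # data volume quoted in the message
ELAPSED_SECONDS = (25.0, 30.0) # observed fetch time range
DISK_MB_PER_S = 40.0           # disk throughput quoted in the message


def throughput_mb_per_s(megabytes: float, seconds: float) -> float:
    """Average throughput for a transfer of `megabytes` in `seconds`."""
    return megabytes / seconds


# Raw payload implied by row count and record size alone (decimal MB).
raw_payload_mb = ROWS * RECORD_BYTES / 1_000_000

fastest = throughput_mb_per_s(TRANSFERRED_MB, ELAPSED_SECONDS[0])
slowest = throughput_mb_per_s(TRANSFERRED_MB, ELAPSED_SECONDS[1])

# Average time spent per row, in microseconds, for the fastest and slowest runs.
us_per_row_fastest = ELAPSED_SECONDS[0] / ROWS * 1_000_000
us_per_row_slowest = ELAPSED_SECONDS[1] / ROWS * 1_000_000

print(f"raw payload: {raw_payload_mb:.1f} MB")
print(f"throughput: {slowest:.2f} to {fastest:.2f} MB/s")
print(f"share of disk rate: {slowest / DISK_MB_PER_S:.1%} to {fastest / DISK_MB_PER_S:.1%}")
print(f"time per row: {us_per_row_fastest:.0f} to {us_per_row_slowest:.0f} microseconds")
```

Looking at the time per row rather than MB/s may help with question (D), since per-row costs dominate when records are this small.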


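The pagination mentioned above can be sketched roughly as follows. This is a minimal sketch, not the exact client code used: it assumes each page is fetched by restricting on the second primary-key column (DeviceID) after the last row seen, and it replaces the database with an in-memory stand-in so the loop can be run on its own. `iter_partition`, `make_fake_fetch` and the row shape are hypothetical names introduced for illustration.

```python
from bisect import bisect_right
from typing import Callable, Iterator, List, Optional, Tuple

# Assumed query shape per page (the first page omits the DeviceID restriction):
#   SELECT * FROM schema WHERE TimeStamp = ? AND DeviceID > ? LIMIT ?
Row = Tuple[str, str]  # (device_id, payload); hypothetical row shape
FetchPage = Callable[[Optional[str], int], List[Row]]


def iter_partition(fetch_page: FetchPage, page_size: int) -> Iterator[Row]:
    """Yield every row of one partition, fetching page_size rows at a time."""
    last_id: Optional[str] = None
    while True:
        page = fetch_page(last_id, page_size)
        yield from page
        if len(page) < page_size:
            return  # a short (or empty) page means the partition is exhausted
        last_id = page[-1][0]  # resume after the last DeviceID seen


def make_fake_fetch(rows: List[Row]) -> FetchPage:
    """In-memory stand-in for the database, ordered by device_id."""
    ordered = sorted(rows)
    ids = [row[0] for row in ordered]

    def fetch_page(after_id: Optional[str], limit: int) -> List[Row]:
        start = 0 if after_id is None else bisect_right(ids, after_id)
        return ordered[start:start + limit]

    return fetch_page


rows = [(f"dev{i:06d}", "x") for i in range(2500)]
fetched = list(iter_partition(make_fake_fetch(rows), page_size=1000))
print(len(fetched))
```

With a real client, `fetch_page` would issue the query and return the decoded rows; every page then costs one round trip plus per-row decoding, which is where the page size matters.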
