incubator-cassandra-user mailing list archives

From csharpplusproject <csharpplusproj...@gmail.com>
Subject Re: Cassandra Database Modeling
Date Tue, 12 Apr 2011 03:34:51 GMT
Hi Aaron,

Yes, of course it helps, I am starting to get a flavor of Cassandra --
thank you very much!

First of all, by 'interactive' queries, are you referring to 'real-time'
queries? (Meaning that the experiment data is 'streaming': it needs to be
stored, and then the query needs to be run in real time?)

Looking at the design of the particle pairs:

- key: experiment_id.time_interval
- column name: pair_id 
- column value: distance, angle, other data packed together as JSON or
some other format

A couple of questions:

(1) Will a query such as pairID[ experiment_id.time_interval ]
basically return an array of all pairIDs for the experiment, where each
item is a 'packed' JSON?
(2) Would it be possible, rather than returning the whole JSON object
for every pairID, to get (say) only the distance?
(3) Would it be possible to easily update certain 'pairIDs' with new
values (for example, update pairIDs = {2389, 93434} with new distance
values)?

Looking at the design of the distance CF (for example):

This is VERY INTERESTING. Basically you are suggesting a design that
will save the actual distance between each pair of particles, and will
allow queries where we can find all pairIDs (for an experiment, in a
time_interval) that meet a certain distance criterion. VERY, VERY
INTERESTING!

A couple of questions:

(1) Will a query such as distanceCF[ experiment_id.time_interval ]
basically return an array of all 'zero_padded_distance.pair_id' elements
for the experiment?
(2) In such a case, I will get (presumably) a python list where every
item is a string (and I will need to process it)?
(3) Given the fact that we're doing a slice on millions of columns (?),
any idea how fast such an operation would be?


Just to make sure I understand, is it true that in both situations, the
query complexity is basically O(1) since it's simply a HASH?


Thank you for all of your help!

Shalom.

-----Original Message-----
From: aaron morton <aaron@thelastpickle.com>
Reply-to: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Cassandra Database Modeling
Date: Tue, 12 Apr 2011 10:43:42 +1200

The tricky part here is the level of flexibility you want for the
querying. In general you will want to denormalise to support the read
queries.  


If your queries are not interactive you may be able to use Hadoop /
Pig / Hive e.g. http://www.datastax.com/products/brisk In which case you
can probably have a simpler data model where you spend less effort
supporting the queries. But it sounds like you need interactive queries
as part of the experiment.


You could store the data per pair in a standard CF (let's call it the
pair CF) as follows:


- key: experiment_id.time_interval
- column name: pair_id
- column value: distance, angle, other data packed together as JSON or
some other format


This would support a basic record of what happened: for each time
interval you can get the list of all pairs and read their data.
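
A minimal sketch of the pair CF layout described above, using a plain Python dict as an in-memory stand-in for the column family. No Cassandra client is called; the names pair_row_key, pack_pair_value and pair_cf, and the sample values, are hypothetical and only for illustration.

```python
import json

def pair_row_key(experiment_id, time_interval):
    # Row key in the form experiment_id.time_interval, e.g. "2.5".
    return f"{experiment_id}.{time_interval}"

def pack_pair_value(distance, angle):
    # Pack the per-pair measurements into a single JSON column value.
    return json.dumps({"distance": distance, "angle": angle}, sort_keys=True)

# In-memory stand-in for the pair CF: {row key: {column name: column value}}.
pair_cf = {}
row_key = pair_row_key(2, 5)
pair_cf.setdefault(row_key, {})["2389"] = pack_pair_value(42.5, 1.25)
pair_cf[row_key]["93434"] = pack_pair_value(310.0, 0.4)

# Reading one time interval returns every pair; the client unpacks the JSON.
unpacked = {pid: json.loads(val) for pid, val in pair_cf[row_key].items()}
```

In this sketch the column value is an opaque string, so picking out only the distance happens on the client after the read.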


To support your spatial queries you could use two standard CFs
as follows:


distance CF:
- key: experiment_id.time_interval
- column name: zero_padded_distance.pair_id
- column value: empty or the angle 


angle CF :
- key: experiment_id.time_interval
- column name: zero_padded_angle.pair_id
- column value: empty or the distance


(two pairs can have the same distance and/or angle in same time slice)


Here we are using the column name as a compound value, and I am assuming
it can be byte ordered. So for distance the column name looks
something like 000500.123456789. You would then use the Byte comparator
(or similar) for the columns.
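
A small sketch of why the zero padding matters, assuming distances and pair ids are non-negative integers that fit in 6 and 9 digits respectively (widths taken from the 000500.123456789 example). The helper name distance_column_name is hypothetical.

```python
def distance_column_name(distance, pair_id, dist_width=6, id_width=9):
    # Assumes non-negative integers that fit the chosen widths.
    return f"{distance:0{dist_width}d}.{pair_id:0{id_width}d}"

name = distance_column_name(500, 123456789)

# With padding, plain string (byte) order matches numeric order.
distances = [1000, 99, 500]
padded = sorted(distance_column_name(d, 1) for d in distances)

# Without padding, string order puts 1000 before 500 before 99.
unpadded = sorted(f"{d}.1" for d in distances)
```

Fractional distances would need a fixed-point encoding chosen up front (for example, scaling to integers), which is an assumption beyond what is described here.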


To find all of the particles for experiment 2 at t5 where distance is <
100 you would use a get_slice
(see http://wiki.apache.org/cassandra/API or your higher level client
docs) against the key "2.5" with a SliceRange start at
"000000.000000000" and finish at "000100.999999999". Once you have this
list of columns you can either filter client side for the angle or issue
another query for the particles inside the angle range. Then join the
two results client side using the pair_id returned in the column names. 
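
The slice-then-join steps above can be sketched in plain Python, with sorted lists standing in for the byte-ordered columns of one row in each CF. This does not call any Cassandra API; get_slice here is a hypothetical in-memory function with inclusive bounds, and the sample column names are made up. Note that a finish of "000100.999999999" would also return pairs whose distance is exactly 100, so for a strict "less than 100" on integer distances the sketch ends the range at "000099.999999999".

```python
from bisect import bisect_left, bisect_right

def get_slice(sorted_names, start, finish):
    # In-memory stand-in for a column slice: names in [start, finish].
    lo = bisect_left(sorted_names, start)
    hi = bisect_right(sorted_names, finish)
    return sorted_names[lo:hi]

def pair_ids(names):
    # The pair_id is the part of the compound name after the first dot.
    return {name.split(".", 1)[1] for name in names}

# Columns of row "2.5" in the distance CF and the angle CF (sample data).
distance_row = sorted(["000042.000002389", "000099.000093434",
                       "000100.000000007", "000250.000000011"])
angle_row = sorted(["000030.000002389", "000075.000093434",
                    "000080.000000011"])

# distance < 100 and angle > 60, for integer values.
near = get_slice(distance_row, "000000.000000000", "000099.999999999")
wide = get_slice(angle_row, "000061.000000000", "999999.999999999")

# Join the two results client side on pair_id.
matches = pair_ids(near) & pair_ids(wide)
```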


By using the same key for all 3 CFs, all the data for a time slice will
be stored on the same nodes. You can potentially spread this around by
using slightly different keys so they may hash to different areas of the
cluster, e.g. experiment_id.time_interval."distance"
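
As an illustration of the key variation, the sketch below builds the suffixed keys and hashes them with MD5, assuming a hash-based partitioner that derives placement from an MD5 digest of the row key. The helper names row_key and md5_token are hypothetical, and real placement also depends on how tokens are assigned to nodes.

```python
import hashlib

def row_key(experiment_id, time_interval, suffix=None):
    # "2.5" for the shared key, "2.5.distance" for a per-CF variant.
    base = f"{experiment_id}.{time_interval}"
    return base if suffix is None else f"{base}.{suffix}"

def md5_token(key):
    # Illustrative only: the MD5 digest of the key read as an integer.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

keys = [row_key(2, 5), row_key(2, 5, "distance"), row_key(2, 5, "angle")]
tokens = {k: md5_token(k) for k in keys}
```

Different digests mean the rows may land on different nodes, but nothing guarantees that they will.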


Data volume is not a concern, and it's not possible to talk about
performance until you have an idea of the workload and required
throughput. But writes are fast, and I think your reads would be fast as
well, as the row data for distance and angle will not change, so caches
will be useful.
 


Hope that helps. 
Aaron


On 12 Apr 2011, at 03:01, Shalom wrote:

> I would like to save statistics on 10,000,000 (ten million) pairs of
> particles, how they relate to one another at any given point in time.
> 
> So suppose that within a total experiment time of T1..T1000 (assume that T1
> is when the experiment starts, and T1000 is the time when the experiment
> ends) I would like, for each pair of particles, to measure the relationship
> in every Tn -- T(n+1) interval:
> 
> T1..T2 (this is the first interval)
> 
> T2..T3
> 
> T3..T4
> 
> ......
> 
> ......
> 
> T9,999,999..T10,000,000 (this is the last interval)
> 
> For each such particle pair (there are 10,000,000 pairs) I would like to
> save some figures (such as distance, angle, etc.) for each interval
> [ Tn..T(n+1) ]
> 
> Once saved, the query I will be using to retrieve this data is as follows:
> "give me all particle pairs on time interval [ Tn..T(n+1) ] where the
> distance between the two particles is smaller than X and the angle between
> the two particles is greater than Y". Meaning, the query will always take
> place for all particle pairs on a certain interval of time.
> 
> How would you model this in Cassandra, so that the writes/reads are
> optimized? Given the database size involved, can you recommend a suitable
> solution? (I have been recommended both MongoDB and Cassandra.)
> 
> I should mention that the data does change often -- we run many such
> experiments (different particle sets / thousands of experiments) and would
> need very decent read/write performance.
> 
> Is Cassandra suitable for this type of work?
> 
> 



