cassandra-user mailing list archives

From Steven Yen-Liang Su <xpste...@gmail.com>
Subject Re: Cassandra Database Modeling
Date Wed, 13 Apr 2011 04:16:00 GMT
>
> Is there a limit to the size that can be stored in one 'cell' (by 'cell' I
> mean the intersection between a *key* and a *data column*)? Is there a
> limit to the size of data of one *key*? One *data column*?
>

http://wiki.apache.org/cassandra/CassandraLimitations

Cassandra partitions data by row key, so a whole row lives on a single node
(and on each of its replicas). If you want to put all pairs into the same
row, make sure that row fits comfortably within the disk space of one node.
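
As a rough way to size such a row, the sketch below multiplies the column
count by the bytes per column. The 10,000,000 pairs and the 16-character
column name come from this thread; the 15-byte per-column overhead and the
function name are my own assumptions for illustration, and indexes, bloom
filters and replication are ignored.

```python
def estimate_row_bytes(num_columns, name_bytes, value_bytes, overhead_bytes=15):
    """Rough size of one row: columns * (name + value + per-column overhead).

    overhead_bytes is an assumed figure (timestamp, flags, length fields),
    not a number taken from this thread.
    """
    return num_columns * (name_bytes + value_bytes + overhead_bytes)


# One distance-CF row: 10,000,000 pairs, names like "000500.123456789",
# empty column values.
row_bytes = estimate_row_bytes(10_000_000, len("000500.123456789"), 0)
row_mb = row_bytes / 1_000_000
print(f"about {row_mb:.0f} MB per row before replication")
```

Under those assumptions a single time slice stays in the hundreds of
megabytes, so the per-node disk budget matters mostly once many slices and
experiments accumulate.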


>
> Thanks in advance for any help / guidance.
>
> -----Original Message-----
> *From*: aaron morton <aaron@thelastpickle.com>
> *Reply-to*: user@cassandra.apache.org
> *To*: user@cassandra.apache.org
> *Subject*: Re: Cassandra Database Modeling
> *Date*: Wed, 13 Apr 2011 10:14:21 +1200
>
> Yes, interactive == real-time queries. Hadoop-based techniques are for
> non-time-critical queries, but they do have greater analytical capabilities.
>
> particle_pairs: 1) Yes and no and sort of. Under the hood the get_slice api
> call will be used by your client library to pull back chunks of (ordered)
> columns. Most client libraries abstract away the chunking for you.
>
> 2) If you are using a packed structure like JSON then no, Cassandra will
> have no idea what you've put in the columns other than bytes. It really
> depends on how much data you have per pair, but generally it's easier to
> pull back more data than try to get exactly what you need. Downside is you
> have to update all the data.
>
> 3) No, you would need to update all the data for the pair. I was assuming
> most of the data was written once, and that your simulation had something
> like a stop-the-world phase between time slices where state was dumped and
> then read to start the next interval. You could either read it first, or we
> can come up with something else.
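
To make points 2) and 3) concrete, here is a minimal sketch of the
read-modify-write cycle a packed JSON value forces on the client. The dict is
only an in-memory stand-in for one row of the pair CF, and the helper name
and sample values are hypothetical; no real client library is involved.

```python
import json

# Stand-in for one row of the pair CF: column name (pair_id) -> packed JSON.
row = {
    "2389": json.dumps({"distance": 12.5, "angle": 30.0}),
    "93434": json.dumps({"distance": 7.0, "angle": 95.0}),
}


def update_distance(row, pair_id, new_distance):
    """Read the packed value, change one field, write the whole value back."""
    data = json.loads(row[pair_id])   # the store only sees opaque bytes
    data["distance"] = new_distance   # so the change happens client side
    row[pair_id] = json.dumps(data)   # and the full column value is rewritten


update_distance(row, "2389", 40.0)
```

If partial updates turn out to be common, splitting the fields into separate
columns would avoid the read step, at the cost of more columns per row.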
>
> distance_cf: 1) The query would return a list of columns, which have a name
> and value (as well as a timestamp and ttl).
>
> 2) Depends on the client library; if using Python, go for
> https://github.com/pycassa/pycassa It will return objects.
>
> 3) Returning millions of columns is going to be slow; it would also be slow
> using an RDBMS. Creating millions of objects in Python is going to be slow.
> You would need to have a better idea of what queries you will actually want
> to run to see if it's *too* slow. If it is, one approach is to store the
> particles at the same distance in the same column, so you need to read fewer
> columns. Again, it depends on how your sim works.
>
> Time complexity depends on the number of columns read. Finding a row will
> not be O(1), as it may have to read from several files. Writes are more
> constant than reads. But remember, you can have a lot of io and cpu power in
> your cluster.
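
One possible reading of "store the particles at the same distance in the same
column" is sketched below: the column name becomes the zero-padded distance
alone and the value carries every pair_id at that distance. The function
name, the comma-separated value format and the truncation to whole distance
units are my assumptions, not something specified in the thread.

```python
from collections import defaultdict


def bucket_by_distance(pairs, width=6):
    """Group pair ids under one column per whole-number distance.

    pairs: iterable of (pair_id, distance) tuples.
    Returns {zero_padded_distance: "id1,id2,..."}.
    """
    buckets = defaultdict(list)
    for pair_id, distance in pairs:
        # int() truncates, so 500.9 lands in the 000500 bucket.
        buckets[str(int(distance)).zfill(width)].append(str(pair_id))
    return {name: ",".join(ids) for name, ids in buckets.items()}


columns = bucket_by_distance([(1, 500), (2, 500.9), (3, 42)])
```

Three pairs collapse into two columns here; with many pairs per distance the
slice returns far fewer columns, and the client splits each value itself.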
>
> Best advice is to jump in and see if the data model works for you at a
> small single-node scale; most performance issues can be solved.
>
> Aaron
> On 12 Apr 2011, at 15:34, csharpplusproject wrote:
>
> Hi Aaron,
>
> Yes, of course it helps, I am starting to get a flavor of *Cassandra* --
> thank you very much!
>
> First of all, by 'interactive' queries, are you referring to 'real-time'
> queries? (meaning, where experiments data is 'streaming', data needs to be
> stored and following that, the query needs to be run in real time)?
>
> *Looking at the design of the **particle pairs**:*
>
> - key: experiment_id.time_interval
> - column name: pair_id
> - column value: distance, angle, other data packed together as JSON or some
> other format
>
> *A couple of questions:*
>
> (1) Will a query such as *pairID[ *experiment_id.time_interval* ]*
> basically return an array of all pairIDs for the experiment, where each item
> is a 'packed' JSON?
> (2) Would it be possible, rather than returning the whole JSON object per
> every pairID, to get (say) only the distance?
> (3) Would it be possible to easily update certain 'pairIDs' with new values
> (for example, update pairIDs = {2389, 93434} with new *distance* values)?
>
> *Looking at the design of the **distance CF* (for example)*:*
>
> this is VERY INTERESTING. basically you are suggesting a design that will
> save the actual distance between each pair of particles, and will allow
> queries where we can find all pairIDs (for an experiment, on time_interval)
> that meet a certain distance criteria. VERY, VERY INTERESTING!
>
> *A couple of questions:*
>
> (1) Will a query such as *distanceCF[ *experiment_id.time_interval* ]*
> basically return an array of all '*zero_padded_distance.pair_id*' elements
> for the experiment?
> (2) In such a case, I will get (presumably) a python list where every item
> is a string (and I will need to process it)?
> (3) Given the fact that we're doing a slice on millions of columns (?), any
> idea how fast such an operation would be?
>
>
> Just to make sure I understand, is it true that in both situations, the
> query complexity is basically O(1) since it's simply a HASH?
>
>
> Thank you for all of your help!
>
> Shalom.
>
> -----Original Message-----
> *From*: aaron morton <aaron@thelastpickle.com>
> *Reply-to*: user@cassandra.apache.org
> *To*: user@cassandra.apache.org
> *Subject*: Re: Cassandra Database Modeling
> *Date*: Tue, 12 Apr 2011 10:43:42 +1200
>
> The tricky part here is the level of flexibility you want for the querying.
> In general you will want to denormalise to support the read queries.
>
> If your queries are not interactive you may be able to use Hadoop / Pig /
> Hive e.g. http://www.datastax.com/products/brisk In which case you can
> probably have a simpler data model where you spend less effort supporting
> the queries. But it sounds like you need interactive queries as part of the
> experiment.
>
> You could store the data per pair in a standard CF (let's call it the pair
> cf) as follows:
>
> - key: experiment_id.time_interval
> - column name: pair_id
> - column value: distance, angle, other data packed together as JSON or some
> other format
>
> This would support a basic record of what happened, for each time interval
> you can get the list of all pairs and read their data.
>
> To support your spatial queries you could use two standard CFs as
> follows:
>
> distance CF:
> - key: experiment_id.time_interval
> - column name: zero_padded_distance.pair_id
> - column value: empty or the angle
>
> angle CF:
> - key: experiment_id.time_interval
> - column name: zero_padded_angle.pair_id
> - column value: empty or the distance
>
> (two pairs can have the same distance and/or angle in same time slice)
>
> Here we are using the column name as a compound value, and I am assuming
> the names can be byte ordered. So for distance the column name looks
> something like 000500.123456789. You would then use the Byte comparator (or
> similar) for the columns.
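
The reason for the zero padding can be checked without a cluster: byte-wise
ordering only agrees with numeric ordering when every name has the same
width. The sketch below assumes whole-number distances, a 6-digit distance
field and a 9-digit pair_id, matching the 000500.123456789 example; the
function name and sample values are hypothetical.

```python
def column_name(distance, pair_id, dist_width=6, id_width=9):
    """Build a 'zero_padded_distance.pair_id' column name."""
    return f"{distance:0{dist_width}d}.{pair_id:0{id_width}d}"


samples = [(500, 123456789), (99, 7), (1000, 42)]

# Sorting the encoded names mimics what a byte-ordered comparator would do.
padded_order = sorted(column_name(d, p).encode("ascii") for d, p in samples)
unpadded_order = sorted(f"{d}.{p}".encode("ascii") for d, p in samples)

print(padded_order)    # distances come out as 99, 500, 1000
print(unpadded_order)  # "1000.42" sorts before "500..." and "99.7"
```

Without padding, distance 1000 sorts ahead of 99, so a slice for
"distance < 100" would return the wrong columns.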
>
> To find all of the particles for experiment 2 at t5 where distance is < 100
> you would use a get_slice (see http://wiki.apache.org/cassandra/API or
> your higher level client docs) against the key "2.5" with a SliceRange start
> at "000000.000000000" and finish at "000100.999999999". Once you have this
> list of columns you can either filter client side for the angle or issue
> another query for the particles inside the angle range. Then join the two
> results client side using the pair_id returned in the column names.
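
A small in-memory sketch of that two-slice-then-join flow follows. The
`get_slice` function here is a local stand-in built on a sorted list, not the
Thrift call, and the sample column names and the angle threshold of 45 are
hypothetical. Note that the finish bound quoted above, "000100.999999999",
also matches a distance of exactly 100; "000099.999999999" would be the
strict bound.

```python
from bisect import bisect_left, bisect_right


def get_slice(sorted_names, start, finish):
    """Return the names in [start, finish], like a byte-ordered column slice."""
    lo = bisect_left(sorted_names, start)
    hi = bisect_right(sorted_names, finish)
    return sorted_names[lo:hi]


def pair_ids(names):
    """Extract the pair_id part of 'zero_padded_value.pair_id' names."""
    return {name.split(".", 1)[1] for name in names}


# Stand-ins for the distance and angle rows stored under key "2.5".
distance_row = sorted(["000050.000000001", "000099.000000002",
                       "000100.000000003", "000250.000000004"])
angle_row = sorted(["000010.000000001", "000045.000000002",
                    "000080.000000003", "000090.000000004"])

near = get_slice(distance_row, "000000.000000000", "000100.999999999")
wide = get_slice(angle_row, "000045.000000000", "999999.999999999")

# Join client side on the pair_id embedded in the column names.
matches = sorted(pair_ids(near) & pair_ids(wide))
```

Filtering the first result client side (when the angle is stored as the
column value) saves the second query, which may matter when the distance
slice is already small.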
>
> By using the same key for all 3 CF's all the data for a time slice will be
> stored on the same nodes. You can potentially spread this around by using
> slightly different keys so they may hash to different areas of the cluster.
> e.g. experiment_id.time_interval."distance"
>
> Data volume is not a concern, and it's not possible to talk about
> performance until you have an idea of the workload and required throughput.
> But writes are fast and I think your reads would be fast as well as the row
> data for distance and angle will not change so caches will be useful.
>
> Hope that helps.  Aaron
>
> On 12 Apr 2011, at 03:01, Shalom wrote:
>
> I would like to save statistics on 10,000,000 (ten million) pairs of
> particles, how they relate to one another at any given point in time.
>
> So suppose that within a total experiment time of T1..T1000 (assume that T1
> is when the experiment starts, and T1000 is the time when the experiment
> ends) I would like, per each pair of particles, to measure the relationship
> between every Tn -- T(n+1) interval:
>
> T1..T2 (this is the first interval)
>
> T2..T3
>
> T3..T4
>
> ......
>
> ......
>
> T999..T1000 (this is the last interval)
>
> For each such particle pair (there are 10,000,000 pairs) I would like to
> save some figures (such as distance, angle etc.) on each interval of [
> Tn..T(n+1) ]
>
> Once saved, the query I will be using to retrieve this data is as follows:
> "give me all particle pairs on time interval [ Tn..T(n+1) ] where the
> distance between the two particles is smaller than X and the angle between
> the two particles is greater than Y". Meaning, the query will always take
> place for all particle pairs on a certain interval of time.
>
> How would you model this in Cassandra, so that the writes/reads are
> optimized? Given the database size involved, can you recommend a suitable
> solution? (Both MongoDB and Cassandra have been recommended to me.)
>
> I should mention that the data does change often -- we run many such
> experiments (different particle sets / thousands of experiments) and would
> need a very decent performance of reads/writes.
>
> Is Cassandra suitable for this type of work?
>
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-Database-Modeling-tp6261778p6261778.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
>
