From: Steven Yen-Liang Su
Date: Wed, 13 Apr 2011 12:16:00 +0800
Subject: Re: Cassandra Database Modeling
To: user@cassandra.apache.org

> Is there a limit to the size that can be stored in one 'cell' (by 'cell' I
> mean the intersection between a *key* and a *data column*)? Is there a
> limit to the size of data of one *key*? One *data column*?

http://wiki.apache.org/cassandra/CassandraLimitations

Data in Cassandra is partitioned by row key; therefore, if you want to put all
pairs into the same row, you should keep the resulting row size (and per-node
disk usage) in mind.

> Thanks in advance for any help / guidance.
>
> -----Original Message-----
> *From*: aaron morton
> *Reply-to*: user@cassandra.apache.org
> *To*: user@cassandra.apache.org
> *Subject*: Re: Cassandra Database Modeling
> *Date*: Wed, 13 Apr 2011 10:14:21 +1200
>
> Yes, for interactive == real-time queries. Hadoop-based techniques are for
> non-time-critical queries, but they do have greater analytical capabilities.
>
> particle_pairs: 1) Yes, and no, and sort of.
> Under the hood, the get_slice API call will be used by your client library
> to pull back chunks of (ordered) columns. Most client libraries abstract
> away the chunking for you.
>
> 2) If you are using a packed structure like JSON then no, Cassandra will
> have no idea what you've put in the columns other than bytes. It really
> depends on how much data you have per pair, but generally it's easier to
> pull back more data than to try to get exactly what you need. The downside
> is that you have to update all the data.
>
> 3) No, you would need to update all the data for the pair. I was assuming
> most of the data was written once, and that your simulation had something
> like a stop-the-world phase between time slices where state was dumped and
> then read to start the next interval. You could either read it first, or we
> can come up with something else.
>
> distance_cf: 1) The query would return a list of columns, which have a name
> and a value (as well as a timestamp and TTL). 2) Depends on the client
> library; if using Python, go for https://github.com/pycassa/pycassa -- it
> will return objects. 3) Returning millions of columns is going to be slow;
> it would also be slow using an RDBMS, and creating millions of objects in
> Python is going to be slow. You would need a better idea of which queries
> you will actually want to run to see if it's *too* slow. If it is, one
> approach is to store the particles at the same distance in the same column,
> so you need to read fewer columns. Again, it depends on how your sim works.
> Time complexity depends on the number of columns read. Finding a row will
> not be O(1), as it may have to read from several files. Writes are more
> consistent in cost than reads. But remember, you can have a lot of IO and
> CPU power in your cluster.
>
> Best advice is to jump in and see if the data model works for you at a
> small, single-node scale; most performance issues can be solved.
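To make the chunking point above concrete, here is a small self-contained
Python sketch of how a client library pages through an ordered wide row. The
in-memory list stands in for a Cassandra row and `get_slice` is a local
stand-in for the real Thrift call, so the mechanics, not the API, are the
point:

```python
from bisect import bisect_left, bisect_right

# Sketch: how a client library pages through a wide row in chunks on top of
# get_slice. The "row" is an in-memory sorted list of (column_name, value)
# pairs and get_slice is a local stand-in -- not the real Thrift API.

def get_slice(row, start, finish, count):
    """Return up to `count` columns with start <= name <= finish."""
    names = [name for name, _ in row]
    return row[bisect_left(names, start):bisect_right(names, finish)][:count]

def iter_columns(row, start="", finish="\xff", chunk=2):
    """Yield every column in [start, finish], fetching `chunk` at a time."""
    while True:
        cols = get_slice(row, start, finish, chunk)
        for name, value in cols:
            yield name, value
        if len(cols) < chunk:
            return
        # Resume just past the last column seen. (A real client re-requests
        # the boundary column and drops the duplicate; appending a \x00
        # sentinel is the simplest equivalent for string names.)
        start = cols[-1][0] + "\x00"

row = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
print(list(iter_columns(row)))  # all five columns, in order, two per fetch
```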
> Aaron
>
> On 12 Apr 2011, at 15:34, csharpplusproject wrote:
>
> Hi Aaron,
>
> Yes, of course it helps, I am starting to get a flavor of *Cassandra* --
> thank you very much!
>
> First of all, by 'interactive' queries, are you referring to 'real-time'
> queries? (Meaning, where experiment data is 'streaming', data needs to be
> stored, and following that the query needs to be run in real time?)
>
> *Looking at the design of the particle pairs:*
>
> - key: experiment_id.time_interval
> - column name: pair_id
> - column value: distance, angle, and other data packed together as JSON or
>   some other format
>
> *A couple of questions:*
>
> (1) Will a query such as *pairID[experiment_id.time_interval]* basically
> return an array of all pairIDs for the experiment, where each item is a
> 'packed' JSON?
> (2) Would it be possible, rather than returning the whole JSON object for
> every pairID, to get (say) only the distance?
> (3) Would it be possible to easily update certain pairIDs with new values
> (for example, update pairIDs = {2389, 93434} with new *distance* values)?
>
> *Looking at the design of the distance CF* (for example)*:*
>
> This is VERY INTERESTING. Basically you are suggesting a design that will
> save the actual distance between each pair of particles, and will allow
> queries where we can find all pairIDs (for an experiment, in a
> time_interval) that meet a certain distance criterion. VERY, VERY
> INTERESTING!
>
> *A couple of questions:*
>
> (1) Will a query such as *distanceCF[experiment_id.time_interval]*
> basically return an array of all '*zero_padded_distance.pair_id*' elements
> for the experiment?
> (2) In such a case, will I get (presumably) a Python list where every item
> is a string (which I will need to process)?
> (3) Given the fact that we're doing a slice on millions of columns (?),
> any idea how fast such an operation would be?
> Just to make sure I understand: is it true that in both situations the
> query complexity is basically O(1), since it's simply a hash?
>
> Thank you for all of your help!
>
> Shalom.
>
> -----Original Message-----
> *From*: aaron morton
> *Reply-to*: user@cassandra.apache.org
> *To*: user@cassandra.apache.org
> *Subject*: Re: Cassandra Database Modeling
> *Date*: Tue, 12 Apr 2011 10:43:42 +1200
>
> The tricky part here is the level of flexibility you want for the querying.
> In general you will want to denormalise to support the read queries.
>
> If your queries are not interactive you may be able to use Hadoop / Pig /
> Hive, e.g. http://www.datastax.com/products/brisk, in which case you can
> probably have a simpler data model where you spend less effort supporting
> the queries. But it sounds like you need interactive queries as part of the
> experiment.
>
> You could store the data per pair in a standard CF (let's call it the pair
> CF) as follows:
>
> - key: experiment_id.time_interval
> - column name: pair_id
> - column value: distance, angle, and other data packed together as JSON or
>   some other format
>
> This would support a basic record of what happened; for each time interval
> you can get the list of all pairs and read their data.
>
> To support your spatial queries you could use two standard CFs as follows:
>
> distance CF:
> - key: experiment_id.time_interval
> - column name: zero_padded_distance.pair_id
> - column value: empty, or the angle
>
> angle CF:
> - key: experiment_id.time_interval
> - column name: zero_padded_angle.pair_id
> - column value: empty, or the distance
>
> (Two pairs can have the same distance and/or angle in the same time slice.)
>
> Here we are using the column name as a compound value, and I am assuming
> the names can be byte-ordered. So for distance, a column name looks
> something like 000500.123456789. You would then use the BytesType
> comparator (or similar) for the columns.
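A minimal sketch of that zero-padded compound name in plain Python, so that
lexicographic (byte) order on the name matches numeric order on the distance.
The field widths here (6 digits for the distance, 9 for the pair_id) are
illustrative assumptions, not anything fixed by the scheme:

```python
# Sketch: byte-ordered compound column names of the form
# zero_padded_distance.pair_id. The widths (6 and 9 digits) are assumptions.

def column_name(distance, pair_id):
    """Encode so that string order on the name == numeric order on distance."""
    return "%06d.%09d" % (distance, pair_id)

names = sorted(column_name(d, p) for d, p in [(500, 123456789), (99, 42), (100, 7)])
print(names)
# Zero-padding keeps 99 < 100 < 500 in string order; unpadded, "100..."
# would sort before "99...".
```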
> To find all of the particles for experiment 2 at t5 where distance < 100,
> you would use get_slice (see http://wiki.apache.org/cassandra/API or your
> higher-level client docs) against the key "2.5", with a SliceRange starting
> at "000000.000000000" and finishing at "000099.999999999". Once you have
> this list of columns you can either filter client-side for the angle, or
> issue another query for the particles inside the angle range and then join
> the two results client-side using the pair_id returned in the column names.
>
> By using the same key for all 3 CFs, all the data for a time slice will be
> stored on the same nodes. You can potentially spread this around by using
> slightly different keys so they hash to different areas of the cluster,
> e.g. experiment_id.time_interval."distance".
>
> Data volume is not a concern, and it's not possible to talk about
> performance until you have an idea of the workload and required throughput.
> But writes are fast, and I think your reads would be fast as well, as the
> row data for distance and angle will not change, so caches will be useful.
>
> Hope that helps.
>
> Aaron
>
> On 12 Apr 2011, at 03:01, Shalom wrote:
>
> I would like to save statistics on 10,000,000 (ten million) pairs of
> particles: how they relate to one another in space at any given time.
>
> So suppose that within a total experiment time of T1..T1000 (assume that T1
> is when the experiment starts and T1000 is when the experiment ends) I
> would like, for each pair of particles, to measure the relationship over
> every Tn..T(n+1) interval:
>
> T1..T2 (this is the first interval)
> T2..T3
> T3..T4
> ......
> ......
> T999..T1000 (this is the last interval)
>
> For each such particle pair (there are 10,000,000 pairs) I would like to
> save some figures (such as distance, angle, etc.) for each interval
> [ Tn..T(n+1) ].
>
> Once saved, the query I will be using to retrieve this data is as follows:
> "give me all particle pairs on time interval [ Tn..T(n+1) ] where the
> distance between the two particles is smaller than X and the angle between
> the two particles is greater than Y". Meaning, the query will always take
> place over all particle pairs on a certain interval of time.
>
> How would you model this in Cassandra, so that the reads/writes are
> optimized? Given the database size involved, can you recommend a suitable
> solution? (I have been recommended both MongoDB and Cassandra.)
>
> I should mention that the data does change often -- we run many such
> experiments (different particle sets / thousands of experiments) and would
> need very decent read/write performance.
>
> Is Cassandra suitable for this type of work?
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-Database-Modeling-tp6261778p6261778.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
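Aaron's slice-then-join recipe above can be sketched end to end in plain
Python. The dicts below stand in for the distance-CF and angle-CF rows of one
experiment_id.time_interval key, and slice_row imitates get_slice; with
pycassa, a similar ordered name-to-value mapping would come back from a
ColumnFamily.get call. All names and padding widths here are toy assumptions:

```python
# Sketch: the client-side join of the distance and angle slices. The dicts
# stand in for one row of the distance CF and angle CF; slice_row imitates
# get_slice over column names of the form zero_padded_value.pair_id.

def slice_row(row, start, finish):
    """Emulate get_slice: the columns whose name falls in [start, finish]."""
    return {name: row[name] for name in sorted(row) if start <= name <= finish}

def pair_ids(columns):
    """Extract the pair_id half of each zero_padded_value.pair_id name."""
    return {name.split(".", 1)[1] for name in columns}

# Toy rows for key "2.5" (experiment 2, time slice 5); values left empty.
distance_row = {"000050.000000001": "", "000120.000000002": "", "000080.000000003": ""}
angle_row    = {"000100.000000001": "", "000200.000000003": "", "000170.000000002": ""}

near = slice_row(distance_row, "000000.000000000", "000099.999999999")  # distance < 100
wide = slice_row(angle_row,    "000150.000000000", "999999.999999999")  # angle > 150

# Join client-side on pair_id: the pairs satisfying both predicates.
matches = pair_ids(near) & pair_ids(wide)
print(sorted(matches))  # -> ['000000003']
```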