Return-Path: Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: (qmail 18977 invoked from network); 11 Apr 2010 17:35:40 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 11 Apr 2010 17:35:40 -0000 Received: (qmail 86551 invoked by uid 500); 11 Apr 2010 17:35:40 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 86515 invoked by uid 500); 11 Apr 2010 17:35:40 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 86507 invoked by uid 99); 11 Apr 2010 17:35:40 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 11 Apr 2010 17:35:40 +0000 X-ASF-Spam-Status: No, hits=0.6 required=10.0 tests=AWL,RCVD_IN_DNSWL_NONE,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (athena.apache.org: local policy) Received: from [74.125.92.25] (HELO qw-out-2122.google.com) (74.125.92.25) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 11 Apr 2010 17:35:36 +0000 Received: by qw-out-2122.google.com with SMTP id 8so1671638qwh.61 for ; Sun, 11 Apr 2010 10:35:14 -0700 (PDT) MIME-Version: 1.0 Received: by 10.229.233.75 with HTTP; Sun, 11 Apr 2010 10:35:14 -0700 (PDT) In-Reply-To: References: <1270857660.3807.23.camel@malsmith-laptop> Date: Sun, 11 Apr 2010 10:35:14 -0700 Received: by 10.229.234.68 with SMTP id kb4mr4221408qcb.103.1271007314461; Sun, 11 Apr 2010 10:35:14 -0700 (PDT) Message-ID: Subject: Re: How to perform queries on Cassandra? From: Benjamin Black To: user@cassandra.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Ah, I see what you are doing. No, that won't work. The mistake you are making is in thinking a CF is a table and that Cassandra is storing columns. A CF is a namespace and Cassandra stores key/value pairs, where the keys are the row keys and the values are maps: 'usr1': {'password':'foo', 'thing':'blah'} 'usr2': {'password':'bar', 'info':3423452} and so on. The only query you can actually perform is a get on a row key. The column sorting and slicing are just conveniences to avoid retrieving entire rows every time you want a subset of data from them. There is not an interface for querying columns _without_ a row key, which is what you are proposing. Even if there were, there is a serious problem in your data model as you can't tell which part is which; were my password 'vineet' the system you describe would be unable to distinguish between our entries. I am ignoring here the existence of the order-preserving partitioner because I don't think it is generally useful and any attempt to apply it here would be wildly inefficient compared to simply maintaining proper indices. b On Sun, Apr 11, 2010 at 2:33 AM, Lucifer Dignified wrote: > Hi Benjamin > > I'll try to make it more clear to you. > We have a user table with fields 'id', 'username', and 'password'. Now if > use the ideal way to store key/value, like : > username : vineetdaniel > timestamp > password : > timestamp > > second user : > > username: > timestamp > password: > > and so on, here what i assume is that as we cannot make search on values = (as > confirmed by guys on cassandra forums) we are not able to perform robust > 'where' queries. Now what i propose is this. > > Rather than using a static values for column names use values itself and > unique key as identifier. So, the above example when put in as per me wou= ld > be. > > vineetdaniel : vineetdaniel > timestamp > > : > timestamp > > second user > seconduser:seconduser > timestamp > > password:password > timestamp > > By using above methodology we can simply make search on keys itself rathe= r > than going into using different CF's. But to add further, this cannot be > used for every situation. I am still exploring this, and soon will be > updating the group and my blog with information pertaining to this. As > cassandra is new, I think every idea or experience should be shared with = the > community. > > I hope I example is clear this time. Should you have any queries feel fre= e > to revert. > > On Sun, Apr 11, 2010 at 2:01 PM, Benjamin Black wrote: >> >> Sorry, I don't understand your example. >> >> On Sun, Apr 11, 2010 at 12:54 AM, Lucifer Dignified >> wrote: >> > Benjamin I quite agree to you, but what in case of duplicate usernames= , >> > suppose if I am not using unique names as in email id's . If we have >> > duplicacy in usernames we cannot use it for key, so what should be the >> > solution. I think keeping incremental numeric id as key and keeping th= e >> > name >> > and value same in the column family. >> > >> > Example : >> > User1 has password as 123456 >> > >> > Cassandra structure : >> > >> > 1 as key >> > =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 user1 - column name >> > =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 value - user1 >> > =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 123456 - column name >> > =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 value - 123456 >> > >> > I m thinking of doing it this way for my applicaton, this way i can ru= n >> > different sorts of queries too. Any feedback on this is welcome. >> > >> > On Sun, Apr 11, 2010 at 1:13 PM, Benjamin Black wrote: >> >> >> >> You would have a Column Family, not a column for that; let's call it >> >> the Users CF. =A0You'd use username as the row key and have a column >> >> called 'password'. =A0For your example query, you'd retrieve row key >> >> 'usr2', column 'password'. =A0The general pattern is that you create = CFs >> >> to act as indices for each query you want to perform. =A0There is no >> >> equivalent to a relational store to perform arbitrary queries. =A0You >> >> must structure things to permit the queries of interest. >> >> >> >> >> >> b >> >> >> >> On Sat, Apr 10, 2010 at 8:34 PM, dir dir wrot= e: >> >> > I have already read the API spesification. Honestly I do not >> >> > understand >> >> > how to use it. Because there are not an examples. >> >> > >> >> > For example I have a column like this: >> >> > >> >> > UserName =A0=A0 Password >> >> > usr1 =A0 =A0 =A0 =A0 =A0 =A0 =A0=A0 abc >> >> > usr2 =A0 =A0 =A0 =A0 =A0 =A0 =A0=A0 xyz >> >> > usr3 =A0 =A0 =A0 =A0 =A0 =A0 =A0=A0 opm >> >> > >> >> > suppose I want query the user's password using SQL in RDBMS >> >> > >> >> > =A0=A0=A0=A0=A0 Select Password From Users Where UserName =3D "usr2= "; >> >> > >> >> > Now I want to get the password using OODBMS DB4o Object Query=A0 an= d >> >> > Java >> >> > >> >> > =A0=A0=A0=A0 ObjectSet QueryResult =3D db.query(new Predicate() >> >> > =A0=A0=A0=A0 { >> >> > =A0=A0=A0=A0 =A0 =A0 =A0=A0 public boolean match(Users Myusers) >> >> > =A0=A0=A0 =A0 =A0 =A0 =A0 { >> >> > =A0=A0 =A0 =A0 =A0 =A0=A0=A0=A0=A0=A0=A0 return Myuser.getUserName(= ) =3D=3D "usr2"; >> >> > =A0 =A0 =A0 =A0=A0 =A0=A0 } >> >> > =A0=A0=A0=A0 }); >> >> > >> >> > After we get the Users instance in the QueryResult, hence we can ge= t >> >> > the >> >> > usr2's password. >> >> > >> >> > How we perform this query using Cassandra API and Java?? >> >> > Would you tell me please??=A0 Thank You. >> >> > >> >> > Dir. >> >> > >> >> > >> >> > On Sat, Apr 10, 2010 at 11:06 AM, Paul Prescod >> >> > wrote: >> >> >> >> >> >> No. Cassandra has an API. >> >> >> >> >> >> http://wiki.apache.org/cassandra/API >> >> >> >> >> >> On Fri, Apr 9, 2010 at 8:00 PM, dir dir >> >> >> wrote: >> >> >> > Does Cassandra has a default query language such as SQL in RDBMS >> >> >> > and Object Query in OODBMS?=A0 Thank you. >> >> >> > >> >> >> > Dir. >> >> >> > >> >> >> > On Sat, Apr 10, 2010 at 7:01 AM, malsmith >> >> >> > >> >> >> > wrote: >> >> >> >> >> >> >> >> >> >> >> >> It's sort of an interesting problem - in RDBMS one relatively >> >> >> >> simple >> >> >> >> approach would be calculate a rectangle that is X km by Y km wi= th >> >> >> >> User >> >> >> >> 1's >> >> >> >> location at the center.=A0 So the rectangle is UserX - 10KmX , >> >> >> >> UserY-10KmY to >> >> >> >> UserX+10KmX , UserY+10KmY >> >> >> >> >> >> >> >> Then you could query the database for all other users where tha= t >> >> >> >> each >> >> >> >> user >> >> >> >> considered is curUserX > UserX-10Km and curUserX < UserX+10KmX >> >> >> >> and >> >> >> >> curUserY >> >> >> >> > UserY-10KmY and curUserY < UserY+10KmY >> >> >> >> * Not the 10KmX and 10KmY are really a translation from >> >> >> >> Kilometers >> >> >> >> to >> >> >> >> degrees of=A0 lat and longitude=A0 (that you can find on a goog= le >> >> >> >> search) >> >> >> >> >> >> >> >> With the right indexes this query actually runs pretty well. >> >> >> >> >> >> >> >> Translating that to Cassandra seems a bit complex at first - bu= t >> >> >> >> you >> >> >> >> could >> >> >> >> try something like pre-calculating a grid with the right >> >> >> >> resolution >> >> >> >> (like a >> >> >> >> square of 5KM per side) and assign every user to a particular >> >> >> >> grid >> >> >> >> ID. >> >> >> >> That >> >> >> >> way you just calculate with grid ID User1 is in then do a direc= t >> >> >> >> key >> >> >> >> lookup >> >> >> >> to get a list of the users in that same grid id. >> >> >> >> >> >> >> >> A second approach would be to have to column families -- one th= at >> >> >> >> maps >> >> >> >> a >> >> >> >> Latitude to a list of users who are at that latitude and a seco= nd >> >> >> >> that >> >> >> >> maps >> >> >> >> users who are at a particular longitude.=A0 You could do the sa= me >> >> >> >> rectange >> >> >> >> calculation above then do a get_slice range lookup to get a lis= t >> >> >> >> of >> >> >> >> users >> >> >> >> from range of latitude and a second list from the range of >> >> >> >> longitudes. >> >> >> >> You would then need to do a in-memory nested loop to find the >> >> >> >> list >> >> >> >> of >> >> >> >> users >> >> >> >> that are in both lists.=A0 This second approach could cause som= e >> >> >> >> trouble >> >> >> >> depending on where you search and how many users you really hav= e >> >> >> >> -- >> >> >> >> some >> >> >> >> latitudes and longitudes have many many people in them >> >> >> >> >> >> >> >> So, it seems some version of a chunking / grid id thing would b= e >> >> >> >> the >> >> >> >> better approach.=A0=A0 If you let people zoom in or zoom out - = you >> >> >> >> could >> >> >> >> just >> >> >> >> have different column families for each level of zoom. >> >> >> >> >> >> >> >> >> >> >> >> I'm stuck on a stopped train so -- here is even more code: >> >> >> >> >> >> >> >> static Decimal GetLatitudeMiles(Decimal lat) >> >> >> >> { >> >> >> >> Decimal f =3D 0.0M; >> >> >> >> lat =3D Math.Abs(lat); >> >> >> >> f =3D 68.99M; >> >> >> >> =A0=A0=A0=A0=A0=A0=A0=A0 if (lat >=3D 0.0M && lat < 10.0M) { f = =3D 68.71M; } >> >> >> >> else if (lat >=3D 10.0M && lat < 20.0M) { f =3D 68.73M; } >> >> >> >> else if (lat >=3D 20.0M && lat < 30.0M) { f =3D 68.79M; } >> >> >> >> else if (lat >=3D 30.0M && lat < 40.0M) { f =3D 68.88M; } >> >> >> >> else if (lat >=3D 40.0M && lat < 50.0M) { f =3D 68.99M; } >> >> >> >> else if (lat >=3D 50.0M && lat < 60.0M) { f =3D 69.12M; } >> >> >> >> else if (lat >=3D 60.0M && lat < 70.0M) { f =3D 69.23M; } >> >> >> >> else if (lat >=3D 70.0M && lat < 80.0M) { f =3D 69.32M; } >> >> >> >> else if (lat >=3D 80.0M) { f =3D 69.38M; } >> >> >> >> >> >> >> >> return f; >> >> >> >> } >> >> >> >> >> >> >> >> >> >> >> >> Decimal MilesPerDegreeLatitude =3D >> >> >> >> GetLatitudeMiles(zList[0].Latitude); >> >> >> >> Decimal MilesPerDegreeLongitude =3D ((Decimal) >> >> >> >> Math.Abs(Math.Cos((Double) >> >> >> >> zList[0].Latitude))) * 24900.0M / 360.0M; >> >> >> >> =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0 dRadius =3D 10.0M=A0 // ten miles >> >> >> >> Decimal deltaLat =3D dRadius / MilesPerDegreeLatitude; >> >> >> >> Decimal deltaLong =3D dRadius / MilesPerDegreeLongitude; >> >> >> >> >> >> >> >> ps.TopLatitude =3D zList[0].Latitude - deltaLat; >> >> >> >> ps.TopLongitude =3D zList[0].Longitude - deltaLong; >> >> >> >> ps.BottomLatitude =3D zList[0].Latitude + deltaLat; >> >> >> >> ps.BottomLongitude =3D zList[0].Longitude + deltaLong; >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Fri, 2010-04-09 at 16:30 -0700, Paul Prescod wrote: >> >> >> >> >> >> >> >> 2010/4/9 Onur AKTAS : >> >> >> >> > ... >> >> >> >> > I'm trying to find out how do you perform queries with >> >> >> >> > calculations >> >> >> >> > on >> >> >> >> > the >> >> >> >> > fly without inserting the data as calculated from the >> >> >> >> > beginning. >> >> >> >> > Lets say we have latitude and longitude coordinates of all >> >> >> >> > users >> >> >> >> > and >> >> >> >> > we >> >> >> >> > have >> >> >> >> > =A0Distance(from_lat, from_long, to_lat, to_long) function wh= ich >> >> >> >> > gives distance between lat/longs pairs in kilometers. >> >> >> >> >> >> >> >> I'm not an expert, but I think that it boils down to "MapReduce= " >> >> >> >> and >> >> >> >> "Hadoop". >> >> >> >> >> >> >> >> I don't think that there's any top-down tutorial on those two >> >> >> >> words, >> >> >> >> you'll have to research yourself starting here: >> >> >> >> >> >> >> >> =A0* http://en.wikipedia.org/wiki/MapReduce >> >> >> >> >> >> >> >> =A0* http://hadoop.apache.org/ >> >> >> >> >> >> >> >> =A0* http://wiki.apache.org/cassandra/HadoopSupport >> >> >> >> >> >> >> >> I don't think it is all documented in any one place yet... >> >> >> >> >> >> >> >> =A0Paul Prescod >> >> >> >> >> >> >> > >> >> >> > >> >> > >> >> > >> > >> > > >