Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <x2t8f9ac4a91004110233s120c5290i5fb466b985ffa9b@mail.gmail.com>
References: <COL122-W33496ADBFA6EEE8B1631EBE7150@phx.gbl>
	 <q2t1cb725391004091630g59038efcy4505ffa47b88f3b2@mail.gmail.com>
	 <1270857660.3807.23.camel@malsmith-laptop>
	 <n2k7ca6fc441004092000hc7d49a44rff6192e768241789@mail.gmail.com>
	 <p2p1cb725391004092106q3bcba906i24da3a25d66b1621@mail.gmail.com>
	 <h2r7ca6fc441004102034k81b6bbddw4d05af62e78045a1@mail.gmail.com>
	 <p2ia7fcf8301004110043tb3896d19s1d46344657af49d6@mail.gmail.com>
	 <x2r8f9ac4a91004110054s3e6c0b1czc201398036c053c1@mail.gmail.com>
	 <q2pa7fcf8301004110131y4af6fbf7p8c64daad25fb9dd8@mail.gmail.com>
	 <x2t8f9ac4a91004110233s120c5290i5fb466b985ffa9b@mail.gmail.com>
Date: Sun, 11 Apr 2010 10:35:14 -0700
Message-ID: <u2ra7fcf8301004111035w1f7805eao9a5eefaec761a6e9@mail.gmail.com>
Subject: Re: How to perform queries on Cassandra?
From: Benjamin Black <b@b3k.us>
To: user@cassandra.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Ah, I see what you are doing.  No, that won't work.  The mistake you
are making is in thinking a CF is a table and that Cassandra is
storing columns.  A CF is a namespace and Cassandra stores key/value
pairs, where the keys are the row keys and the values are maps:

'usr1': {'password':'foo', 'thing':'blah'}
'usr2': {'password':'bar', 'info':3423452}

and so on.

The only query you can actually perform is a get on a row key.  The
column sorting and slicing are just conveniences to avoid retrieving
entire rows every time you want a subset of data from them.  There is
not an interface for querying columns _without_ a row key, which is
what you are proposing.  Even if there were, there is a serious
problem in your data model as you can't tell which part is which; were
my password 'vineet' the system you describe would be unable to
distinguish between our entries.

I am ignoring here the existence of the order-preserving partitioner
because I don't think it is generally useful and any attempt to apply
it here would be wildly inefficient compared to simply maintaining
proper indices.


b

On Sun, Apr 11, 2010 at 2:33 AM, Lucifer Dignified
<vineetdaniel@gmail.com> wrote:
> Hi Benjamin
>
> I'll try to make it more clear to you.
> We have a user table with fields 'id', 'username', and 'password'. Now if
> use the ideal way to store key/value, like :
> username : vineetdaniel
> timestamp
> password : <password>
> timestamp
>
> second user :
>
> username: <seconduser>
> timestamp
> password:<password>
>
> and so on, here what i assume is that as we cannot make search on values =
(as
> confirmed by guys on cassandra forums) we are not able to perform robust
> 'where' queries. Now what i propose is this.
>
> Rather than using a static values for column names use values itself and
> unique key as identifier. So, the above example when put in as per me wou=
ld
> be.
>
> vineetdaniel : vineetdaniel
> timestamp
>
> <password>:<password>
> timestamp
>
> second user
> seconduser:seconduser
> timestamp
>
> password:password
> timestamp
>
> By using above methodology we can simply make search on keys itself rathe=
r
> than going into using different CF's. But to add further, this cannot be
> used for every situation. I am still exploring this, and soon will be
> updating the group and my blog with information pertaining to this. As
> cassandra is new, I think every idea or experience should be shared with =
the
> community.
>
> I hope I example is clear this time. Should you have any queries feel fre=
e
> to revert.
>
> On Sun, Apr 11, 2010 at 2:01 PM, Benjamin Black <b@b3k.us> wrote:
>>
>> Sorry, I don't understand your example.
>>
>> On Sun, Apr 11, 2010 at 12:54 AM, Lucifer Dignified
>> <vineetdaniel@gmail.com> wrote:
>> > Benjamin I quite agree to you, but what in case of duplicate usernames=
,
>> > suppose if I am not using unique names as in email id's . If we have
>> > duplicacy in usernames we cannot use it for key, so what should be the
>> > solution. I think keeping incremental numeric id as key and keeping th=
e
>> > name
>> > and value same in the column family.
>> >
>> > Example :
>> > User1 has password as 123456
>> >
>> > Cassandra structure :
>> >
>> > 1 as key
>> > =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 user1 - column name
>> > =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 value - user1
>> > =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 123456 - column name
>> > =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 value - 123456
>> >
>> > I m thinking of doing it this way for my applicaton, this way i can ru=
n
>> > different sorts of queries too. Any feedback on this is welcome.
>> >
>> > On Sun, Apr 11, 2010 at 1:13 PM, Benjamin Black <b@b3k.us> wrote:
>> >>
>> >> You would have a Column Family, not a column for that; let's call it
>> >> the Users CF. =A0You'd use username as the row key and have a column
>> >> called 'password'. =A0For your example query, you'd retrieve row key
>> >> 'usr2', column 'password'. =A0The general pattern is that you create =
CFs
>> >> to act as indices for each query you want to perform. =A0There is no
>> >> equivalent to a relational store to perform arbitrary queries. =A0You
>> >> must structure things to permit the queries of interest.
>> >>
>> >>
>> >> b
>> >>
>> >> On Sat, Apr 10, 2010 at 8:34 PM, dir dir <sikerasakti@gmail.com> wrot=
e:
>> >> > I have already read the API spesification. Honestly I do not
>> >> > understand
>> >> > how to use it. Because there are not an examples.
>> >> >
>> >> > For example I have a column like this:
>> >> >
>> >> > UserName =A0=A0 Password
>> >> > usr1 =A0 =A0 =A0 =A0 =A0 =A0 =A0=A0 abc
>> >> > usr2 =A0 =A0 =A0 =A0 =A0 =A0 =A0=A0 xyz
>> >> > usr3 =A0 =A0 =A0 =A0 =A0 =A0 =A0=A0 opm
>> >> >
>> >> > suppose I want query the user's password using SQL in RDBMS
>> >> >
>> >> > =A0=A0=A0=A0=A0 Select Password From Users Where UserName =3D "usr2=
";
>> >> >
>> >> > Now I want to get the password using OODBMS DB4o Object Query=A0 an=
d
>> >> > Java
>> >> >
>> >> > =A0=A0=A0=A0 ObjectSet QueryResult =3D db.query(new Predicate()
>> >> > =A0=A0=A0=A0 {
>> >> > =A0=A0=A0=A0 =A0 =A0 =A0=A0 public boolean match(Users Myusers)
>> >> > =A0=A0=A0 =A0 =A0 =A0 =A0 {
>> >> > =A0=A0 =A0 =A0 =A0 =A0=A0=A0=A0=A0=A0=A0 return Myuser.getUserName(=
) =3D=3D "usr2";
>> >> > =A0 =A0 =A0 =A0=A0 =A0=A0 }
>> >> > =A0=A0=A0=A0 });
>> >> >
>> >> > After we get the Users instance in the QueryResult, hence we can ge=
t
>> >> > the
>> >> > usr2's password.
>> >> >
>> >> > How we perform this query using Cassandra API and Java??
>> >> > Would you tell me please??=A0 Thank You.
>> >> >
>> >> > Dir.
>> >> >
>> >> >
>> >> > On Sat, Apr 10, 2010 at 11:06 AM, Paul Prescod <paul@prescod.net>
>> >> > wrote:
>> >> >>
>> >> >> No. Cassandra has an API.
>> >> >>
>> >> >> http://wiki.apache.org/cassandra/API
>> >> >>
>> >> >> On Fri, Apr 9, 2010 at 8:00 PM, dir dir <sikerasakti@gmail.com>
>> >> >> wrote:
>> >> >> > Does Cassandra has a default query language such as SQL in RDBMS
>> >> >> > and Object Query in OODBMS?=A0 Thank you.
>> >> >> >
>> >> >> > Dir.
>> >> >> >
>> >> >> > On Sat, Apr 10, 2010 at 7:01 AM, malsmith
>> >> >> > <malsmith@treehousesystems.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >>
>> >> >> >> It's sort of an interesting problem - in RDBMS one relatively
>> >> >> >> simple
>> >> >> >> approach would be calculate a rectangle that is X km by Y km wi=
th
>> >> >> >> User
>> >> >> >> 1's
>> >> >> >> location at the center.=A0 So the rectangle is UserX - 10KmX ,
>> >> >> >> UserY-10KmY to
>> >> >> >> UserX+10KmX , UserY+10KmY
>> >> >> >>
>> >> >> >> Then you could query the database for all other users where tha=
t
>> >> >> >> each
>> >> >> >> user
>> >> >> >> considered is curUserX > UserX-10Km and curUserX < UserX+10KmX
>> >> >> >> and
>> >> >> >> curUserY
>> >> >> >> > UserY-10KmY and curUserY < UserY+10KmY
>> >> >> >> * Not the 10KmX and 10KmY are really a translation from
>> >> >> >> Kilometers
>> >> >> >> to
>> >> >> >> degrees of=A0 lat and longitude=A0 (that you can find on a goog=
le
>> >> >> >> search)
>> >> >> >>
>> >> >> >> With the right indexes this query actually runs pretty well.
>> >> >> >>
>> >> >> >> Translating that to Cassandra seems a bit complex at first - bu=
t
>> >> >> >> you
>> >> >> >> could
>> >> >> >> try something like pre-calculating a grid with the right
>> >> >> >> resolution
>> >> >> >> (like a
>> >> >> >> square of 5KM per side) and assign every user to a particular
>> >> >> >> grid
>> >> >> >> ID.
>> >> >> >> That
>> >> >> >> way you just calculate with grid ID User1 is in then do a direc=
t
>> >> >> >> key
>> >> >> >> lookup
>> >> >> >> to get a list of the users in that same grid id.
>> >> >> >>
>> >> >> >> A second approach would be to have to column families -- one th=
at
>> >> >> >> maps
>> >> >> >> a
>> >> >> >> Latitude to a list of users who are at that latitude and a seco=
nd
>> >> >> >> that
>> >> >> >> maps
>> >> >> >> users who are at a particular longitude.=A0 You could do the sa=
me
>> >> >> >> rectange
>> >> >> >> calculation above then do a get_slice range lookup to get a lis=
t
>> >> >> >> of
>> >> >> >> users
>> >> >> >> from range of latitude and a second list from the range of
>> >> >> >> longitudes.
>> >> >> >> You would then need to do a in-memory nested loop to find the
>> >> >> >> list
>> >> >> >> of
>> >> >> >> users
>> >> >> >> that are in both lists.=A0 This second approach could cause som=
e
>> >> >> >> trouble
>> >> >> >> depending on where you search and how many users you really hav=
e
>> >> >> >> --
>> >> >> >> some
>> >> >> >> latitudes and longitudes have many many people in them
>> >> >> >>
>> >> >> >> So, it seems some version of a chunking / grid id thing would b=
e
>> >> >> >> the
>> >> >> >> better approach.=A0=A0 If you let people zoom in or zoom out - =
you
>> >> >> >> could
>> >> >> >> just
>> >> >> >> have different column families for each level of zoom.
>> >> >> >>
>> >> >> >>
>> >> >> >> I'm stuck on a stopped train so -- here is even more code:
>> >> >> >>
>> >> >> >> static Decimal GetLatitudeMiles(Decimal lat)
>> >> >> >> {
>> >> >> >> Decimal f =3D 0.0M;
>> >> >> >> lat =3D Math.Abs(lat);
>> >> >> >> f =3D 68.99M;
>> >> >> >> =A0=A0=A0=A0=A0=A0=A0=A0 if (lat >=3D 0.0M && lat < 10.0M) { f =
=3D 68.71M; }
>> >> >> >> else if (lat >=3D 10.0M && lat < 20.0M) { f =3D 68.73M; }
>> >> >> >> else if (lat >=3D 20.0M && lat < 30.0M) { f =3D 68.79M; }
>> >> >> >> else if (lat >=3D 30.0M && lat < 40.0M) { f =3D 68.88M; }
>> >> >> >> else if (lat >=3D 40.0M && lat < 50.0M) { f =3D 68.99M; }
>> >> >> >> else if (lat >=3D 50.0M && lat < 60.0M) { f =3D 69.12M; }
>> >> >> >> else if (lat >=3D 60.0M && lat < 70.0M) { f =3D 69.23M; }
>> >> >> >> else if (lat >=3D 70.0M && lat < 80.0M) { f =3D 69.32M; }
>> >> >> >> else if (lat >=3D 80.0M) { f =3D 69.38M; }
>> >> >> >>
>> >> >> >> return f;
>> >> >> >> }
>> >> >> >>
>> >> >> >>
>> >> >> >> Decimal MilesPerDegreeLatitude =3D
>> >> >> >> GetLatitudeMiles(zList[0].Latitude);
>> >> >> >> Decimal MilesPerDegreeLongitude =3D ((Decimal)
>> >> >> >> Math.Abs(Math.Cos((Double)
>> >> >> >> zList[0].Latitude))) * 24900.0M / 360.0M;
>> >> >> >> =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=
=A0=A0 dRadius =3D 10.0M=A0 // ten miles
>> >> >> >> Decimal deltaLat =3D dRadius / MilesPerDegreeLatitude;
>> >> >> >> Decimal deltaLong =3D dRadius / MilesPerDegreeLongitude;
>> >> >> >>
>> >> >> >> ps.TopLatitude =3D zList[0].Latitude - deltaLat;
>> >> >> >> ps.TopLongitude =3D zList[0].Longitude - deltaLong;
>> >> >> >> ps.BottomLatitude =3D zList[0].Latitude + deltaLat;
>> >> >> >> ps.BottomLongitude =3D zList[0].Longitude + deltaLong;
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On Fri, 2010-04-09 at 16:30 -0700, Paul Prescod wrote:
>> >> >> >>
>> >> >> >> 2010/4/9 Onur AKTAS <onur.aktas@live.com>:
>> >> >> >> > ...
>> >> >> >> > I'm trying to find out how do you perform queries with
>> >> >> >> > calculations
>> >> >> >> > on
>> >> >> >> > the
>> >> >> >> > fly without inserting the data as calculated from the
>> >> >> >> > beginning.
>> >> >> >> > Lets say we have latitude and longitude coordinates of all
>> >> >> >> > users
>> >> >> >> > and
>> >> >> >> > we
>> >> >> >> > have
>> >> >> >> > =A0Distance(from_lat, from_long, to_lat, to_long) function wh=
ich
>> >> >> >> > gives distance between lat/longs pairs in kilometers.
>> >> >> >>
>> >> >> >> I'm not an expert, but I think that it boils down to "MapReduce=
"
>> >> >> >> and
>> >> >> >> "Hadoop".
>> >> >> >>
>> >> >> >> I don't think that there's any top-down tutorial on those two
>> >> >> >> words,
>> >> >> >> you'll have to research yourself starting here:
>> >> >> >>
>> >> >> >> =A0* http://en.wikipedia.org/wiki/MapReduce
>> >> >> >>
>> >> >> >> =A0* http://hadoop.apache.org/
>> >> >> >>
>> >> >> >> =A0* http://wiki.apache.org/cassandra/HadoopSupport
>> >> >> >>
>> >> >> >> I don't think it is all documented in any one place yet...
>> >> >> >>
>> >> >> >> =A0Paul Prescod
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>