incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Black...@b3k.us>
Subject Re: How to perform queries on Cassandra?
Date Sun, 11 Apr 2010 19:11:35 GMT
Just to be clear: do you understand we are saying you need to use
multiple CFs to achieve the goal, not a single one?

The Users CF would be indexed on a unique integer as you are saying
you intend.  There is no point in having values as column names here,
other than making things incredibly confusing.  Assume instead that
you have a column called 'username' and a column called 'password'.
In your model where usernames may be the same for different users, you
would have data that looked like this:

0: {'username':'usr1', 'password':'woop'}
1: {'username':'usr2', 'password':'foo'}
2: {'username':'usr2', 'password':'bar'}

The UsernameIndex CF would be indexed on usernames, giving a map from
a username to the unique identifiers in the Users CF with that
username:

'usr1': {0:0}
'usr2': {1:0, 2:0}

Note that since we don't care about the values in the UsernameIndex,
they are just set to 0.  You can stash data here, if you like, but it
can mean more overhead in maintaining data synchronization between the
raw data and the index data.  To perform your query on username
'usr2', you get 'usr2' from UsernameIndex CF, which gives you a set of
ids, and you then get those ids (1 and 2) from the Users CF.


b


On Sun, Apr 11, 2010 at 11:37 AM, vineet daniel <vineetdaniel@gmail.com> wrote:
> Well my initial idea is to use value  as column name, keeping key as an
> incremental integer. The discussion after each mail has drifted from this
> point which I had made. Will put it again.
>
> we want to store user information. We keep 1,2,3,4.....so on as keys. AND
> values as column names i.e rather than using column name 'first name', i'd
> be using 'vineet' as column name, rather than using 'last name' as column
> name i'd be using 'daniel'. This way I can directly read column names as
> values. This is just a thought that has come to my mind while trying to
> design my db for cassandra.
>
>
>
> On Sun, Apr 11, 2010 at 11:46 PM, Benjamin Black <b@b3k.us> wrote:
>>
>> Row keys must be unique.  If your usernames are not unique and you
>> want to be able to query on them, you either need to figure out a way
>> to make them unique or treat the username rows themselves as indices,
>> which refer to a set of actually unique identifiers for users.
>>
>> On Sun, Apr 11, 2010 at 11:12 AM, vineet daniel <vineetdaniel@gmail.com>
>> wrote:
>> > its not a problem its a scenario, which we need to handle. And all I am
>> > trying to do is to achieve what is not there with API i.e a workaroud.
>> >
>> > On Sun, Apr 11, 2010 at 11:06 PM, Benjamin Black <b@b3k.us> wrote:
>> >>
>> >> A system that permits multiple people to have the same username has a
>> >> serious problem.
>> >>
>> >> On Sun, Apr 11, 2010 at 6:12 AM, vineet daniel <vineetdaniel@gmail.com>
>> >> wrote:
>> >> > How to handle same usernames. Otherwise seems fine to me.
>> >> >
>> >> > On Sun, Apr 11, 2010 at 6:17 PM, Dop Sun <sunht@dopsun.com> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >>
>> >> >>
>> >> >> As far as I can see it, the Cassandra API currently supports
>> >> >> criterias
>> >> >> on:
>> >> >>
>> >> >> Token – Key – Super Column Name (if applicable) - Column Names
>> >> >>
>> >> >>
>> >> >>
>> >> >> I guess Token is not usually used for the day to day queries, so,
>> >> >> Key
>> >> >> and
>> >> >> Column Names are normally used for querying. For the user name
and
>> >> >> password
>> >> >> case, I guess it can be done like this:
>> >> >>
>> >> >>
>> >> >>
>> >> >> Define a CF as UserAuth with type as Super, and Key is user name,
>> >> >> while
>> >> >> password can be the SuperKeyName. So, while you receive the user
>> >> >> name
>> >> >> and
>> >> >> password from the UI (or any other methods), it can be queried
via:
>> >> >> multiget_slice or get_range_slices, if there are anything returned,
>> >> >> means
>> >> >> that the user name and password matches.
>> >> >>
>> >> >>
>> >> >>
>> >> >> If not using the super column name, and put the password as the
>> >> >> column
>> >> >> name, the column name usually not used for these kind of
>> >> >> discretionary
>> >> >> values (actually, I don’t see any definitive documents on how
to use
>> >> >> the
>> >> >> column Names and Super Columns, flexibility is the good of
>> >> >> Cassandra,
>> >> >> or is
>> >> >> it bad if abused? :P)
>> >> >>
>> >> >>
>> >> >>
>> >> >> Not sure whether this is the best way, but I guess it will work.
>> >> >>
>> >> >>
>> >> >>
>> >> >> Regards,
>> >> >>
>> >> >> Dop
>> >> >>
>> >> >>
>> >> >>
>> >> >> From: Lucifer Dignified [mailto:vineetdaniel@gmail.com]
>> >> >> Sent: Sunday, April 11, 2010 5:33 PM
>> >> >> To: user@cassandra.apache.org
>> >> >> Subject: Re: How to perform queries on Cassandra?
>> >> >>
>> >> >>
>> >> >>
>> >> >> Hi Benjamin
>> >> >>
>> >> >> I'll try to make it more clear to you.
>> >> >> We have a user table with fields 'id', 'username', and 'password'.
>> >> >> Now
>> >> >> if
>> >> >> use the ideal way to store key/value, like :
>> >> >> username : vineetdaniel
>> >> >> timestamp
>> >> >> password : <password>
>> >> >> timestamp
>> >> >>
>> >> >> second user :
>> >> >>
>> >> >> username: <seconduser>
>> >> >> timestamp
>> >> >> password:<password>
>> >> >>
>> >> >> and so on, here what i assume is that as we cannot make search
on
>> >> >> values
>> >> >> (as confirmed by guys on cassandra forums) we are not able to
>> >> >> perform
>> >> >> robust
>> >> >> 'where' queries. Now what i propose is this.
>> >> >>
>> >> >> Rather than using a static values for column names use values itself
>> >> >> and
>> >> >> unique key as identifier. So, the above example when put in as
per
>> >> >> me
>> >> >> would
>> >> >> be.
>> >> >>
>> >> >> vineetdaniel : vineetdaniel
>> >> >> timestamp
>> >> >>
>> >> >> <password>:<password>
>> >> >> timestamp
>> >> >>
>> >> >> second user
>> >> >> seconduser:seconduser
>> >> >> timestamp
>> >> >>
>> >> >> password:password
>> >> >> timestamp
>> >> >>
>> >> >> By using above methodology we can simply make search on keys itself
>> >> >> rather
>> >> >> than going into using different CF's. But to add further, this
>> >> >> cannot
>> >> >> be
>> >> >> used for every situation. I am still exploring this, and soon will
>> >> >> be
>> >> >> updating the group and my blog with information pertaining to this.
>> >> >> As
>> >> >> cassandra is new, I think every idea or experience should be shared
>> >> >> with the
>> >> >> community.
>> >> >>
>> >> >> I hope I example is clear this time. Should you have any queries
>> >> >> feel
>> >> >> free
>> >> >> to revert.
>> >> >>
>> >> >> On Sun, Apr 11, 2010 at 2:01 PM, Benjamin Black <b@b3k.us>
wrote:
>> >> >>
>> >> >> Sorry, I don't understand your example.
>> >> >>
>> >> >> On Sun, Apr 11, 2010 at 12:54 AM, Lucifer Dignified
>> >> >> <vineetdaniel@gmail.com> wrote:
>> >> >> > Benjamin I quite agree to you, but what in case of duplicate
>> >> >> > usernames,
>> >> >> > suppose if I am not using unique names as in email id's .
If we
>> >> >> > have
>> >> >> > duplicacy in usernames we cannot use it for key, so what should
be
>> >> >> > the
>> >> >> > solution. I think keeping incremental numeric id as key and
>> >> >> > keeping
>> >> >> > the
>> >> >> > name
>> >> >> > and value same in the column family.
>> >> >> >
>> >> >> > Example :
>> >> >> > User1 has password as 123456
>> >> >> >
>> >> >> > Cassandra structure :
>> >> >> >
>> >> >> > 1 as key
>> >> >> >            user1 - column name
>> >> >> >            value - user1
>> >> >> >            123456 - column name
>> >> >> >             value - 123456
>> >> >> >
>> >> >> > I m thinking of doing it this way for my applicaton, this
way i
>> >> >> > can
>> >> >> > run
>> >> >> > different sorts of queries too. Any feedback on this is welcome.
>> >> >> >
>> >> >> > On Sun, Apr 11, 2010 at 1:13 PM, Benjamin Black <b@b3k.us>
wrote:
>> >> >> >>
>> >> >> >> You would have a Column Family, not a column for that;
let's call
>> >> >> >> it
>> >> >> >> the Users CF.  You'd use username as the row key and
have a
>> >> >> >> column
>> >> >> >> called 'password'.  For your example query, you'd retrieve
row
>> >> >> >> key
>> >> >> >> 'usr2', column 'password'.  The general pattern is that
you
>> >> >> >> create
>> >> >> >> CFs
>> >> >> >> to act as indices for each query you want to perform.
 There is
>> >> >> >> no
>> >> >> >> equivalent to a relational store to perform arbitrary
queries.
>> >> >> >>  You
>> >> >> >> must structure things to permit the queries of interest.
>> >> >> >>
>> >> >> >>
>> >> >> >> b
>> >> >> >>
>> >> >> >> On Sat, Apr 10, 2010 at 8:34 PM, dir dir <sikerasakti@gmail.com>
>> >> >> >> wrote:
>> >> >> >> > I have already read the API spesification. Honestly
I do not
>> >> >> >> > understand
>> >> >> >> > how to use it. Because there are not an examples.
>> >> >> >> >
>> >> >> >> > For example I have a column like this:
>> >> >> >> >
>> >> >> >> > UserName    Password
>> >> >> >> > usr1                abc
>> >> >> >> > usr2                xyz
>> >> >> >> > usr3                opm
>> >> >> >> >
>> >> >> >> > suppose I want query the user's password using SQL
in RDBMS
>> >> >> >> >
>> >> >> >> >       Select Password From Users Where UserName
= "usr2";
>> >> >> >> >
>> >> >> >> > Now I want to get the password using OODBMS DB4o
Object Query
>> >> >> >> > and
>> >> >> >> > Java
>> >> >> >> >
>> >> >> >> >      ObjectSet QueryResult = db.query(new Predicate()
>> >> >> >> >      {
>> >> >> >> >             public boolean match(Users Myusers)
>> >> >> >> >             {
>> >> >> >> >                  return Myuser.getUserName()
== "usr2";
>> >> >> >> >             }
>> >> >> >> >      });
>> >> >> >> >
>> >> >> >> > After we get the Users instance in the QueryResult,
hence we
>> >> >> >> > can
>> >> >> >> > get
>> >> >> >> > the
>> >> >> >> > usr2's password.
>> >> >> >> >
>> >> >> >> > How we perform this query using Cassandra API and
Java??
>> >> >> >> > Would you tell me please??  Thank You.
>> >> >> >> >
>> >> >> >> > Dir.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Sat, Apr 10, 2010 at 11:06 AM, Paul Prescod
>> >> >> >> > <paul@prescod.net>
>> >> >> >> > wrote:
>> >> >> >> >>
>> >> >> >> >> No. Cassandra has an API.
>> >> >> >> >>
>> >> >> >> >> http://wiki.apache.org/cassandra/API
>> >> >> >> >>
>> >> >> >> >> On Fri, Apr 9, 2010 at 8:00 PM, dir dir
>> >> >> >> >> <sikerasakti@gmail.com>
>> >> >> >> >> wrote:
>> >> >> >> >> > Does Cassandra has a default query language
such as SQL in
>> >> >> >> >> > RDBMS
>> >> >> >> >> > and Object Query in OODBMS?  Thank you.
>> >> >> >> >> >
>> >> >> >> >> > Dir.
>> >> >> >> >> >
>> >> >> >> >> > On Sat, Apr 10, 2010 at 7:01 AM, malsmith
>> >> >> >> >> > <malsmith@treehousesystems.com>
>> >> >> >> >> > wrote:
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> It's sort of an interesting problem
- in RDBMS one
>> >> >> >> >> >> relatively
>> >> >> >> >> >> simple
>> >> >> >> >> >> approach would be calculate a rectangle
that is X km by Y
>> >> >> >> >> >> km
>> >> >> >> >> >> with
>> >> >> >> >> >> User
>> >> >> >> >> >> 1's
>> >> >> >> >> >> location at the center.  So the rectangle
is UserX - 10KmX
>> >> >> >> >> >> ,
>> >> >> >> >> >> UserY-10KmY to
>> >> >> >> >> >> UserX+10KmX , UserY+10KmY
>> >> >> >> >> >>
>> >> >> >> >> >> Then you could query the database for
all other users where
>> >> >> >> >> >> that
>> >> >> >> >> >> each
>> >> >> >> >> >> user
>> >> >> >> >> >> considered is curUserX > UserX-10Km
and curUserX <
>> >> >> >> >> >> UserX+10KmX
>> >> >> >> >> >> and
>> >> >> >> >> >> curUserY
>> >> >> >> >> >> > UserY-10KmY and curUserY < UserY+10KmY
>> >> >> >> >> >> * Not the 10KmX and 10KmY are really
a translation from
>> >> >> >> >> >> Kilometers
>> >> >> >> >> >> to
>> >> >> >> >> >> degrees of  lat and longitude  (that
you can find on a
>> >> >> >> >> >> google
>> >> >> >> >> >> search)
>> >> >> >> >> >>
>> >> >> >> >> >> With the right indexes this query actually
runs pretty
>> >> >> >> >> >> well.
>> >> >> >> >> >>
>> >> >> >> >> >> Translating that to Cassandra seems
a bit complex at first
>> >> >> >> >> >> -
>> >> >> >> >> >> but
>> >> >> >> >> >> you
>> >> >> >> >> >> could
>> >> >> >> >> >> try something like pre-calculating a
grid with the right
>> >> >> >> >> >> resolution
>> >> >> >> >> >> (like a
>> >> >> >> >> >> square of 5KM per side) and assign every
user to a
>> >> >> >> >> >> particular
>> >> >> >> >> >> grid
>> >> >> >> >> >> ID.
>> >> >> >> >> >> That
>> >> >> >> >> >> way you just calculate with grid ID
User1 is in then do a
>> >> >> >> >> >> direct
>> >> >> >> >> >> key
>> >> >> >> >> >> lookup
>> >> >> >> >> >> to get a list of the users in that same
grid id.
>> >> >> >> >> >>
>> >> >> >> >> >> A second approach would be to have to
column families --
>> >> >> >> >> >> one
>> >> >> >> >> >> that
>> >> >> >> >> >> maps
>> >> >> >> >> >> a
>> >> >> >> >> >> Latitude to a list of users who are
at that latitude and a
>> >> >> >> >> >> second
>> >> >> >> >> >> that
>> >> >> >> >> >> maps
>> >> >> >> >> >> users who are at a particular longitude. 
You could do the
>> >> >> >> >> >> same
>> >> >> >> >> >> rectange
>> >> >> >> >> >> calculation above then do a get_slice
range lookup to get a
>> >> >> >> >> >> list
>> >> >> >> >> >> of
>> >> >> >> >> >> users
>> >> >> >> >> >> from range of latitude and a second
list from the range of
>> >> >> >> >> >> longitudes.
>> >> >> >> >> >> You would then need to do a in-memory
nested loop to find
>> >> >> >> >> >> the
>> >> >> >> >> >> list
>> >> >> >> >> >> of
>> >> >> >> >> >> users
>> >> >> >> >> >> that are in both lists.  This second
approach could cause
>> >> >> >> >> >> some
>> >> >> >> >> >> trouble
>> >> >> >> >> >> depending on where you search and how
many users you really
>> >> >> >> >> >> have
>> >> >> >> >> >> --
>> >> >> >> >> >> some
>> >> >> >> >> >> latitudes and longitudes have many many
people in them
>> >> >> >> >> >>
>> >> >> >> >> >> So, it seems some version of a chunking
/ grid id thing
>> >> >> >> >> >> would
>> >> >> >> >> >> be
>> >> >> >> >> >> the
>> >> >> >> >> >> better approach.   If you let people
zoom in or zoom out -
>> >> >> >> >> >> you
>> >> >> >> >> >> could
>> >> >> >> >> >> just
>> >> >> >> >> >> have different column families for each
level of zoom.
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> I'm stuck on a stopped train so -- here
is even more code:
>> >> >> >> >> >>
>> >> >> >> >> >> static Decimal GetLatitudeMiles(Decimal
lat)
>> >> >> >> >> >> {
>> >> >> >> >> >> Decimal f = 0.0M;
>> >> >> >> >> >> lat = Math.Abs(lat);
>> >> >> >> >> >> f = 68.99M;
>> >> >> >> >> >>          if (lat >= 0.0M
&& lat < 10.0M) { f = 68.71M; }
>> >> >> >> >> >> else if (lat >= 10.0M &&
lat < 20.0M) { f = 68.73M; }
>> >> >> >> >> >> else if (lat >= 20.0M &&
lat < 30.0M) { f = 68.79M; }
>> >> >> >> >> >> else if (lat >= 30.0M &&
lat < 40.0M) { f = 68.88M; }
>> >> >> >> >> >> else if (lat >= 40.0M &&
lat < 50.0M) { f = 68.99M; }
>> >> >> >> >> >> else if (lat >= 50.0M &&
lat < 60.0M) { f = 69.12M; }
>> >> >> >> >> >> else if (lat >= 60.0M &&
lat < 70.0M) { f = 69.23M; }
>> >> >> >> >> >> else if (lat >= 70.0M &&
lat < 80.0M) { f = 69.32M; }
>> >> >> >> >> >> else if (lat >= 80.0M) { f = 69.38M;
}
>> >> >> >> >> >>
>> >> >> >> >> >> return f;
>> >> >> >> >> >> }
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> Decimal MilesPerDegreeLatitude =
>> >> >> >> >> >> GetLatitudeMiles(zList[0].Latitude);
>> >> >> >> >> >> Decimal MilesPerDegreeLongitude = ((Decimal)
>> >> >> >> >> >> Math.Abs(Math.Cos((Double)
>> >> >> >> >> >> zList[0].Latitude))) * 24900.0M / 360.0M;
>> >> >> >> >> >>                        
dRadius = 10.0M  // ten miles
>> >> >> >> >> >> Decimal deltaLat = dRadius / MilesPerDegreeLatitude;
>> >> >> >> >> >> Decimal deltaLong = dRadius / MilesPerDegreeLongitude;
>> >> >> >> >> >>
>> >> >> >> >> >> ps.TopLatitude = zList[0].Latitude -
deltaLat;
>> >> >> >> >> >> ps.TopLongitude = zList[0].Longitude
- deltaLong;
>> >> >> >> >> >> ps.BottomLatitude = zList[0].Latitude
+ deltaLat;
>> >> >> >> >> >> ps.BottomLongitude = zList[0].Longitude
+ deltaLong;
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> On Fri, 2010-04-09 at 16:30 -0700, Paul
Prescod wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> 2010/4/9 Onur AKTAS <onur.aktas@live.com>:
>> >> >> >> >> >> > ...
>> >> >> >> >> >> > I'm trying to find out how do you
perform queries with
>> >> >> >> >> >> > calculations
>> >> >> >> >> >> > on
>> >> >> >> >> >> > the
>> >> >> >> >> >> > fly without inserting the data
as calculated from the
>> >> >> >> >> >> > beginning.
>> >> >> >> >> >> > Lets say we have latitude and longitude
coordinates of
>> >> >> >> >> >> > all
>> >> >> >> >> >> > users
>> >> >> >> >> >> > and
>> >> >> >> >> >> > we
>> >> >> >> >> >> > have
>> >> >> >> >> >> >  Distance(from_lat, from_long,
to_lat, to_long) function
>> >> >> >> >> >> > which
>> >> >> >> >> >> > gives distance between lat/longs
pairs in kilometers.
>> >> >> >> >> >>
>> >> >> >> >> >> I'm not an expert, but I think that
it boils down to
>> >> >> >> >> >> "MapReduce"
>> >> >> >> >> >> and
>> >> >> >> >> >> "Hadoop".
>> >> >> >> >> >>
>> >> >> >> >> >> I don't think that there's any top-down
tutorial on those
>> >> >> >> >> >> two
>> >> >> >> >> >> words,
>> >> >> >> >> >> you'll have to research yourself starting
here:
>> >> >> >> >> >>
>> >> >> >> >> >>  * http://en.wikipedia.org/wiki/MapReduce
>> >> >> >> >> >>
>> >> >> >> >> >>  * http://hadoop.apache.org/
>> >> >> >> >> >>
>> >> >> >> >> >>  * http://wiki.apache.org/cassandra/HadoopSupport
>> >> >> >> >> >>
>> >> >> >> >> >> I don't think it is all documented in
any one place yet...
>> >> >> >> >> >>
>> >> >> >> >> >>  Paul Prescod
>> >> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >
>> >
>> >
>
>

Mime
View raw message