Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Message-ID: <4BBFCB74.4050307@gmail.com>
Date: Fri, 09 Apr 2010 17:51:00 -0700
From: Mike Gallamore <mike.e.gallamore@googlemail.com>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US;
 rv:1.9.1.9pre) Gecko/20100217 Shredder/3.0.3pre
MIME-Version: 1.0
To: user@cassandra.apache.org
Subject: Re: How to perform queries on Cassandra?
References: <COL122-W33496ADBFA6EEE8B1631EBE7150@phx.gbl>
	 <q2t1cb725391004091630g59038efcy4505ffa47b88f3b2@mail.gmail.com>
 <1270857660.3807.23.camel@malsmith-laptop>
In-Reply-To: <1270857660.3807.23.camel@malsmith-laptop>
Content-Type: multipart/alternative;
 boundary="------------090408020600010200050100"

This is a multi-part message in MIME format.
--------------090408020600010200050100
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

I apologize in advance if this goes into esoteric algorithms a bit too
much, but I think it leads to an interesting idea for solving your
problem. My background is physics, particularly computer simulations of
complex systems. In cosmology there is an interesting algorithm called
an n-body tree code (it's been around for at least 20 years, so a lot is
available online about it). Since every object with mass (well, in
general relativity actually anything with energy, but I digress)
interacts with every other object with mass, you end up with the
"n-body" problem. The number of interactions in a system goes as n(n-1)
~= n^2, where n is the number of elements. This makes simulating large
systems, say two galaxies colliding, a nightmare. 1 billion X (1
billion minus one) is huge and effectively incalculable, since you would
have to calculate this each time you wanted to advance the simulation
a tiny bit in time. How do you get a reasonable approximation to
the solution? The answer, or at least one of them, is n-body "tree codes".
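
To make the n(n-1) count concrete, here is a minimal Python sketch of the brute-force approach. It is only an illustration: the function name, the 2D setup and G = 1 in arbitrary units are my own assumptions, not taken from any particular simulation code.

```python
import math

G = 1.0  # gravitational constant in arbitrary units (assumption for this sketch)

def direct_forces(positions, masses):
    """Brute-force 2D gravity: returns (forces, number of pair evaluations)."""
    n = len(positions)
    forces = [[0.0, 0.0] for _ in range(n)]
    evaluations = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            r = math.hypot(dx, dy)
            f = G * masses[i] * masses[j] / r ** 2  # magnitude falls off as 1/r^2
            forces[i][0] += f * dx / r              # project onto the x direction
            forces[i][1] += f * dy / r              # project onto the y direction
            evaluations += 1
    return forces, evaluations

positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
masses = [1.0, 2.0, 3.0, 4.0]
forces, evaluations = direct_forces(positions, masses)
print(evaluations)  # n * (n - 1) = 12 for n = 4
```

Every time step repeats the whole double loop, which is why this stops being practical long before you reach a billion bodies.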

You take advantage of the fact that the force one star feels from
another falls off as 1/r^2 and, importantly, that two stars far away from
the first star but relatively close together have roughly the same
magnitude and direction of the "r" vector. So you can simply clump them
together, i.e. sum their masses, and the force is G*M1*M"sum"/r^2. To do
this efficiently numerically you break the system down using a tree of
nested boxes (a quadtree in 2D). Thinking in 2D just to keep it simple,
you divide the space into top left, top right, bottom left and bottom
right as a first approximation. Then you keep subdividing until each
element ends up in its own box. When you work out the forces to apply,
you just take the distance to the middle of the box containing the stars
you are going to consider together and sum the masses of the stars in
that box, and presto. The closer a box is to the star in question, the
smaller it needs to be, because the direction of r changes more quickly
nearby; farther away you can use larger and larger boxes (each of which
contains its own 2D tree-like structure descending to the point where
every star is trapped in its own little box).
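
Here is a small Python sketch of just the clumping step, not a full tree code. The positions and masses are made up, G = 1 is arbitrary, and I place the clump at the centre of mass of the two stars rather than the geometric middle of a box, which is an assumption on my part.

```python
import math

G = 1.0  # arbitrary units (assumption for this sketch)

def force_on(target, target_mass, source, source_mass):
    """2D gravitational force vector on `target` due to `source`."""
    dx = source[0] - target[0]
    dy = source[1] - target[1]
    r = math.hypot(dx, dy)
    f = G * target_mass * source_mass / r ** 2
    return (f * dx / r, f * dy / r)

star, m1 = (0.0, 0.0), 1.0
far_a, m_a = (100.0, 0.0), 2.0  # two stars far from `star` ...
far_b, m_b = (100.0, 1.0), 3.0  # ... but close to each other

# Exact answer: add the two forces one by one.
fa = force_on(star, m1, far_a, m_a)
fb = force_on(star, m1, far_b, m_b)
exact = (fa[0] + fb[0], fa[1] + fb[1])

# Clumped answer: one body with the summed mass at the centre of mass.
m_sum = m_a + m_b
clump = ((m_a * far_a[0] + m_b * far_b[0]) / m_sum,
         (m_a * far_a[1] + m_b * far_b[1]) / m_sum)
approx = force_on(star, m1, clump, m_sum)

rel_error = math.hypot(exact[0] - approx[0], exact[1] - approx[1]) / math.hypot(*exact)
print(rel_error)  # tiny compared to 1, because the pair is far away
```

The error grows as the pair gets closer to the star or more spread out, which is exactly why nearby boxes have to be small.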

How would this help you? Well, if you encoded the "box hierarchy", say 1
for top left, 2 for top right, 3 for bottom left, 4 for bottom right,
then you could specify the box that someone is in with a string like
"14234". To find the set of stars/points/whatever that are at least x
away you would just have to do a range search for all the points with
their location "string" larger than or equal to the location string
corresponding to the closest corner of the biggest box such that its
corner is at least "x" units away. Quite good as a first approximation,
and the whole calculation should run as O(n log(n)) instead of O(n^2).
I.e. the 1 billion times (1 billion - 1) problem becomes 1 billion times
~9 (taking the log base 10; about 30 in base 2), much much nicer. It's a
really difficult thing to explain without looking over a diagram in
person, I admit, but hopefully it makes sense if you look up the
algorithm online.
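
In lieu of a diagram, here is a rough Python sketch of the encoding. It assumes points live in a unit square with y growing toward the top; the names box_key and points_in_box and the sample users are hypothetical. The range search here is a simple "everything inside one box" prefix lookup over a sorted list, which is my reading of the idea rather than a finished distance query.

```python
import bisect

def box_key(x, y, depth, size=1.0):
    """Encode a point in [0, size) x [0, size) as a string of quadrant digits.

    1 = top left, 2 = top right, 3 = bottom left, 4 = bottom right.
    """
    x0, y0, half = 0.0, 0.0, size / 2.0
    digits = []
    for _ in range(depth):
        right = x >= x0 + half
        top = y >= y0 + half
        if top:
            digits.append("2" if right else "1")
        else:
            digits.append("4" if right else "3")
        if right:
            x0 += half  # descend into the right half
        if top:
            y0 += half  # descend into the top half
        half /= 2.0
    return "".join(digits)

def points_in_box(sorted_entries, prefix):
    """Names of all entries whose key starts with `prefix` (binary search)."""
    lo = bisect.bisect_left(sorted_entries, (prefix,))
    hi = bisect.bisect_left(sorted_entries, (prefix + "5",))  # "5" sorts after 1-4
    return [name for _, name in sorted_entries[lo:hi]]

users = {"alice": (0.1, 0.9), "bob": (0.2, 0.8),
         "carol": (0.9, 0.1), "dave": (0.4, 0.6)}
entries = sorted((box_key(x, y, 3), name) for name, (x, y) in users.items())

print(points_in_box(entries, "11"))  # users inside box "11"
print(points_in_box(entries, "1"))   # users inside the whole top-left quadrant
```

In Cassandra the sorted list would presumably be sorted keys or column names that you slice over, but I have not tried that. One caveat: two points can be right next to each other and still sit on opposite sides of a box edge, so a real search would also have to look in the neighbouring boxes.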


On 04/09/2010 05:01 PM, malsmith wrote:
>
>
> It's sort of an interesting problem - in an RDBMS one relatively simple
> approach would be to calculate a rectangle that is X km by Y km with
> User 1's location at the center.  So the rectangle runs from
> (UserX-10KmX, UserY-10KmY) to (UserX+10KmX, UserY+10KmY).
>
> Then you could query the database for all other users where
> curUserX > UserX-10KmX and curUserX < UserX+10KmX
> and curUserY > UserY-10KmY and curUserY < UserY+10KmY
> * Note that 10KmX and 10KmY are really a translation from kilometers to
> degrees of latitude and longitude (which you can find with a Google search)
>
> With the right indexes this query actually runs pretty well.
>
> Translating that to Cassandra seems a bit complex at first - but you
> could try something like pre-calculating a grid with the right
> resolution (like a square of 5 km per side) and assigning every user to
> a particular grid ID.  That way you just calculate which grid ID User1
> is in, then do a direct key lookup to get a list of the users in that
> same grid ID.
>
> A second approach would be to have two column families -- one that maps
> a latitude to a list of users who are at that latitude and a second
> that maps a longitude to the users who are at that longitude.  You could
> do the same rectangle calculation as above, then do a get_slice range
> lookup to get a list of users from the range of latitudes and a second
> list from the range of longitudes.  You would then need to do an
> in-memory nested loop to find the users that are in both lists.  This
> second approach could cause some trouble depending on where you search
> and how many users you really have -- some latitudes and longitudes have
> many many people in them.
>
> So, it seems some version of a chunking / grid id thing would be the 
> better approach.   If you let people zoom in or zoom out - you could 
> just have different column families for each level of zoom.
>
>
> I'm stuck on a stopped train so -- here is even more code:
>
> static Decimal GetLatitudeMiles(Decimal lat)
> {
>     // Approximate miles per degree of latitude, in 10-degree bands.
>     lat = Math.Abs(lat);
>     Decimal f = 68.99M;
>     if (lat >= 0.0M && lat < 10.0M) { f = 68.71M; }
>     else if (lat >= 10.0M && lat < 20.0M) { f = 68.73M; }
>     else if (lat >= 20.0M && lat < 30.0M) { f = 68.79M; }
>     else if (lat >= 30.0M && lat < 40.0M) { f = 68.88M; }
>     else if (lat >= 40.0M && lat < 50.0M) { f = 68.99M; }
>     else if (lat >= 50.0M && lat < 60.0M) { f = 69.12M; }
>     else if (lat >= 60.0M && lat < 70.0M) { f = 69.23M; }
>     else if (lat >= 70.0M && lat < 80.0M) { f = 69.32M; }
>     else if (lat >= 80.0M) { f = 69.38M; }
>
>     return f;
> }
>
>
> Decimal MilesPerDegreeLatitude = GetLatitudeMiles(zList[0].Latitude);
> // Math.Cos expects radians, so convert the latitude from degrees first.
> Double latRadians = (Double) zList[0].Latitude * Math.PI / 180.0;
> Decimal MilesPerDegreeLongitude =
>     ((Decimal) Math.Abs(Math.Cos(latRadians))) * 24900.0M / 360.0M;
> dRadius = 10.0M;  // ten miles
> Decimal deltaLat = dRadius / MilesPerDegreeLatitude;
> Decimal deltaLong = dRadius / MilesPerDegreeLongitude;
>
> ps.TopLatitude = zList[0].Latitude - deltaLat;
> ps.TopLongitude = zList[0].Longitude - deltaLong;
> ps.BottomLatitude = zList[0].Latitude + deltaLat;
> ps.BottomLongitude = zList[0].Longitude + deltaLong;
>
>
>
> On Fri, 2010-04-09 at 16:30 -0700, Paul Prescod wrote:
>> 2010/4/9 Onur AKTAS <onur.aktas@live.com>:
>> >  ...
>> >  I'm trying to find out how do you perform queries with calculations on the
>> >  fly without inserting the data as calculated from the beginning.
>> >  Lets say we have latitude and longitude coordinates of all users and we have
>> >    Distance(from_lat, from_long, to_lat, to_long) function which
>> >  gives distance between lat/longs pairs in kilometers.
>>
>> I'm not an expert, but I think that it boils down to "MapReduce" and "Hadoop".
>>
>> I don't think that there's any top-down tutorial on those two words,
>> you'll have to research yourself starting here:
>>
>>   * http://en.wikipedia.org/wiki/MapReduce
>>
>>   * http://hadoop.apache.org/
>>
>>   * http://wiki.apache.org/cassandra/HadoopSupport
>>
>> I don't think it is all documented in any one place yet...
>>
>>   Paul Prescod
>>      
>
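
Coming back to the grid ID idea quoted above, it can be sketched in a few lines of Python. This is only a rough illustration: the cell size is given in degrees (0.05 degrees is roughly 5.5 km north-south), the dictionary stands in for a Cassandra row keyed by grid ID, and all of the names and coordinates are made up.

```python
import math
from collections import defaultdict

CELL_DEG = 0.05  # cell size in degrees, roughly 5.5 km north-south (assumption)

def grid_id(lat, lon, cell_deg=CELL_DEG):
    """Map a lat/long pair to the ID of the grid cell that contains it."""
    row = math.floor(lat / cell_deg)
    col = math.floor(lon / cell_deg)
    return f"{row}:{col}"

# Stand-in for a column family: grid ID -> list of users in that cell.
users_by_grid = defaultdict(list)

def add_user(name, lat, lon):
    users_by_grid[grid_id(lat, lon)].append(name)

def users_near(lat, lon):
    """Direct key lookup: everyone in the same grid cell as (lat, lon)."""
    return list(users_by_grid.get(grid_id(lat, lon), []))

add_user("alice", 49.2827, -123.1207)
add_user("bob", 49.2850, -123.1150)
add_user("carol", 47.6062, -122.3321)

print(users_near(49.2830, -123.1200))  # alice and bob share this cell
```

As with the boxes in my approach, someone just across a cell edge is missed by a single lookup, so in practice you would probably also fetch the eight surrounding cells.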


--------------090408020600010200050100--