cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Malcolm Smith <malsm...@treehousesystems.com>
Subject Re: How to perform queries on Cassandra?
Date Sat, 10 Apr 2010 01:11:52 GMT
Mike are you stuck on a train too? :-)



On Apr 9, 2010, at 8:51 PM, Mike Gallamore <mike.e.gallamore@googlemail.com 
 > wrote:

> I apologize in advance if this goes into esoteric algorithms a bit  
> too much but I think this will get to an interesting idea to solve  
> your problem. My background is physics particularly computer  
> simulations of complex systems. Anyways in cosmology an interesting  
> algorithm is called an n-body tree code (its been around for at  
> least 20 years so a lot is available online about it). Since every  
> object with mass (well in general relativity actually anything with  
> energy but I digress) interacts with every other object with mass,  
> you end up with the "n-body" problem. The number of interactions in  
> a system goes as n(n-1) ~= n^2 where n is the number of elements.  
> This lead to a nightmare to do simulations of large systems, say two  
> galaxies colliding. 1 billion X 1 billion minus one is huge and  
> effectively incalculable since you would have to calculate this each  
> time you wanted to increment the simulation a tiny bit ahead in  
> time. How do you get a reasonable approximation to the solution? The  
> answer or at least one of them is n-body "tree codes".
>
> You take advantage of the fact that the the force that one star  
> feels from another falls off as 1/r^2 and importantly two stars far  
> away from the first star but close together relatively have roughly  
> the same magnitude and direction of the "r" vector. So you can  
> simply clump them together, ie sum there masses, and the force is GM1 
> (M"sum")/r^2. To do this efficiently numerically you break down the  
> system using binary search trees. Thinking in 2D just to keep it  
> simple, you divide the space into top left, top right, bottom left  
> bottom right as a first approximation. Then continually do that  
> until you end up with each element in its own box. When you figure  
> out the forces you are going to apply to the system you just take  
> the distance to the middle of the box that contains the ones you are  
> going to consider together (the closer to the star in question the  
> smaller the boxes need to be because the direction of r changes  
> quicker the closer the boxes are to the star, but farther away you  
> can use larger and larger boxes (which would contain a 2D tree like  
> structure descending to the point where each of the stars contained  
> are trapped in there own little box), sum the number of stars in the  
> box and presto.
>
> How would this help you? Well if you encoded the "box hierachy", say  
> 1 for top left, 2 for top right, 3 for bottom left, 4 for bottom  
> right, then you could specify the box that someone is in based on a  
> string like "14234". To find the set of stars/points/whatever that  
> are at least x away you just would have to do a range search for all  
> the points with their location "string" larger than or equal to the  
> location sting corresponding to the closest corner of the biggest  
> box such that its corner is at least "x" units away. Quite good as a  
> first approximation and the search algorithm should run as O(nlog 
> (n)) which is a logirithmic decrease in computation time. Ie the 1  
> billion times 1 billion -1 problem becomes 1 billion times ~9, much  
> much nicer. Really difficult thing to explain without looking over a  
> diagram in person I admit but hopefully it makes sense if you look  
> up the algorithm online.
>
>
> On 04/09/2010 05:01 PM, malsmith wrote:
>>
>>
>>
>> It's sort of an interesting problem - in RDBMS one relatively  
>> simple approach would be calculate a rectangle that is X km by Y km  
>> with User 1's location at the center.  So the rectangle is UserX -  
>> 10KmX , UserY-10KmY to UserX+10KmX , UserY+10KmY
>>
>> Then you could query the database for all other users where that  
>> each user considered is curUserX > UserX-10Km and curUserX < UserX 
>> +10KmX and curUserY > UserY-10KmY and curUserY < UserY+10KmY
>> * Not the 10KmX and 10KmY are really a translation from Kilometers  
>> to degrees of  lat and longitude  (that you can find on a google  
>> search)
>>
>> With the right indexes this query actually runs pretty well.
>>
>> Translating that to Cassandra seems a bit complex at first - but  
>> you could try something like pre-calculating a grid with the right  
>> resolution (like a square of 5KM per side) and assign every user to  
>> a particular grid ID.  That way you just calculate with grid ID  
>> User1 is in then do a direct key lookup to get a list of the users  
>> in that same grid id.
>>
>> A second approach would be to have to column families -- one that  
>> maps a Latitude to a list of users who are at that latitude and a  
>> second that maps users who are at a particular longitude.  You  
>> could do the same rectange calculation above then do a get_slice  
>> range lookup to get a list of users from range of latitude and a  
>> second list from the range of longitudes.    You would then need to  
>> do a in-memory nested loop to find the list of users that are in  
>> both lists.  This second approach could cause some trouble  
>> depending on where you search and how many users you really have --  
>> some latitudes and longitudes have many many people in them
>>
>> So, it seems some version of a chunking / grid id thing would be  
>> the better approach.   If you let people zoom in or zoom out - you  
>> could just have different column families for each level of zoom.
>>
>>
>> I'm stuck on a stopped train so -- here is even more code:
>>
>> static Decimal GetLatitudeMiles(Decimal lat)
>> {
>> Decimal f = 0.0M;
>> lat = Math.Abs(lat);
>> f = 68.99M;
>>          if (lat >= 0.0M && lat < 10.0M) { f = 68.71M; }
>> else if (lat >= 10.0M && lat < 20.0M) { f = 68.73M; }
>> else if (lat >= 20.0M && lat < 30.0M) { f = 68.79M; }
>> else if (lat >= 30.0M && lat < 40.0M) { f = 68.88M; }
>> else if (lat >= 40.0M && lat < 50.0M) { f = 68.99M; }
>> else if (lat >= 50.0M && lat < 60.0M) { f = 69.12M; }
>> else if (lat >= 60.0M && lat < 70.0M) { f = 69.23M; }
>> else if (lat >= 70.0M && lat < 80.0M) { f = 69.32M; }
>> else if (lat >= 80.0M) { f = 69.38M; }
>>
>> return f;
>> }
>>
>>
>> Decimal MilesPerDegreeLatitude = GetLatitudeMiles(zList[0].Latitude);
>> Decimal MilesPerDegreeLongitude = ((Decimal) Math.Abs(Math.Cos 
>> ((Double) zList[0].Latitude))) * 24900.0M / 360.0M;
>>                         dRadius = 10.0M  // ten miles
>> Decimal deltaLat = dRadius / MilesPerDegreeLatitude;
>> Decimal deltaLong = dRadius / MilesPerDegreeLongitude;
>>
>> ps.TopLatitude = zList[0].Latitude - deltaLat;
>> ps.TopLongitude = zList[0].Longitude - deltaLong;
>> ps.BottomLatitude = zList[0].Latitude + deltaLat;
>> ps.BottomLongitude = zList[0].Longitude + deltaLong;
>>
>>
>>
>> On Fri, 2010-04-09 at 16:30 -0700, Paul Prescod wrote:
>>>
>>> 2010/4/9 Onur AKTAS <onur.aktas@live.com>:
>>> > ...
>>> > I'm trying to find out how do you perform queries with  
>>> calculations on the
>>> > fly without inserting the data as calculated from the beginning.
>>> > Lets say we have latitude and longitude coordinates of all users  
>>> and we have
>>> >  Distance(from_lat, from_long, to_lat, to_long) function which
>>> > gives distance between lat/longs pairs in kilometers.
>>>
>>> I'm not an expert, but I think that it boils down to "MapReduce"  
>>> and "Hadoop".
>>>
>>> I don't think that there's any top-down tutorial on those two words,
>>> you'll have to research yourself starting here:
>>>
>>>  * http://en.wikipedia.org/wiki/MapReduce
>>>
>>>  * http://hadoop.apache.org/
>>>
>>>  * http://wiki.apache.org/cassandra/HadoopSupport
>>>
>>> I don't think it is all documented in any one place yet...
>>>
>>>  Paul Prescod
>>>
>>
>

Mime
View raw message