spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ashic Mahtab <as...@live.com>
Subject Best way to do a lookup in Spark
Date Thu, 27 Nov 2014 13:36:32 GMT
Hi,
I'm looking to do an iterative algorithm implementation with data coming in from Cassandra.
This might be a use case for GraphX, however the ids are non-integral, and I would like to
avoid a mapping (for now). I'm doing a simple hubs and authorities HITS implementation, and
the current implementation does a lot of db access. It's fine (one half of a full iteration
is done in 25 minutes on 3M+ vertices), and use of Spark's cache() has achieved that. However,
each full iteration is 50 minutes, and I would like to improve that.

A high level overview of what I'm trying to do is:

1) Vertex structure (id, in, out, aScore, hScore).
2) Load all the vertices into memory (simple enough).
3) Have a lookup vertexid -> (aScore, hScore) in memory (currently, this is where I need
to do a lot of cassandra queries...which are very fast, but hoping to avoid).
4) Iterate n times in 2 statges:
    In the Hub Stage:
    a) Foreach vertex, get the sum of aScores for vertices it points to. Cache this.
    b) From the cache, get the max score. Divide each score in the cache by the max.
    c) Get rid of the cache.
    d) Update the lookup (in (3)) with the new hScores.
   
    In the Authority Stage:
    a) Foreach vertex, get the sum of hScores for vertices that point to it. Cache this.
    b) From the cache, get the max score. Divide each score in the cache by the max.
    c) Get rid of the cache.
    d) Update the lookup (in (3)) with the new aScores.
 
5) Update the final aScores and hScores from memory to Cassandra.

The one bit that I don't have now is the in memory lookup (i.e. to get the hScores and aScores
of neighbours in (4-a ). As such, I have to query cassandra for each vertex x times where
x is the number of neighbours. And as those values are used in the next iteration, I also
have to update cassandra for each run. Is it possibly to have this as an in memory distributed
lookup so that I can deal with the data store at the start and end?

One option is to identify clusters and run HITS for each cluster entirely in memory, however
if there's a simpler way I'd prefer that.

Regards,
Ashic.
 		 	   		  
Mime
View raw message