spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: Apache Spark and Graphx for Real Time Analytics
Date Tue, 08 Apr 2014 21:12:53 GMT
Nick and Koert summarized it pretty well. Just to clarify, here are some
concrete examples.

If you want to start with a specific vertex and follow some path, it is
probably easier and faster to use a key-value store, or even MySQL or a
graph database.
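
As a rough illustration of that first access pattern, here is a toy sketch in plain Python. It is not the API of any particular store: the in-memory `adjacency` dict stands in for a key-value store keyed by vertex id, and the names `build_adjacency` and `friends_of_friends` are hypothetical. The point is that the query only touches the neighborhood of one vertex, not the whole graph.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Index undirected edges by vertex id, like rows in a key-value store."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    return adjacency


def friends_of_friends(adjacency, start):
    """Return vertices exactly two hops from `start`.

    Only the neighbor lists of `start` and of its direct neighbors are read,
    so the cost depends on local degree rather than on total graph size.
    """
    direct = adjacency.get(start, set())
    two_hop = set()
    for friend in direct:
        two_hop |= adjacency.get(friend, set())
    # Exclude the start vertex and anyone who is already a direct neighbor.
    return two_hop - direct - {start}


# Hypothetical toy data: a short chain a-b-c-d plus a side edge a-e.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")]
adjacency = build_adjacency(edges)
print(sorted(friends_of_friends(adjacency, "a")))
```

Each call is a handful of point lookups, which is the kind of workload a key-value store or an OLTP graph database is built to serve.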

If you want to compute the average path length between all nodes, or the
pairwise shortest paths for all vertices, GraphX will likely be much
faster.
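
To make the second case concrete, here is a minimal single-machine sketch in plain Python. It is not GraphX code, and the function names and toy graph are assumptions made for illustration. It computes the average shortest-path length over all connected ordered pairs of an unweighted graph by running a breadth-first search from every vertex.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop counts from `source` to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[vertex] + 1
                queue.append(neighbor)
    return dist


def average_path_length(adjacency):
    """Mean shortest-path length over all connected ordered vertex pairs.

    One BFS per vertex means every vertex and edge is visited many times,
    which is why this is a batch computation and not a point query.
    """
    total = 0
    pairs = 0
    for source in adjacency:
        for target, hops in bfs_distances(adjacency, source).items():
            if target != source:
                total += hops
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: the undirected chain a-b-c-d.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(average_path_length(adjacency))
```

Unlike a lookup from one starting vertex, this job has to scan the entire graph repeatedly, which is the kind of work that benefits from being distributed and kept in memory.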






On Tue, Apr 8, 2014 at 2:03 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:

> Likely neither will give real-time for full-graph traversal, no. And once
> in memory, GraphX would definitely be faster for "breadth-first" traversal.
>
> But for "vertex-centric" traversals (starting from a vertex and traversing
> edges from there, such as "friends of friends" queries etc) then Titan is
> optimized for that use case.
>
>
>
>
> On Tue, Apr 8, 2014 at 10:56 PM, Evan Chan <ev@ooyala.com> wrote:
>
> > I doubt Titan would be able to give you traversal of billions of nodes in
> > real-time either. In-memory traversal is typically much faster than
> > Cassandra-based tree traversal, even including in-memory caching.
> >
> >
> > On Tue, Apr 8, 2014 at 1:23 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
> >
> > > GraphX, like Spark, will not typically be "real-time" (where by "real-time"
> > > here I assume you mean of the order of a few 10s-100s ms, up to a few
> > > seconds).
> > >
> > > Spark can in some cases approach the upper boundary of this definition (a
> > > second or two, possibly less) when data is cached in memory and the
> > > computation is not "too heavy", while Spark Streaming may be able to get
> > > closer to the mid-to-upper boundary of this under similar conditions,
> > > especially if aggregating over relatively small windows.
> > >
> > > However, for this use case (while I haven't used GraphX yet) I would say
> > > something like Titan (https://github.com/thinkaurelius/titan/wiki) or a
> > > similar OLTP graph DB may be what you're after. But this depends on what
> > > kind of graph traversal you need.
> > >
> > >
> > >
> > >
> > > On Tue, Apr 8, 2014 at 10:02 PM, love2dishtech <love2dishtech@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Is GraphX on top of Apache Spark able to process large-scale
> > > > distributed graph traversal and computation in real time? What is the
> > > > query execution engine that distributes the query on top of GraphX and
> > > > Apache Spark? My typical use case is a large-scale distributed graph
> > > > traversal in real time, with billions of nodes.
> > > >
> > > > Thanks,
> > > > Love.
> > > >
> > > >
> > > >
> > > > --
> > > > View this message in context:
> > > > http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Spark-and-Graphx-for-Real-Time-Analytics-tp6261.html
> > > > Sent from the Apache Spark Developers List mailing list archive at
> > > > Nabble.com.
> > > >
> > >
> >
> >
> >
> > --
> > --
> > Evan Chan
> > Staff Engineer
> > ev@ooyala.com
> >
>
