spark-user mailing list archives

From Philip Ogren <philip.og...@oracle.com>
Subject Re: compare/contrast Spark with Cascading
Date Mon, 28 Oct 2013 21:20:58 GMT
Hi Paco,

Thank you for the various links and thoughts.  Yes - "workflow 
abstraction layer" is a better term for what I meant.  I have two 
questions for you:

1) when you say "Cascading is relatively agnostic about the distributed 
topology underneath it" I take that as a hedge suggesting that, while 
it might be possible to run Spark underneath Cascading, this is not 
commonly done, nor would it necessarily be straightforward.  Is 
this an unfair reading between the lines - or is 
Cascading-on-top-of-Spark an established technology stack that people 
are actually using?

2) Can you give an example of how Cascading is at a higher level of 
abstraction than Spark?  When I look at the landing page for Scalding 
(which runs on top of Cascading) and JCascalog (which claims to be yet 
another level of abstraction above Cascading) I see getting started code 
snippets that look exactly like the sort of thing you do with Spark.  I 
can understand why this is a useful approach for a getting started page 
but it doesn't shed light on how these two technologies might 
differ from Spark with respect to the abstraction layer they 
target.  Any thoughts on this (or examples!) would be helpful to me.

Thanks,
Philip


On 10/28/2013 1:00 PM, Paco Nathan wrote:
> Hi Philip,
>
> Cascading is relatively agnostic about the distributed topology 
> underneath it, especially as of the 2.0 release over a year ago. 
> There's been some discussion about writing a flow planner for Spark -- 
> e.g., which would replace the Hadoop flow planner. Not sure if there's 
> active work on that yet.
>
> There are a few commercial workflow abstraction layers (probably what 
> was meant by "application layer" ?), in terms of the Cascading family 
> (incl. Cascalog, Scalding), and also Actian's integration of 
> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others in 
> the Py data stack.
>
> Spark would not be at the same level of abstraction as Cascading 
> (business logic, effectively); however, something like MLbase is 
> ostensibly intended for that: http://www.mlbase.org/
>
> With respect to Spark, two other things to watch... One would 
> definitely be the Py data stack and ability to integrate with PySpark, 
> which is turning out to be a very powerful abstraction -- quite close to a 
> large segment of industry needs. The other project to watch, on the 
> Scala side, is Summingbird and its evolution at Twitter: 
> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>
> Paco
> http://amazon.com/dp/1449358721/
>
>
> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren 
> <philip.ogren@oracle.com> wrote:
>
>
>     My team is investigating a number of technologies in the Big Data
>     space.  A team member recently got turned on to Cascading
>     <http://www.cascading.org/about-cascading/> as an application
>     layer for orchestrating complex workflows/scenarios.  He asked me
>     if Spark had an "application layer"?  My initial reaction is "no":
>     Spark would not have a separate orchestration/application
>     layer.  Instead, the core Spark API (along with Streaming) would
>     compete directly with Cascading for this kind of functionality and
>     that the two would not likely be all that complementary.  I
>     realize that I am exposing my ignorance here and could be way
>     off.  Is there anyone who knows a bit about both of these
>     technologies who could speak to this in broad strokes?
>
>     Thanks!
>     Philip
>
>

