hadoop-common-user mailing list archives

From Chris K Wensel <ch...@wensel.net>
Subject Re: does anyone have idea on how to run multiple sequential jobs with bash script
Date Thu, 12 Jun 2008 01:04:41 GMT
Thanks, Ted.

Couple quick comments.

At one level, Cascading is a MapReduce query planner, just like PIG. The
difference is that the Cascading API is meant for public consumption and is
fully extensible, whereas with PIG you typically interact through the
PigLatin syntax. Consequently, with Cascading you can layer your own syntax
on top of the API. Currently there is Groovy support (Groovy is used to
assemble the work; it does not run on the mappers or reducers), and I hear
rumors of Jython elsewhere.

A couple of Groovy examples (these are obviously trivial, but the DSL can
absorb tremendous complexity if need be):
http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/wordcount.groovy
http://code.google.com/p/cascading/source/browse/trunk/cascading.groovy/sample/widefinder.groovy
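
For the curious, the same kind of thing written directly against the Java
pipe assembly API looks roughly like the following. This is only a minimal
word count sketch; the class names (Hfs, TextLine, RegexGenerator, GroupBy,
Count, FlowConnector) are the ones in current Cascading releases, so treat
the exact signatures as approximate rather than gospel.

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCountSketch {
  public static void main(String[] args) {
    // source tap reads text lines from HDFS, sink writes the results back out
    Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
    Tap sink = new Hfs(new TextLine(), args[1], true);

    // pipe assembly: split each line into words, group by word, count each group
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count());

    // the planner turns this assembly into however many map/reduce jobs it needs
    Flow flow = new FlowConnector().connect(source, sink, assembly);
    flow.complete();
  }
}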

Since Cascading is in part a 'planner', it internally builds a new
representation of what the developer assembled and renders out the necessary
map/reduce jobs (transparently linking them) at runtime. As Hadoop evolves,
the planner will incorporate new features and leverage them transparently.
There are also opportunities for identifying patterns and applying different
strategies (hypothetically, map-side vs. reduce-side joins, for one). It is
also conceivable (but untried) that different planners could target systems
other than Hadoop, making your code and libraries portable. Much of this is
true for PIG as well.
http://www.cascading.org/documentation/overview.html

Also, Cascading will at some point provide a PIG adapter, allowing PigLatin
queries to participate in a larger Cascading 'Cascade' (the topological
scheduler). Cascading is good at integration, connecting work outside Hadoop
with work to be done inside Hadoop, and PIG looks like a great way to
concisely represent a complex solution and execute it. There isn't any
reason they can't work together; it has always been the intention.
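
And since the original question was about running several jobs in sequence,
here is roughly what a Cascade looks like in code. It is a minimal sketch
with identity pipes standing in for real assemblies; the point is only that
the CascadeConnector infers the ordering from the taps, so you never write
the "run job A, then job B" glue yourself.

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.Identity;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;

public class SequentialJobsSketch {
  public static void main(String[] args) {
    FlowConnector connector = new FlowConnector();

    // taps chained head to tail: the first flow's sink is the second flow's source
    Tap input = new Hfs(new TextLine(), "some/input/path");
    Tap middle = new Hfs(new TextLine(), "some/intermediate/path", true);
    Tap output = new Hfs(new TextLine(), "some/output/path", true);

    // identity pipes stand in for real assemblies
    Pipe first = new Each(new Pipe("first"), new Identity());
    Pipe second = new Each(new Pipe("second"), new Identity());

    Flow firstFlow = connector.connect("first", input, middle, first);
    Flow secondFlow = connector.connect("second", middle, output, second);

    // the CascadeConnector infers that secondFlow depends on firstFlow (it reads
    // what firstFlow writes) and the Cascade runs them in topological order,
    // regardless of the order they are passed in
    Cascade cascade = new CascadeConnector().connect(secondFlow, firstFlow);
    cascade.complete();
  }
}

The intent is make-like: a Cascade can skip flows whose sinks are already up
to date, though that interacts with the replace flag used on the taps above.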

The takeaway is that with Cascading and PIG, users do not think in  
MapReduce. With PIG, you think in PigLatin. With Cascading, you can  
use the pipe/filter based API, or use your favorite scripting language  
and build a DSL for your problem domain.

Many companies have done similar things internally, but those tend to be
nothing more than a scriptable way to write map/reduce jobs and glue them
together. You still think in MapReduce, which in my opinion doesn't scale
well.

My (biased) recommendation is this.

Build out your application in Cascading. If part of the problem is best
represented in PIG, no worries: use PIG, and feed it and clean up after it
with Cascading. And if you see a solvable bottleneck, and we can't convince
the planner to recognize the pattern and plan better, replace that piece of
the process with a custom MapReduce job (or more than one).
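
For that last case, the rough shape is to wrap your JobConf so the custom
job can still be scheduled by the Cascade. A sketch, assuming the
MapReduceFlow wrapper; the mapper/reducer classes and the paths here are
placeholders for whatever your hand-tuned job actually needs.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;
import cascading.flow.MapReduceFlow;

public class CustomJobStepSketch {
  public static void main(String[] args) {
    // a plain Hadoop job, configured the usual way (mapper, reducer, and
    // input/output formats omitted; set them as your job requires)
    JobConf conf = new JobConf(CustomJobStepSketch.class);
    conf.setJobName("hand-tuned-step");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(conf, new Path("some/cleaned/path"));
    FileOutputFormat.setOutputPath(conf, new Path("some/custom/path"));

    // wrap the JobConf so the Cascade can schedule it like any other Flow;
    // the wrapper derives its source and sink taps from the JobConf paths
    Flow customStep = new MapReduceFlow("hand-tuned-step", conf, true);

    // any Cascading flows that produce "some/cleaned/path" or consume
    // "some/custom/path" would be connected here too, ordered by their taps
    Cascade cascade = new CascadeConnector().connect(customStep);
    cascade.complete();
  }
}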

Solve your problem first, then optimize the solution, if need be.

ckw

On Jun 11, 2008, at 5:00 PM, Ted Dunning wrote:

> Pig is much more ambitious than cascading. Because of the ambitions,
> simple things got overlooked. For instance, something as simple as
> computing a file name to load is not possible in pig, nor is it possible
> to write functions in pig. You can hook to Java functions (for some
> things), but you can't really write programs in pig. On the other hand,
> pig may eventually provide really incredible capabilities, including
> program rewriting and optimization, that would be incredibly hard to
> write directly in Java.
>
> The point of cascading was simply to make life easier for a normal
> Java/map-reduce programmer. It provides an abstraction for gluing
> together several map-reduce programs and for doing a few common things
> like joins. Because you are still writing Java (or Groovy) code, you
> have all of the functionality you always had. But this same benefit
> costs you the future in terms of what optimizations are likely to ever
> be possible.
>
> The summary for us (especially 4-6 months ago when we were deciding) is
> that cascading is good enough to use now and pig will probably be more
> useful later.
>
> On Wed, Jun 11, 2008 at 4:19 PM, Haijun Cao <haijun@kindsight.net> wrote:
>
>> I find cascading very similar to pig, do you care to provide your
>> comment here? If map reduce programmers are to go to the next level
>> (scripting/query language), which way to go?

--
Chris K Wensel
chris@wensel.net
http://chris.wensel.net/
http://www.cascading.org/





