incubator-general mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: [DISCUSS] Crunch to join the Apache Incubator
Date Fri, 18 May 2012 12:20:13 GMT
Hey JB,

I think that the underlying data model is the main difference. Pig,
like Hive and Cascading, has a relational data model: the fundamental
data type is a Tuple of values. Crunch is closer to bare-metal
MapReduce. It doesn't impose a data model on the developer, and I
think that makes Crunch easier to use when you're working with data
types that would otherwise require you to write lots of UDFs in Pig,
such as time series, matrices, or HDF5 files [1].

The other major difference is, as you alluded to, the programming
environment: Crunch is a Java library that also has a Scala wrapper,
while Pig is, like Hive, a domain-specific language. As with the
data model, there is a tradeoff here: Crunch requires more skilled
developers, but it offers those developers the benefits of a real
programming language, such as for loops, debugging tools, and a rich
ecosystem of testing frameworks.

I am a Pig fan (see, for instance, [2] and [3]), and I see the tools
as complements, not competitors. Crunch is used by developers who are
building ETL pipelines in which performance and thorough testing are
critical, and Pig is used by analysts and data scientists to run
thousands of queries over the results of those ETL pipelines.

Best,
Josh

[1] http://www.hdfgroup.org/HDF5/
[2] http://www.cloudera.com/blog/2011/11/using-hadoop-to-analyze-adverse-drug-events/
[3] http://engineering.linkedin.com/open-source/introducing-datafu-open-source-collection-useful-apache-pig-udfs

On Fri, May 18, 2012 at 1:49 AM, Jean-Baptiste Onofré <jb@nanthrax.net> wrote:
> Hi Josh,
>
> Could you compare Crunch with Pig? Is Scala support the main difference?
>
> Thanks,
> Regards
> JB
>
>
> On 05/16/2012 02:23 AM, Josh Wills wrote:
>>
>> Hi all,
>>
>> I would like to propose Crunch, a library for writing MapReduce
>> pipelines in Java and Scala, as an Apache Incubator project. The
>> proposal is here:
>>
>> http://wiki.apache.org/incubator/CrunchProposal
>>
>> We would gladly welcome additional volunteers to act as mentors on the
>> project, so if this sounds like your cup of tea, please feel free to
>> sign up or let us know.
>>
>> Thanks!
>> Josh
>>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills
