hadoop-common-user mailing list archives

From Subir S <subir.sasiku...@gmail.com>
Subject Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R
Date Fri, 02 Mar 2012 14:21:08 GMT
On Fri, Mar 2, 2012 at 12:38 PM, Harsh J <harsh@cloudera.com> wrote:

> On Fri, Mar 2, 2012 at 10:18 AM, Subir S <subir.sasikumar@gmail.com>
> wrote:
> > Hello Folks,
> >
> > Are there any pointers to such comparisons between Apache Pig and Hadoop
> > Streaming Map Reduce jobs?
> I do not see why you seek to compare these two. Pig offers a language
> that lets you write data-flow operations and runs these statements as
> a series of MR jobs for you automatically (Making it a great tool to
> use to get data processing done really quick, without bothering with
> code), while streaming is something you use to write non-Java, simple
> MR jobs. Both have their own purposes.

We are comparing the two to see how much each one improves developer
productivity (coding time) without sacrificing the performance of the
resulting MR jobs.

> > Also there was a claim in our company that Pig performs better than Map
> > Reduce jobs? Is this true? Are there any such benchmarks available
> Pig _runs_ MR jobs. It does do job design (and some data)
> optimizations based on your queries, which is what may give it an edge
> over designing elaborate flows of plain MR jobs with tools like
> Oozie/JobControl (Which takes more time to do). But regardless, Pig
> only makes it easy doing the same thing with Pig Latin statements for
> you.

I knew that Pig runs MR jobs, just as Hive does. But Hive jobs become quite
slow with a lot of joins, and we can get the same result faster by writing
raw MR jobs. With that in mind, I was trying to understand how Pig runs its
MR jobs, and what kinds of projects should consider Pig. For example,
projects with a lot of joins, which take time to write as plain MR jobs.
Thoughts?
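
To make the join case concrete, here is a minimal sketch of a reduce-side
join in the style one would write for Hadoop Streaming. It is an illustration
only, not a benchmark: the function names, the "A"/"B" tags, and the
tab-separated key-first input layout are assumptions, and the in-memory sort
stands in for the shuffle/sort that Hadoop would do between the map and
reduce phases. In Pig Latin the same operation is a single statement along
the lines of `C = JOIN A BY id, B BY id;`.

```python
from itertools import groupby
from operator import itemgetter


def map_tagged(lines, tag):
    """Mapper: emit (join_key, tag, rest) for each tab-separated line."""
    for line in lines:
        key, _, rest = line.rstrip("\n").partition("\t")
        yield key, tag, rest


def reduce_join(sorted_records):
    """Reducer: per key, pair every "A" value with every "B" value."""
    for key, group in groupby(sorted_records, key=itemgetter(0)):
        left, right = [], []
        for _, tag, rest in group:
            (left if tag == "A" else right).append(rest)
        for l_val in left:
            for r_val in right:
                yield key, l_val, r_val


def local_join(left_lines, right_lines):
    """Run the mapper, a sort (standing in for the shuffle), and the reducer."""
    records = list(map_tagged(left_lines, "A"))
    records += list(map_tagged(right_lines, "B"))
    records.sort()  # Hadoop would sort by key between map and reduce
    return list(reduce_join(records))


if __name__ == "__main__":
    users = ["1\talice", "2\tbob"]
    tools = ["1\tpig", "1\thive", "3\toozie"]
    for row in local_join(users, tools):
        print("\t".join(row))
```

Even this two-input inner join needs tagging, grouping, and buffering logic
by hand, and every additional join means another pass like this one, which is
the coding time we are trying to weigh against performance.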

Thank you Harsh for your comments. They are helpful!

> --
> Harsh J
