pig-dev mailing list archives

From Pallavi Rao <pallavi....@inmobi.com>
Subject Re: Why pig on spark use RDD API rather than DataFrame API ?
Date Mon, 09 Jan 2017 05:03:05 GMT
Yes. That was the first question I asked when I started work on Pig on
Spark. After investigating a little more, I realized that the current
design does not allow for easy use of the DataFrame API. We do an
operator-by-operator substitution and use Tuple as the datatype. We would
end up converting RDDs to DataFrames and vice versa, which is not really
optimal.

So, as Kelly said, we should take up that optimization after the first
release. We could even move to the Dataset API then.
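
To illustrate the round trip described above, here is a minimal toy sketch.
It does not use Pig or Spark at all: Row, ConversionCounter, and the three
operators are hypothetical stand-ins, and the only assumption is that every
operator must accept and return tuples, as in an operator-by-operator
substitution. Under that assumption, each row-based operator pays for one
conversion in and one conversion out.

```python
from dataclasses import dataclass

# Toy model only: these names are hypothetical, not Pig or Spark classes.


@dataclass(frozen=True)
class Row:
    """A DataFrame-style record: values bound to named columns."""
    schema: tuple
    values: tuple

    def get(self, column):
        return self.values[self.schema.index(column)]


class ConversionCounter:
    """Counts how often data crosses the tuple/row boundary."""

    def __init__(self):
        self.count = 0

    def to_rows(self, tuples, schema):
        self.count += 1
        return [Row(tuple(schema), tuple(t)) for t in tuples]

    def to_tuples(self, rows):
        self.count += 1
        return [list(r.values) for r in rows]


def filter_op(tuples):
    """Tuple-based operator: keep tuples whose second field exceeds 10."""
    return [t for t in tuples if t[1] > 10]


def project_op(tuples, schema, counter):
    """Row-based operator wrapped to honour a tuple-in, tuple-out contract."""
    rows = counter.to_rows(tuples, schema)
    projected = [Row(("name",), (r.get("name"),)) for r in rows]
    return counter.to_tuples(projected)


def upper_op(tuples, schema, counter):
    """Second row-based operator; it pays for the conversion again."""
    rows = counter.to_rows(tuples, schema)
    upper = [Row(r.schema, (r.get("name").upper(),)) for r in rows]
    return counter.to_tuples(upper)


def run_pipeline(data):
    """Chain the operators and report how many conversions were needed."""
    counter = ConversionCounter()
    step1 = filter_op(data)
    step2 = project_op(step1, ("name", "count"), counter)
    step3 = upper_op(step2, ("name",), counter)
    return step3, counter.count
```

In this model, two row-based operators in a three-operator plan already
cost four conversions. A plan that stayed in the row representation from
end to end would avoid them, which is roughly why the change is better
treated as a later redesign than as a drop-in swap.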

On Mon, Jan 9, 2017 at 7:53 AM, Zhang, Liyun <liyun.zhang@intel.com> wrote:

> Hi Jeff:
>   Thanks for your interest. When this project was started (August 2014),
> the DataFrame API was not available, which is why we do not use it in the
> project. An engineer at InMobi raised a similar idea before. In my view,
> if the DataFrame API is more suitable than the RDD API, we can consider
> this in later optimization work after the first release. For now, you can
> file a subtask on PIG-4856 (an umbrella JIRA for optimization work) and
> work on it if you are interested.
> Best Regards
> Kelly Zhang/Zhang,Liyun
> -----Original Message-----
> From: Jeff Zhang [mailto:zjffdu@gmail.com]
> Sent: Sunday, January 8, 2017 10:13 AM
> To: dev@pig.apache.org
> Subject: Why pig on spark use RDD API rather than DataFrame API ?
> Hi Folks,
> I am very interested in the Pig on Spark project. When I read the code,
> I found that the current implementation is based on the Spark RDD API. I
> don't know the original background (maybe the DataFrame API was not yet
> available when this project was started), but for now I feel the
> DataFrame API might be more suitable than the RDD API. Here are two
> advantages of the DataFrame API that I can think of:
> 1.  The DataFrame API is easier to use than the RDD API, although it is
> less flexible. I think Pig's tuple data structure is very similar to that
> of a DataFrame, so it should be possible to map each Pig operation to a
> DataFrame operation. If not, we can give feedback to the Spark community.
> 2.  Spark's Catalyst optimizer provides many optimizations for
> DataFrames. If we use the DataFrame API, we can leverage those
> optimizations rather than reinvent the wheel in Pig.
> What do you think? Thanks
