hadoop-pig-dev mailing list archives

From "Thejas M Nair (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-506) Does pig need a NATIVE keyword?
Date Thu, 26 Aug 2010 15:56:56 GMT

     [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-506:

    Attachment: PIG-506.2.patch

PIG-506.2.patch has
- Changes to get mapreduce operator working with new logical plan
- Changes to the LO/PO Native operators: the store and load for the operator are no longer
inside it; they are now part of the plan. As a result, several visitor changes made to handle
the load/store within LONative have been reverted.
- Fix for reporting failure when MR job corresponding to native operator fails.
- Removed TestNativeMapReduce from exclude list in ant target.

Some issues remain to be fixed, which I will address in new JIRAs:
- PIG-1570: The code path for handling failure of the MR job corresponding to native MR is different
and does not have the same behavior.
- PIG-1571: If the output file for native MR already exists, the query does not fail at compile time;
it fails only at runtime. This file is loaded by the nested load of the native MR operator, so it should
be possible to check for it at compile time.
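
For readers following along, here is a rough sketch of what a script using the mapreduce
operator with its nested store and load might look like. It follows the MAPREDUCE syntax as
documented for released Pig versions, which may differ in detail from this patch. The jar
name, the paths 'frompig' and 'topig', the schema, and the class com.example.MyJob are
placeholders reused or invented for illustration, not taken from the patch.
{code}
A = load 'myfile';
X = load 'myotherfile';
B = group A by $0;
C = foreach B generate group, myudf(A);
-- Nested store: pig writes C to 'frompig' before launching the native job.
-- Nested load: pig reads 'topig' once the native job has finished.
-- The backquoted string is passed to the jar as its arguments (hypothetical main class).
D = mapreduce 'mymr.jar'
    store C into 'frompig'
    load 'topig' as (key:chararray, value:chararray)
    `com.example.MyJob frompig topig`;
E = join D by $0, X by $0;
{code}
In this sketch 'topig' is the output location that PIG-1571 is concerned with: if it already
exists, the native job would presumably fail only once it runs.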

> Does pig need a NATIVE keyword?
> -------------------------------
>                 Key: PIG-506
>                 URL: https://issues.apache.org/jira/browse/PIG-506
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Aniket Mokashi
>            Priority: Minor
>             Fix For: 0.8.0
>         Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, NativeMapReduceFinale2.patch,
NativeMapReduceFinale3.patch, PIG-506.2.patch, PIG-506.patch, TestWordCount.jar
> Assume a user had a job that broke easily into three pieces.  Further assume that pieces
one and three were easily expressible in pig, but that piece two needed to be written in map
reduce for whatever reason (performance, something that pig could not easily express, legacy
job that was too important to change, etc.).  Today the user would either have to use map
reduce for the entire job or manually handle the stitching together of pig and map reduce
jobs.  What if instead pig provided a NATIVE keyword that would allow the script to pass off
the data stream to the underlying system (in this case map reduce)?  The semantics of NATIVE
would vary by underlying system.  In the map reduce case, we would assume that this indicated
a collection of one or more fully contained map reduce jobs, so that pig would store the data,
invoke the map reduce jobs, and then read the resulting data to continue.  It might look something
like this:
> {code}
> A = load 'myfile';
> X = load 'myotherfile';
> B = group A by $0;
> C = foreach B generate group, myudf(A);
> D = native (jar=mymr.jar, infile=frompig, outfile=topig);
> E = join D by $0, X by $0;
> ...
> {code}
> This differs from streaming in that it allows the user to insert an arbitrary amount
of native processing, whereas streaming allows the insertion of one binary.  It also differs
in that, for streaming, data is piped directly into and out of the binary as part of the pig
pipeline.  Here the pipeline would be broken, data written to disk, and the native block invoked,
then data read back from disk.
> Another alternative is to say this is unnecessary because the user can do the coordination
from java, using the PigServer interface to run pig and calling the map reduce job explicitly.
The advantages of the native keyword are that the user need not worry about coordination
between the jobs, since pig will take care of it, and that the user can make use of existing java
applications without being a java programmer.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
