hadoop-pig-dev mailing list archives

From "Antonio Magnaghi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-32) Abstraction Layer to decouple Pig from Back-End
Date Fri, 30 Nov 2007 16:40:43 GMT

    [ https://issues.apache.org/jira/browse/PIG-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547198 ]

Antonio Magnaghi commented on PIG-32:

Attaching some feedback from Trevor (Galago project)

From: Trevor Strohman [mailto:strohman@cs.umass.edu] 
Sent: Wednesday, November 21, 2007 5:01 PM
To: Antonio Magnaghi
Subject: Re: galago


Wow, you've done a lot of work here.  This looks great.  I hope you end up with lots of other

I'll just give you comments as I read the PigAbstractionLayer page.  Feel free to e-mail again
if you want different (or more) information.

The DataStorage interface looks great.  I'd consider using this in Galago for file storage
(I've always wanted to make the Hadoop DFS an option for data storage in Galago).  However,
since Galago uses the native filesystem right now, I wouldn't have to implement this interface.

Should addFromResource be a part of the configuration interface?  This isn't something I want
to implement myself (assuming there will be lots of these PigBackEndProperties objects around).
Maybe you could have a standard implementation that I could use.
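
One way to picture the "standard implementation" suggested above is an abstract base class that back-end authors extend. The following is only a minimal sketch: the BackEndProperties interface, its getValue/setValue methods, and the GalagoProperties class are hypothetical stand-ins, and the actual signatures on the PigAbstractionLayer page may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical stand-in for the configuration interface under discussion.
interface BackEndProperties {
    String getValue(String key);
    void setValue(String key, String value);
    void addFromResource(String resourceName) throws IOException;
}

// Shared base class: back-end authors inherit addFromResource instead of rewriting it.
abstract class AbstractBackEndProperties implements BackEndProperties {
    protected final Properties props = new Properties();

    @Override
    public String getValue(String key) {
        return props.getProperty(key);
    }

    @Override
    public void setValue(String key, String value) {
        props.setProperty(key, value);
    }

    // Loads key=value pairs from a classpath resource into this configuration.
    @Override
    public void addFromResource(String resourceName) throws IOException {
        try (InputStream in = getClass().getClassLoader().getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("resource not found: " + resourceName);
            }
            props.load(in);
        }
    }
}

// A back-end such as Galago would then only add what is specific to it (assumed name).
class GalagoProperties extends AbstractBackEndProperties {
}
```

With a base class like this, a new back-end gets resource loading for free and overrides only the methods it needs to specialize.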

A suggestion for the getStatistics() method in ExecutionEngine: perhaps part of the statistics
object could be a set of objects that can be tracked using Java Management Extensions (JMX).
At some point I plan to make Galago JMX-ready, which would give you a lot of information
about current running jobs, etc.
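
As an illustration of the JMX suggestion above, a statistics object can be exposed as an MXBean and registered with the platform MBean server, after which tools such as jconsole can read it. This is a sketch under assumptions: the JobStatistics class, its attributes, and the "example.pig" object-name domain are made up for the example and are not part of the proposed API.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class JmxStatisticsSketch {
    // JMX requires the management interface to be public; a name ending
    // in "MXBean" marks it as an MXBean interface.
    public interface JobStatisticsMXBean {
        long getTuplesProcessed();
        String getState();
    }

    // Hypothetical object that a back-end might hand out via getStatistics().
    static class JobStatistics implements JobStatisticsMXBean {
        private final AtomicLong tuples = new AtomicLong();
        private volatile String state = "RUNNING";

        void recordTuples(long n) {
            tuples.addAndGet(n);
        }

        void finish() {
            state = "DONE";
        }

        @Override
        public long getTuplesProcessed() {
            return tuples.get();
        }

        @Override
        public String getState() {
            return state;
        }
    }

    // Registers the statistics object so that JMX clients can inspect it.
    // The jobId must not contain ObjectName special characters (, = : " * ?).
    static ObjectName register(JobStatistics stats, String jobId) throws JMException {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("example.pig:type=JobStatistics,job=" + jobId);
        server.registerMBean(stats, name);
        return name;
    }
}
```

Once registered, the getters appear to JMX clients as the attributes TuplesProcessed and State.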

You might want a method on ExecutionEnginePhysicalPlan that allows the caller to block waiting
for completion.
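
A blocking wait of the kind suggested above could be built on a CountDownLatch. The sketch below is illustrative only: BlockingPlan, LatchBackedPlan, and the method names are assumptions and do not come from the ExecutionEnginePhysicalPlan definition on the wiki.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical methods a physical plan could offer; names are illustrative only.
interface BlockingPlan {
    boolean isDone();

    void awaitCompletion() throws InterruptedException;

    boolean awaitCompletion(long timeout, TimeUnit unit) throws InterruptedException;
}

class LatchBackedPlan implements BlockingPlan {
    private final CountDownLatch finished = new CountDownLatch(1);

    // Called by the back-end once the last task of the plan has finished.
    void markComplete() {
        finished.countDown();
    }

    @Override
    public boolean isDone() {
        return finished.getCount() == 0;
    }

    // Blocks the caller until markComplete() has been called.
    @Override
    public void awaitCompletion() throws InterruptedException {
        finished.await();
    }

    // Blocks up to the given timeout; returns false if the plan is still running.
    @Override
    public boolean awaitCompletion(long timeout, TimeUnit unit) throws InterruptedException {
        return finished.await(timeout, unit);
    }
}
```

The timed variant lets a caller poll for progress or give up without blocking indefinitely.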

I think the API as specified seems like something I could implement for Galago.

It's not clear from the API how a new LogicalPlan can refer to results generated by previous
LogicalPlans that have already been compiled and executed.  I never made this work in Galago
with the current implementation.  Also, it seems like you might want to be able to ask a completed
PhysicalPlan for a particular computed tuple stream.  Again, I never figured out how to do
that in the current Pig (at least not in a way that would work with Galago).
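
One possible shape for the missing piece described above is a registry of completed results keyed by alias, which a later plan or the caller can query for a particular tuple stream. This is a sketch only: CompletedResults and its methods are hypothetical, Object[] stands in for Pig's Tuple type, and keeping tuples in memory is a simplification (a real back-end would more likely keep references to stored files).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical registry of tuple streams produced by completed plans, keyed by alias.
class CompletedResults {
    private final Map<String, List<Object[]>> byAlias = new HashMap<>();

    // The back-end records the tuples computed for an alias when its plan finishes.
    void store(String alias, List<Object[]> tuples) {
        byAlias.put(alias, new ArrayList<>(tuples));
    }

    // Lets a later plan check whether an earlier result is available to build on.
    boolean has(String alias) {
        return byAlias.containsKey(alias);
    }

    // Returns the computed tuple stream for one alias.
    Iterator<Object[]> open(String alias) {
        List<Object[]> tuples = byAlias.get(alias);
        if (tuples == null) {
            throw new NoSuchElementException("no result for alias: " + alias);
        }
        return tuples.iterator();
    }
}
```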


On Nov 21, 2007, at 5:57 PM, Antonio Magnaghi wrote:

Hi Trevor,
I would like to follow up on the email exchange we had a few weeks ago about Galago and Pig.
In particular, at YRL we have decided to suggest, inside the Apache Pig incubator, some extensions
to Pig that could make it easier to integrate Pig with different back-ends. The main approach
is outlined at: http://wiki.apache.org/pig/PigAbstractionLayer.
At this point in time, I'm collecting some initial feedback before starting the actual implementation.
Do you have possible requirements in order to allow Pig to better support Galago? As you have
direct experience with some of the issues involved, I'd appreciate it if you could share some of
your thoughts on the proposed design.

From: Trevor Strohman [mailto:strohman@cs.umass.edu]
Sent: Tuesday, October 23, 2007 10:59 AM
To: Antonio Magnaghi
Subject: Re: galago

I'll do my best to answer your questions by e-mail, but you might also find it useful to download
the Galago code and my version of Pig.  In the galago/java/pig-galago directory, you'll find
a file called "pig-galago.patch" which contains all of the changes I made to the current Pig
distribution to make it work with Galago.  The whole download is here:

Before I start, I should mention that Galago is primarily meant to be a search engine toolkit,
kind of like Lucene.  It happens to have its own MapReduce-like job execution engine called
TupleFlow, and Pig can run on top of that.  TupleFlow has some similarities to the Pig model,
in that strongly-typed tuples flow between computational steps to create an answer.

> 1.) the high-level language the user can utilize to specify the tuple-processing;

Users usually create TupleFlow jobs by creating an XML job specification.  The job specification
allows the user to describe what Java objects will be used and how they should be connected
together in an execution graph.  TupleFlow then schedules these components out onto computational
nodes, sometimes with the help of a job execution system (like Grid Engine or Condor).  TupleFlow
is probably most similar to Microsoft's Dryad system.

In the Pig/Galago port, I translate Pig jobs into TupleFlow jobs in code, so no XML files
are made.

> 2.) how the tuple processing specification is mapped to a physical processing plan;

I know that Pig has both a high-level and low-level specification.  Compared to Pig, TupleFlow
really only has a low-level processing language.  Pig is TupleFlow's high level language (when
I want one).

> 3.) what type of platform/computational model is used.

I'm not exactly sure how to answer this.  It's all in Java, and objects are passed around
using files on a shared file system.  Unlike Pig, Galago typically creates a different Java
class for each type of tuple sent through the system.  When running Pig jobs on Galago, I've
hacked Galago a little bit to allow it to use Pig's Tuple type.

> I understand that the data/tuple processing is carried out by porting/extending the Pig or
> Pig-like front-end to run on a back-end that is not Hadoop/map-reduce? Is this correct?

Yes, that's right.  It might be best to think of TupleFlow as something that implements most
of the physical layer of Pig as well as a MapReduce execution engine.

> Abstraction Layer to decouple Pig from Back-End
> -----------------------------------------------
>                 Key: PIG-32
>                 URL: https://issues.apache.org/jira/browse/PIG-32
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Antonio Magnaghi
>            Assignee: Antonio Magnaghi
> I'm opening a new issue to track the development work to support an abstraction layer
for Pig as defined at http://wiki.apache.org/pig/PigAbstractionLayer

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
