hadoop-pig-dev mailing list archives

From "Mathieu Poumeyrol (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-206) Right granularity for a pig script
Date Wed, 16 Apr 2008 14:53:21 GMT

    [ https://issues.apache.org/jira/browse/PIG-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589586#action_12589586 ]

Mathieu Poumeyrol commented on PIG-206:
---------------------------------------

All,

A bit of background, first. Over the last four or five years, my team and I have implemented
several tools for our company [http://sk.idm.fr/opensource/software.html], one of them being
a data processing framework (skprod) including a language which shares many goals and characteristics
with Pig. 

The design was somewhat "pre-MapReduce", and we are now facing some scalability issues which,
combined with the cost of maintenance, make Pig look like a very good candidate to replace
skprod. The syntax is very different from Pig's, but the concepts map quite
easily. You may want to have a look at the getting-started paper [http://sk.idm.fr/opensource/doc/skprod/index.html]
to get an idea of it.

I am trying to "port" some of our existing data processing chains to Pig, and while many things
look very good, I now get the overall feeling that there is a difference of granularity:
we used to design huge skprod scripts (the language itself has some built-in modularity) that
perform a full-featured task end to end, and we usually tried to avoid chaining skprod scripts.
But this approach does not map very well to Pig, because:
 - there is no way of defining "Pig functions in Pig". This obviously leads to Pig code duplication.
 - every store statement is evaluated independently of the others, so there is no way
for a script to detect the existence of a reusable intermediate result.

This leads me to think that I should maybe use Pig to run very small tasks, and find (or build?)
something on top of it to drive my overall process, either calling Pig as many times as needed
or generating a huge Pig script...
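
To make the "something on top" idea concrete, here is a minimal sketch of such a driver, written
in Python. It is only an illustration under stated assumptions, not an existing tool: the Step
type, the drive function and the script/output paths are hypothetical, the existence check
defaults to the local filesystem (a real deployment would need a check against wherever Pig
stores its results), and run_pig assumes a `pig` launcher on the PATH that accepts a script file
as its argument.

```python
import os
import subprocess
from typing import Callable, List, NamedTuple


class Step(NamedTuple):
    """One small Pig task (hypothetical structure, for illustration only)."""
    script: str  # path to a small Pig script
    output: str  # path where that script stores its result


def run_pig(script: str) -> None:
    # Assumption: a `pig` launcher is on the PATH and takes a script file.
    subprocess.run(["pig", script], check=True)


def drive(
    steps: List[Step],
    exists: Callable[[str], bool] = os.path.exists,
    run: Callable[[str], None] = run_pig,
) -> List[str]:
    """Run steps in order, skipping any whose output already exists.

    Returns the list of scripts that were actually executed.
    """
    executed = []
    for step in steps:
        if exists(step.output):
            # The intermediate result is already there: reuse it.
            continue
        run(step.script)
        executed.append(step.script)
    return executed
```

Both the existence check and the runner are passed in as parameters, so the same loop can be
tried without a cluster. The sketch does not detect stale results: an output that exists is
always reused, even if its inputs have changed since it was produced.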

At this point I'd really like to know what you people think and where you plan to go...

> Right granularity for a pig script
> ----------------------------------
>
>                 Key: PIG-206
>                 URL: https://issues.apache.org/jira/browse/PIG-206
>             Project: Pig
>          Issue Type: Wish
>            Reporter: Mathieu Poumeyrol
>
> I'd like to understand what people have in mind when they picture pig scripts...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

