hadoop-pig-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-162) Rework mapreduce submission and monitoring
Date Wed, 07 May 2008 09:15:55 GMT

    [ https://issues.apache.org/jira/browse/PIG-162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594832#action_12594832 ]

Arun C Murthy commented on PIG-162:
-----------------------------------

Shravan, I'm not super-familiar with the new pipeline so pardon my temerity while I put forth
some thoughts on this patch... I'll confine it to Pig's usage of Map-Reduce:

1. From 0.17.0 onwards Hadoop no longer requires Writables (HADOOP-1986), so you could use
that to your advantage.
2. I agree with Alan about map-only jobs, just use something similar to PIG-196 (my unbiased
opinion *smile*).
3. RunnableReporter is a thread which blindly does 'reporting'. This makes it _very_ hard
to debug when applications go haywire. By this, you are going to miss a very important safety
net provided by Hadoop Map-Reduce, i.e. the ability to kill tasks which aren't 'progressing'.
*Please do not do this!* Ideally you should be using the *reporter* in the map/reduce functions
to report progress when tuples are being consumed.
4. The 'Slicer' notion is missing from PigInputFormat/PigSplit... are you planning to integrate
it later?
5. It's great that you are using Hadoop's jobcontrol, please let us know if anything was amiss
here: HADOOP-2484.
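
To make point 3 concrete, here is a minimal sketch of reporting progress from the loop that consumes tuples, rather than from a background thread. It does not use Hadoop's classes: ProgressReporter is a hypothetical stand-in for the progress() call on Hadoop's Reporter, and consume() and CountingReporter are illustrative names, not anything from the patch.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ProgressSketch {

    /** Hypothetical stand-in for the progress() call on Hadoop's Reporter. */
    interface ProgressReporter {
        void progress();
    }

    /** Illustrative reporter that only counts how often progress was reported. */
    static final class CountingReporter implements ProgressReporter {
        int calls = 0;

        @Override
        public void progress() {
            calls++;
        }
    }

    /**
     * Consumes tuples and reports progress once per tuple. If this loop
     * stalls, the reports stop too, so the framework can still detect a
     * task that is not progressing.
     */
    static int consume(Iterator<String> tuples, ProgressReporter reporter) {
        int consumed = 0;
        while (tuples.hasNext()) {
            tuples.next();          // per-tuple processing would go here
            consumed++;
            reporter.progress();    // tied to real work, not to a timer
        }
        return consumed;
    }

    public static void main(String[] args) {
        List<String> tuples = Arrays.asList("t1", "t2", "t3");
        CountingReporter reporter = new CountingReporter();
        int consumed = consume(tuples.iterator(), reporter);
        System.out.println(consumed + " tuples, " + reporter.calls + " progress reports");
    }
}
```

The point of the design is that progress reports are a side effect of consuming input, so a hung map or reduce function goes quiet instead of being masked by a thread that keeps reporting.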

----

Unrelated to this patch: I've felt the pain of jumping between Pig's notion of "Properties"
and Hadoop's Configuration/JobConf and worse, keeping them in-sync. This led to some obscure
bugs like PIG-230. Can you guys consider using Configuration/JobConf uniformly in both the
logical and physical layers? IMHO it will be a huge maintenance win... thoughts? Alan?

Similarly, I don't see the value in JobControlCompiler translating between MROperPlan and JobControl,
e.g.
{noformat}
+    public JobControl compile(MROperPlan plan, String grpName, Configuration conf, PigContext pigContext) throws JobCreationException{
+        this.plan = plan;
+        this.conf = conf;
+        this.pigContext = pigContext;
+        JobControl jobCtrl = new JobControl(grpName);
+        
+        List<MapReduceOper> leaevs = new ArrayList<MapReduceOper>();
+        leaevs = plan.getLeaves();
+        
+        for (MapReduceOper mro : leaevs) {
+            jobCtrl.addJob(compile(mro,jobCtrl));
+        }
+        return jobCtrl;
+    }
{noformat}

The notion of MRCompiler and JobControlCompiler makes me a tad uneasy; is it overkill?
Can we have one compiler and one visitor?
Should we use org.apache.hadoop.mapred.jobcontrol.Job more extensively?

I realise I'm air-brushing here, but yet ... 

> Rework mapreduce submission and monitoring
> ------------------------------------------
>
>                 Key: PIG-162
>                 URL: https://issues.apache.org/jira/browse/PIG-162
>             Project: Pig
>          Issue Type: Sub-task
>         Environment: This bug tracks work to rework the submission and monitoring interface
to map reduce as described in http://wiki.apache.org/pig/PigTypesFunctionalSpec
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: mapreduceJumbo.patch, split.png, TEST-org.apache.pig.test.TestMRCompiler.txt,
TEST-org.apache.pig.test.TestUnion.txt
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

