pig-dev mailing list archives

From "Shravan Matthur Narayanamurthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-162) Rework mapreduce submission and monitoring
Date Wed, 07 May 2008 16:24:56 GMT

    [ https://issues.apache.org/jira/browse/PIG-162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594957#action_12594957 ]

Shravan Matthur Narayanamurthy commented on PIG-162:

Thanks for reviewing the patch, Arun. My responses are inline.

1. Hadoop doesn't require Writables from 0.17.0 onwards: HADOOP-1986, so you could use that
as an advantage.
[shrav] This is great. However, in the types branch we are still at hadoop-15. We plan to
merge the changes from the main branch later, and I think this is a good candidate to be taken
up then. I thought about it for a while. Currently, we do not have an umbrella class for our
types other than WritableComparable, and I could not come up with a neat solution for this.
It is certainly a good point, and we need to spend more time on it.

2. I agree with Alan about map-only jobs, just use something similar to PIG-196 (my unbiased
opinion smile).
[shrav] I think doing something like PIG-196 would incur a branch in every call to the
map function to check whether it is a map-only job. This additional complexity is due to the
introduction of types. In map-only jobs, we don't care about extracting the key &
indexed tuple. In a map-reduce job, we have to do the extraction. This is the branching I
wanted to avoid. I guess I gave a naive solution by duplicating code: one copy for map-only &
the other for map-reduce. I guess a better solution, as Alan suggested, would be to have a
common base Map class with an abstract collectKeyAndTuple function, and to subclass it for
map-only & map-reduce so that each implements the function accordingly.
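
A minimal sketch of the abstract collectKeyAndTuple idea, assuming plain strings stand in for
Pig tuples and a tab separates key from value. Apart from collectKeyAndTuple, every name here
(MapSketch, BaseMap, MapOnly, MapReduce, output) is hypothetical and not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not Pig's actual classes.
public class MapSketch {

    /** Shared map logic. Subclasses decide how a processed tuple is emitted. */
    static abstract class BaseMap {
        // Collected (key, value) pairs; key is null for map-only output.
        final List<String[]> output = new ArrayList<>();

        // Called once per input record. Note there is no "is this map-only?" check here.
        void map(String record) {
            String processed = record.trim(); // placeholder for running the map plan
            collectKeyAndTuple(processed);
        }

        abstract void collectKeyAndTuple(String tuple);
    }

    /** Map-only job: emit the tuple as-is, no key extraction. */
    static class MapOnly extends BaseMap {
        @Override
        void collectKeyAndTuple(String tuple) {
            output.add(new String[] {null, tuple});
        }
    }

    /** Map-reduce job: split the tuple into key and value before emitting. */
    static class MapReduce extends BaseMap {
        @Override
        void collectKeyAndTuple(String tuple) {
            int sep = tuple.indexOf('\t');
            // Assumption: a tuple without a tab is treated as a key with an empty value.
            String key = sep < 0 ? tuple : tuple.substring(0, sep);
            String value = sep < 0 ? "" : tuple.substring(sep + 1);
            output.add(new String[] {key, value});
        }
    }
}
```

The choice between the two behaviours is made once, when the Map class is picked for the job,
rather than on every call to map.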

3. RunnableReporter is a thread which blindly does 'reporting'. This makes it very hard to
debug when applications go haywire. By this, you are going to miss a very important safety
net provided by Hadoop Map-Reduce, i.e. the ability to kill tasks which aren't 'progressing'.
Please do not do this! Ideally you should be using the reporter in the map/reduce functions
to report progress when tuples are being consumed.
[shrav] You are right. I will change that. Since this is a major change, I will do it once
this patch and Shubham's patch are in. I will write a proposal on the changes and submit it.
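
A rough sketch of reporting progress as tuples are consumed, instead of from a separate thread.
ProgressSink is a hypothetical stand-in for Hadoop's Reporter (which exposes a progress()
method); ProgressSketch, consume and the interval parameter are also assumptions made for
illustration.

```java
import java.util.List;

public class ProgressSketch {

    /** Hypothetical stand-in for the progress callback handed to map/reduce. */
    interface ProgressSink {
        void progress();
    }

    /**
     * Consumes tuples and reports progress once every `interval` tuples.
     * If tuple processing stalls, reporting stops too, so the framework can
     * still detect a task that is not progressing.
     */
    static int consume(List<String> tuples, ProgressSink sink, int interval) {
        int consumed = 0;
        for (String tuple : tuples) {
            // Real per-tuple work would go here.
            consumed++;
            if (consumed % interval == 0) {
                sink.progress();
            }
        }
        return consumed;
    }
}
```

Because the report is tied to actual work, a hung map or reduce function goes silent, which a
background reporting thread would hide.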

4. The 'Slicer' notion is missing from PigInputFormat/PigSplit... are you planning to integrate
it later?
[shrav] Yes, we have left it to the merging phase later.

5. It's great that you are using Hadoop's jobcontrol, please let us know if anything was amiss
here: HADOOP-2484.
[shrav] It works well. Probably some more documentation would be helpful.

Regarding Pig's notion of "Properties", are you referring to the backend and datastorage?
If so, I think we need to take this up during the merge of changes from the main branch

Regarding creating a separate JobControlCompiler: I did that because I wanted to leave some
room for the optimizer to act. Once the MROperPlan is built, it can be optimized, and then
JobControlCompiler can work on the optimized plan to generate the JobControl.
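
The staging described above could be sketched roughly as follows. This is only an
illustration: a list of operator names stands in for the MROperPlan, a list of strings stands
in for the generated JobControl, and CompileSketch, toJobControl and compile are hypothetical
names, not Pig's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class CompileSketch {

    // Final stage: turn each operator of the (already optimized) plan into a job description.
    static List<String> toJobControl(List<String> plan) {
        List<String> jobs = new ArrayList<>();
        for (String op : plan) {
            jobs.add("job(" + op + ")");
        }
        return jobs;
    }

    // The optimizer is a separate step that runs between plan building and job generation.
    static List<String> compile(List<String> mrPlan, UnaryOperator<List<String>> optimizer) {
        List<String> optimized = optimizer.apply(mrPlan);
        return toJobControl(optimized);
    }
}
```

Keeping job generation separate means an optimizer can be added or swapped without touching
the code that builds the plan or the code that emits jobs.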

> Rework mapreduce submission and monitoring
> ------------------------------------------
>                 Key: PIG-162
>                 URL: https://issues.apache.org/jira/browse/PIG-162
>             Project: Pig
>          Issue Type: Sub-task
>         Environment: This bug tracks work to rework the submission and monitoring interface
to map reduce as described in http://wiki.apache.org/pig/PigTypesFunctionalSpec
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: mapreduceJumbo.patch, split.png, TEST-org.apache.pig.test.TestMRCompiler.txt,

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
