pig-dev mailing list archives

From "Adam Warrington" <awarr...@gmail.com>
Subject Re: Review Request: PIG-1702. Fix for task output logs for streaming jobs containing null input-split information.
Date Thu, 19 May 2011 16:36:21 GMT


> On 2011-04-13 18:03:22, Dmitriy Ryaboy wrote:
> > trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java, line 205
> > <https://reviews.apache.org/r/547/diff/1/?file=14980#file14980line205>
> >
> >     please clean up whitespace :)

Oops, sorry. I'll clean that up.


> On 2011-04-13 18:03:22, Dmitriy Ryaboy wrote:
> > trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java, line 202
> > <https://reviews.apache.org/r/547/diff/1/?file=14980#file14980line202>
> >
> >     Do we care about the specifics of how this output is written?
> >     
> >     Seems like it would be less code, and potentially better in the long run (if
> >     we are dealing with other kinds of splits) to just call toString() on the InputSplit.
> >     FileSplit already defines its own toString() which prints out the path, the start
> >     offset, and the length.
> 
> Ashutosh Chauhan wrote:
>     I agree with Dmitriy. If possible, we should avoid special casing for a particular
>     type of InputSplit. Further, InputSplit provides getLocations() and getLength() api which
>     should be used instead of FileSplit specific api.

So it seems the options are to either:

1. Use the input split's toString() method.
2. Use just getLocations() and getLength(), which are part of the InputSplit API.

I'm leaning towards toString(), because for the common case of FileSplit it includes useful
information that getLocations() and getLength() alone won't provide, namely the file name and
the start offset.
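
To make the trade-off concrete, here is a rough sketch of the two options. It is not the
patch: SketchSplit and SketchFileSplit are hypothetical stand-ins for Hadoop's InputSplit
and FileSplit so the snippet compiles on its own, the "path:start+length" format is an
assumption based on the description above, and the log prefix and sample values are made
up. The real accessors also declare checked exceptions, which the stand-ins omit.

```java
import java.util.Arrays;

class SplitLogSketch {
    // Option 1: delegate to the split's own toString().
    static String describeViaToString(SketchSplit split) {
        return split == null ? "input split: <unknown>" : "input split: " + split;
    }

    // Option 2: use only the generic accessors shared by every split type.
    static String describeViaGenericApi(SketchSplit split) {
        if (split == null) {
            return "input split: <unknown>";
        }
        return "input split: locations=" + Arrays.toString(split.getLocations())
                + ", length=" + split.getLength();
    }

    public static void main(String[] args) {
        // Illustrative values only.
        SketchSplit split =
                new SketchFileSplit("/test.txt", 0L, 100L, new String[] {"host1"});
        System.out.println(describeViaToString(split));
        System.out.println(describeViaGenericApi(split));
    }
}

// Hypothetical stand-in for InputSplit: models only the two generic accessors.
abstract class SketchSplit {
    abstract long getLength();
    abstract String[] getLocations();
}

// Hypothetical stand-in for FileSplit, whose toString() reports path, start, and length.
class SketchFileSplit extends SketchSplit {
    private final String path;
    private final long start;
    private final long length;
    private final String[] hosts;

    SketchFileSplit(String path, long start, long length, String[] hosts) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }

    @Override
    long getLength() {
        return length;
    }

    @Override
    String[] getLocations() {
        return hosts;
    }

    @Override
    public String toString() {
        return path + ":" + start + "+" + length;
    }
}
```

With option 1 the log line carries the path and offset for file-based splits and still works
for any other split type; with option 2 only host names and length show up.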

If this is the common consensus, I'll submit a patch with that update. Let me know.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/547/#review452
-----------------------------------------------------------


On 2011-05-19 16:27:22, Adam Warrington wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/547/
> -----------------------------------------------------------
> 
> (Updated 2011-05-19 16:27:22)
> 
> 
> Review request for pig.
> 
> 
> Summary
> -------
> 
> This is a patch for PIG-1702, which describes an issue where the task output logs for
> PIG streaming jobs contain null input-split information. The ability to query the input-split
> information through the JobConf went away with the new MR API. We must now gain a reference
> to the underlying FileSplit, and query this reference for that information.
> 
> 
> Diffs
> -----
> 
>   trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java 1088692

> 
> Diff: https://reviews.apache.org/r/547/diff
> 
> 
> Testing
> -------
> 
> To test this, I wrote a very simple python script to pass data through using PIG. After
> checking the task logs of the completed task, the stderr logs now contain valid input split
> information. Below are the scripts and test data used.
> 
> ### PIG commands run ###
> DEFINE testpy `test.py` SHIP ('test.py');
> raw_records = LOAD '/test.txt';
> T1 = STREAM raw_records THROUGH testpy;
> dump T1;
> 
> ### test.py ###
> #!/usr/bin/python
> import sys
> 
> cnt = 0
> for line in sys.stdin:
>     print(line.strip() + " " + str(cnt))
>     cnt += 1
> 
> ### contents of /test.txt on hdfs ###
> one line
> two line
> three line
> four line
> 
> 
> Thanks,
> 
> Adam
> 
>

