pig-dev mailing list archives

From "Aniket Mokashi (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2629) Wrong Usage of Scalar which is null causes high namenode operation
Date Sun, 08 Apr 2012 07:39:51 GMT

    [ https://issues.apache.org/jira/browse/PIG-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249495#comment-13249495 ]

Aniket Mokashi commented on PIG-2629:
-------------------------------------

I think we should make the scalar syntax more explicit to avoid such problems. D->to or SCALAR(D.to)
are a few options.
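
For illustration, a minimal sketch of how the same expression reads today versus the proposed explicit forms (the D->to and SCALAR(D.to) spellings are only the proposal above, not syntax Pig currently accepts):

{code}
-- Current behaviour, after E = JOIN C BY from LEFT OUTER, D BY to;
F_col    = FILTER E BY (D::to IS NULL);   -- D::to is the disambiguated join column of E
F_scalar = FILTER E BY (D.to IS NULL);    -- D.to silently reads relation D as a scalar
                                          -- (errors out only if D has more than one row)

-- Proposed explicit spellings of the scalar case (hypothetical syntax):
-- F_scalar = FILTER E BY (SCALAR(D.to) IS NULL);
-- F_scalar = FILTER E BY (D->to IS NULL);
{code}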
                
> Wrong Usage of Scalar which is null causes high namenode operation 
> -------------------------------------------------------------------
>
>                 Key: PIG-2629
>                 URL: https://issues.apache.org/jira/browse/PIG-2629
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.1, 0.9.2, 0.10.0
>            Reporter: Anitha Raju
>
> Hi,
> Script
> {code}
> A = LOAD 'test3.txt'   AS (from:chararray);
> B = LOAD 'test2.txt'    AS (source:chararray,to:chararray);
> C = FILTER A BY (from == 'temp' );
> D = FILTER B BY (source MATCHES '.*xyz*.');
> E = JOIN C by (from) left outer,D by (to);
> F = FILTER E BY (D.to IS NULL);
> dump F;
> {code}
> Inputs
> {code}
> $ cat test2.txt
> temp    temp
> temp    temp
> temp    temp
> temp    temp
> temp    temp
> tepm    tepm
> $ cat test3.txt  |head
> temp
> temp
> temp
> temp
> temp
> temp
> tepm
> temp
> temp
> temp
> {code}
> Here I have mistakenly referenced 'to' as 'D.to' instead of 'D::to'. The D relation gives empty (null) output.
> The first MapReduce job computes D, which gives null results.
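> For reference, a minimal sketch of the corrected filter, which uses the disambiguated join column D::to instead of the scalar dereference D.to:
> {code}
> F = FILTER E BY (D::to IS NULL);
> dump F;
> {code}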
> The MapPlan of 2nd job
> {code}
> Union[tuple] - scope-56
> |
> |---E: Local Rearrange[tuple]{chararray}(false) - scope-36
> |   |   |
> |   |   Project[chararray][0] - scope-37
> |   |
> |   |---C: Filter[bag] - scope-26
> |       |   |
> |       |   Equal To[boolean] - scope-29
> |       |   |
> |       |   |---Project[chararray][0] - scope-27
> |       |   |
> |       |   |---Constant(temp) - scope-28
> |       |
> |       |---A: New For Each(false)[bag] - scope-25
> |           |   |
> |           |   Cast[chararray] - scope-23
> |           |   |
> |           |   |---Project[bytearray][0] - scope-22
> |           |
> |           |---F: Filter[bag] - scope-17
> |               |   |
> |               |   POIsNull[boolean] - scope-21
> |               |   |
> |               |   |---POUserFunc(org.apache.pig.impl.builtin.ReadScalars)[chararray] - scope-20
> |               |       |
> |               |       |---Constant(1) - scope-18
> |               |       |
> |               |       |---Constant(hdfs://nn-nn1/tmp/temp-1607149525/tmp281350188) - scope-19
> |               |
> |               |---A: Load(hdfs://nn-nn1/user/anithar/test3.txt:org.apache.pig.builtin.PigStorage) - scope-0
> |
> |---E: Local Rearrange[tuple]{chararray}(false) - scope-38
>     |   |
>     |   Project[chararray][1] - scope-39
>     |
>     |---Load(hdfs://nn-nn1/tmp/temp-1607149525/tmp-458164144:org.apache.pig.impl.io.TFileStorage) - scope-53--------
> {code}
> Here at F, the file /tmp/temp-1607149525/tmp281350188, which is the output of the 1st MapReduce job, is repeatedly read.
> If the input to F were non-empty then, since I am using the scalar wrongly, it would have failed with the expected error message 'Scalar has more than 1 row in the output'.
> But since it is null, ReadScalars returns before the exception is thrown, and the task logs repeatedly show:
> {code}
> 2012-04-03 11:46:58,824 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input paths to process : 1
> 2012-04-03 11:46:58,824 INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil: Total input paths to process : 1
> 2012-04-03 11:46:58,827 WARN org.apache.pig.impl.builtin.ReadScalars: No scalar field to read, returning null
> ....
> ....
> {code}
> That is, the '/tmp/temp-1607149525/tmp281350188' file is read again and again, which causes a very high number of namenode operations.
> One small mistake has ended up causing heavy namenode load.
> Regards,
> Anitha


        
