hadoop-pig-dev mailing list archives

From "Tamir Kamara (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (PIG-685) Distinct UDF progress reports
Date Sun, 29 Mar 2009 09:14:50 GMT

    [ https://issues.apache.org/jira/browse/PIG-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693532#action_12693532 ]

Tamir Kamara edited comment on PIG-685 at 3/29/09 2:13 AM:
-----------------------------------------------------------

I've replaced 1000 with 10 in the Distinct.java file (lines 129 & 148), but mappers are
still failing because they fail to report for 600 seconds. There is also a heap space error
on some mappers (same as before).
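
For reference, the kind of change described above (reporting progress every 10 tuples
instead of every 1000) can be sketched roughly as follows. This is only an illustration of
the pattern: ProgressCallback, PeriodicProgressSketch and distinctWithProgress are
hypothetical names, not Pig's actual reporter API and not the code in Distinct.java.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a progress reporter; NOT Pig's real API.
interface ProgressCallback {
    void progress();
}

class PeriodicProgressSketch {
    // Collects the distinct items of the input and invokes the callback
    // once for every `interval` items consumed (assumes interval > 0).
    static <T> Set<T> distinctWithProgress(Iterable<T> input,
                                           int interval,
                                           ProgressCallback callback) {
        Set<T> seen = new HashSet<>();
        long consumed = 0;
        for (T item : input) {
            seen.add(item);
            consumed++;
            // Report only when a callback was supplied and the interval is reached.
            if (callback != null && consumed % interval == 0) {
                callback.progress();
            }
        }
        return seen;
    }
}
```

A callback like this only fires between input items, so a smaller interval would not help
if the time is being spent elsewhere, for example in a long flush or in garbage collection.
Whether that is the case here has not been established.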

By the way, if I use the same script with no COUNT (i.e. GENERATE group as org, r6), the mappers
all finish just fine, but the reducers are failing due to GC overhead exceeded.
I'm running my tasks with 1024MB.


      was (Author: tamirk):
    I've replaced 1000 with 10 in the Distinct.java file (lines 129 & 148) and still mappers
are failing because of failure to report for 600 seconds. There's also, a heap space error
on some mappers (same as before).

By the way, if I use the same script with no COUNT (i.e. GENERATE group as org, r6;) the mappers
are all finishing just fine, but the reducers are failing due to GC overhead exceeded. 
I'm running my tasks with 1024MB.

  
> Distinct UDF progress reports
> -----------------------------
>
>                 Key: PIG-685
>                 URL: https://issues.apache.org/jira/browse/PIG-685
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>         Environment: Hadoop 0.18.3 on redhat, PIG svn from feb-01
>            Reporter: Tamir Kamara
>
> When using the DISTINCT function many of the map tasks are being killed because of failure
> to report for 600 seconds. It seems that PIG-646 should have addressed this but I'm still
> seeing many errors like this:
> 2009-02-21 11:41:53,916 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
> 2009-02-21 11:41:57,727 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
> 2009-02-21 11:41:57,730 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
> My query:
> r0 = load 'domain-org/*' as (domain:chararray, org:chararray);
> r3 = GROUP r0 BY org parallel 18;
> r4 = FOREACH r3 {
>        r5 = r0.domain;
>        r6 = distinct r5;
>        GENERATE group as org, COUNT(r6) as domains;
> }
> store r4 into 'org-domain-count';
> The source files are 21GB in total with some 800M lines, 60M distinct domains and 80K
> distinct orgs. Some orgs have 50M domains in them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

