hadoop-pig-dev mailing list archives

From "Tamir Kamara (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-685) Distinct UDF progress reports
Date Sun, 29 Mar 2009 09:14:50 GMT

    [ https://issues.apache.org/jira/browse/PIG-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693532#action_12693532 ]

Tamir Kamara commented on PIG-685:
----------------------------------

I've replaced 1000 with 10 in Distinct.java (lines 129 and 148), and mappers are still
failing because they do not report progress for 600 seconds. There is also a heap space
error on some mappers (same as before).

By the way, if I run the same script without COUNT (i.e. GENERATE group as org, r6;), the
mappers all finish fine, but the reducers fail because the GC overhead limit is exceeded.
I'm running my tasks with 1024MB.
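
For reference, the per-org distinct count can also be expressed without the nested
DISTINCT inside the FOREACH, so that no single bag has to hold all of an org's domains.
This is only a sketch and has not been run against this data set. It reuses the input
and output paths from the query quoted below; the aliases pairs, uniq, grouped and
counts are made up for illustration:

```pig
-- Sketch only: same result as the nested-DISTINCT query, assuming the
-- input schema from the original script (domain:chararray, org:chararray).
r0      = load 'domain-org/*' as (domain:chararray, org:chararray);
-- Keep just the two columns that matter, one (org, domain) pair per record.
pairs   = FOREACH r0 GENERATE org, domain;
-- Top-level DISTINCT: duplicate pairs are dropped by the sort/shuffle.
uniq    = DISTINCT pairs PARALLEL 18;
-- Each remaining record is one distinct domain for its org.
grouped = GROUP uniq BY org PARALLEL 18;
counts  = FOREACH grouped GENERATE group as org, COUNT(uniq) as domains;
store counts into 'org-domain-count';
```

The top-level DISTINCT runs as its own map-reduce job, and the final COUNT is algebraic,
so it can use the combiner. The cost is one extra job over the data.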


> Distinct UDF progress reports
> -----------------------------
>
>                 Key: PIG-685
>                 URL: https://issues.apache.org/jira/browse/PIG-685
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>         Environment: Hadoop 0.18.3 on redhat, PIG svn from feb-01
>            Reporter: Tamir Kamara
>
> When using the DISTINCT function many of the map tasks are being killed because of failure
> to report for 600 seconds. It seems that PIG-646 should have addressed this but I'm still
> seeing many errors like this:
> 2009-02-21 11:41:53,916 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
> 2009-02-21 11:41:57,727 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
> 2009-02-21 11:41:57,730 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
> My query:
> r0 = load 'domain-org/*' as (domain:chararray, org:chararray);
> r3 = GROUP r0 BY org parallel 18;
> r4 = FOREACH r3 {
>        r5 = r0.domain;
>        r6 = distinct r5;
>        GENERATE group as org, COUNT(r6) as domains;
> }
> store r4 into 'org-domain-count';
> the source files are 21GB in total with some 800M lines, 60M distinct domains and 80K
> distinct orgs. Some orgs have 50M domains in them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

