pig-dev mailing list archives

From "Ying He (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-979) Acummulator Interface for UDFs
Date Thu, 12 Nov 2009 00:18:48 GMT

    [ https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776760#action_12776760 ]

Ying He commented on PIG-979:

Performance tests show no noticeable difference between trunk and the accumulator patch when
calling non-accumulator UDFs.

The script used to test performance is:

register /homes/yinghe/pig_test/pigperf.jar;
register /homes/yinghe/pig_test/string.jar;
register /homes/yinghe/pig_test/piggybank.jar;

A = load '/user/pig/tests/data/pigmix_large/page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader()
as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info,

B = foreach A generate user, org.apache.pig.piggybank.evaluation.string.STRINGCAT(user, ip_addr)
as id;

C = group B by id parallel 10;

D = foreach C {
    generate group, string.BagCount2(B)*string.ColumnLen2(B, 0);
}

store D into 'test2';

The input data has 100M rows and the output has 57M rows, so the UDFs are called 57M times.
The results are:

 with patch:  5min 14sec
 w/o patch:   5min 17sec
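
To put the two timings side by side, the small Java sketch below converts them to seconds and computes the relative difference. The class and method names are made up for illustration, and the figures are single runs, so a gap of roughly 1% should be read as "no noticeable difference" rather than as a measured speedup.

```java
public class TimingDelta {

    // Convert a "Xmin Ysec" wall-clock reading to seconds.
    static int toSeconds(int minutes, int seconds) {
        return minutes * 60 + seconds;
    }

    // Percentage by which the candidate run is faster than the baseline run.
    static double relativeDiffPercent(int baselineSeconds, int candidateSeconds) {
        return 100.0 * (baselineSeconds - candidateSeconds) / baselineSeconds;
    }

    public static void main(String[] args) {
        int withPatch = toSeconds(5, 14);     // 314 s
        int withoutPatch = toSeconds(5, 17);  // 317 s
        System.out.println("difference in seconds: " + (withoutPatch - withPatch));
        System.out.println("relative difference (%): "
                + relativeDiffPercent(withoutPatch, withPatch));
    }
}
```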

> Acummulator Interface for UDFs
> ------------------------------
>                 Key: PIG-979
>                 URL: https://issues.apache.org/jira/browse/PIG-979
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>            Assignee: Ying He
>         Attachments: PIG-979.patch, PIG-979.patch
> Add an accumulator interface for UDFs that would allow them to take a set number of records
> at a time instead of the entire bag.
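
As a rough illustration of the idea in the issue description, here is a minimal, self-contained Java sketch of a UDF that receives a bag in fixed-size batches instead of all at once. The interface, class, and method names (`BatchAccumulator`, `CountAccumulator`, `runInBatches`) are hypothetical and are not taken from the attached patch or from Pig's API; plain `java.util` lists stand in for Pig's bags and tuples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AccumulatorSketch {

    /** Hypothetical interface (not Pig's actual API): the bag arrives in batches. */
    interface BatchAccumulator<T, R> {
        void accumulate(List<T> batch); // called once per batch of records
        R getValue();                   // called after the last batch
        void cleanup();                 // reset state before the next group
    }

    /** Counts records while holding only one batch in memory at a time. */
    static class CountAccumulator implements BatchAccumulator<String, Long> {
        private long count = 0;

        public void accumulate(List<String> batch) { count += batch.size(); }
        public Long getValue() { return count; }
        public void cleanup() { count = 0; }
    }

    /** Feeds the records to the accumulator, batchSize records at a time. */
    static <T, R> R runInBatches(Iterable<T> records, int batchSize,
                                 BatchAccumulator<T, R> acc) {
        List<T> batch = new ArrayList<>(batchSize);
        for (T record : records) {
            batch.add(record);
            if (batch.size() == batchSize) {
                acc.accumulate(batch);
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            acc.accumulate(batch); // final, possibly smaller, batch
        }
        R result = acc.getValue();
        acc.cleanup();
        return result;
    }

    public static void main(String[] args) {
        List<String> bag = Arrays.asList("a", "b", "c", "d", "e");
        // Batches of 2, 2 and 1 records; prints 5.
        System.out.println(runInBatches(bag, 2, new CountAccumulator()));
    }
}
```

The point of the design is that the UDF keeps only its running state (here a single counter) between calls, so memory use depends on the batch size rather than on the size of the bag.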

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
