hadoop-pig-dev mailing list archives

From "Jeff Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1426) Change the size of Tuple from Int to VInt when Serialize Tuple
Date Mon, 24 May 2010 05:24:23 GMT

    [ https://issues.apache.org/jira/browse/PIG-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870508#action_12870508 ]

Jeff Zhang commented on PIG-1426:
---------------------------------

I ran a simple experiment to compare performance.
This is the Pig script I used:
{code}
a = load '/input';
b = foreach a generate $0,$1;
c = group b by $0 PARALLEL 2;
result = foreach c generate group,SUM(b.$1);
dump result;
{code}

These are the results:
|| ||Using Int||Using VInt||
|Mapper Output|3,288,892,896|2,688,892,896|
|Elapsed time for the Pig script|12mins, 23sec|12mins, 1sec|
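
As a quick check of the figures in the table, the sketch below computes the absolute and relative differences. It assumes the mapper output is counted in bytes, which the comment does not state; the class and method names are illustrative only.

```java
import java.util.Locale;

public class MapOutputSavings {
    // Figures copied from the table above. The unit of "Mapper Output" is
    // not stated in the comment; bytes is an assumption.
    static final long INT_OUTPUT = 3_288_892_896L;
    static final long VINT_OUTPUT = 2_688_892_896L;

    // Elapsed times from the table, converted to seconds.
    static final int INT_SECONDS = 12 * 60 + 23;
    static final int VINT_SECONDS = 12 * 60 + 1;

    // Absolute reduction in mapper output.
    static long savedOutput() {
        return INT_OUTPUT - VINT_OUTPUT;
    }

    // Reduction in mapper output relative to the Int run, in percent.
    static double outputReductionPercent() {
        return 100.0 * savedOutput() / INT_OUTPUT;
    }

    // Absolute reduction in elapsed time, in seconds.
    static int savedSeconds() {
        return INT_SECONDS - VINT_SECONDS;
    }

    // Reduction in elapsed time relative to the Int run, in percent.
    static double timeReductionPercent() {
        return 100.0 * savedSeconds() / INT_SECONDS;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT,
                "mapper output saved: %d (%.1f%%)",
                savedOutput(), outputReductionPercent()));
        System.out.println(String.format(Locale.ROOT,
                "time saved: %d s (%.1f%%)",
                savedSeconds(), timeReductionPercent()));
    }
}
```

On these numbers the mapper output shrinks by 600,000,000 (about 18%), while the elapsed time drops by 22 seconds (about 3%).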


I haven't done a complete PigMix comparison yet, but I believe the change will improve performance.


> Change the size of Tuple from Int to VInt when Serialize Tuple
> --------------------------------------------------------------
>
>                 Key: PIG-1426
>                 URL: https://issues.apache.org/jira/browse/PIG-1426
>             Project: Pig
>          Issue Type: Improvement
>          Components: data
>    Affects Versions: 0.8.0
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>             Fix For: 0.8.0
>
>         Attachments: PIG_1426.patch
>
>
> Most of the time a tuple is not very large, so one byte is enough to store its size.
> I suggest using VInt instead of Int for the tuple size during serialization.
> Because the key type of the map output is Tuple, this can reduce the amount of data
> transferred from mapper to reducer.
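
To illustrate why a variable-length size field helps for small tuples, here is a minimal, self-contained sketch. It is not the attached patch. The encoder is modeled on the layout that Hadoop's WritableUtils.writeVInt is understood to use (one byte for values from -112 to 127, otherwise a length marker followed by the significant bytes); the class name VIntSketch and its helper methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class VIntSketch {
    // Writes value in a variable-length form: small values take one byte,
    // larger ones take a marker byte plus 1 to 4 data bytes.
    static void writeVInt(DataOutputStream out, int value) throws IOException {
        long i = value;
        if (i >= -112 && i <= 127) {
            out.writeByte((int) i);
            return;
        }
        int len = -112;
        if (i < 0) {
            i ^= -1L;      // one's complement so only the magnitude bits remain
            len = -120;
        }
        for (long tmp = i; tmp != 0; tmp >>= 8) {
            len--;         // one step per significant byte
        }
        out.writeByte(len); // marker byte encodes sign and byte count
        int dataBytes = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = dataBytes; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            out.writeByte((int) ((i >> shift) & 0xFF));
        }
    }

    // Number of bytes the variable-length encoding uses for value.
    static int vintLength(int value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeVInt(new DataOutputStream(buf), value);
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Number of bytes a fixed-width int uses (DataOutputStream.writeInt).
    static int fixedLength(int value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeInt(value);
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] tupleSizes = {2, 127, 128, 65536, Integer.MAX_VALUE};
        for (int size : tupleSizes) {
            System.out.println(size + ": fixed=" + fixedLength(size)
                    + " bytes, variable=" + vintLength(size) + " bytes");
        }
    }
}
```

Under this encoding, a tuple with two fields needs one byte for its size instead of four. The trade-off is that very large values need five bytes, one more than a fixed-width int.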

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

