hadoop-pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
Date Tue, 10 Aug 2010 18:37:16 GMT

    [ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896993#action_12896993 ]

Alan Gates commented on PIG-1501:
---------------------------------

It's not surprising that RCFile performs badly here, since in every case every column in the
row is used.  This is known to be a bad use case for columnar storage.  While for some data
sets the better compression may overcome this, I suspect that in the general case the stitching
costs will overwhelm any compression wins (as shown here).

I'm +1 on going with LZO/TFile.  As the LZO libs are GPL, we cannot ship with that as the default.
It wasn't clear to me from your last comment which you were proposing as the default.

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: compress_perf_data.txt, compress_perf_data_2.txt
>
>
> We would like to understand how compressing map results as well as reducer output
> in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
