hadoop-pig-dev mailing list archives

From "Yan Zhou (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
Date Tue, 10 Aug 2010 19:12:24 GMT

    [ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897005#action_12897005 ]

Yan Zhou commented on PIG-1501:
-------------------------------

By default, intermediate data is *not* compressed, which matches the existing
behavior.

For RCFile, the compression ratio is only slightly better than TFile's. In terms
of performance, the difference is within background noise. Stitching costs should be minimal.
Actually, full "projection" is the biggest advantage of RCFile over other columnar storage
like Zebra. I was surprised to see that the compression improvement over TFile is marginal. The
only cause I can think of is that the compression ratio is too sensitive to the data to pre-determine
or even pre-estimate.
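
To illustrate the point about data sensitivity, here is a minimal sketch that compares
compression ratios for two payloads of the same size. It uses Python's zlib as a stand-in
codec and is not the TFile or RCFile code path measured in this issue. The payloads and the
compression_ratio helper are hypothetical and exist only for this example.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return original size divided by compressed size (higher is better)."""
    if not data:
        raise ValueError("data must be non-empty")
    return len(data) / len(zlib.compress(data, level))


# Two hypothetical payloads of equal size but very different redundancy.
repetitive = b"user_id\tpage_url\ttimestamp\n" * 4096
random_like = os.urandom(len(repetitive))

ratios = {
    "repetitive": compression_ratio(repetitive),
    "random_like": compression_ratio(random_like),
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The same codec and settings give very different ratios depending on how redundant the input
is, so a ratio measured on one dataset says little about another.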

LZO is licensed under the GPL, but the Hadoop installation appears to include it, at least on my test cluster.

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: compress_perf_data.txt, compress_perf_data_2.txt
>
>
> We would like to understand how compressing map results as well as reducer output
> in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

