hadoop-pig-dev mailing list archives

From "Yan Zhou (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
Date Thu, 29 Jul 2010 18:06:17 GMT

    [ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893746#action_12893746 ]

Yan Zhou commented on PIG-1501:

gzip and lzo2 were tried as the compression codecs; TFile and RCFile were used as storage formats.
The tests were PigMix's L3 and L11, plus a variation of L3 with full projection, hereafter referred
to as L3_1, intended to expand the temporary data size. (In some cases multiple runs were executed,
particularly where system fluctuations were suspected.) End-to-end elapsed times were recorded.

The results below are from a 15-node cluster of 2 x Xeon L5420 2.50GHz / 16G RAM boxes:

          uncompressed                TFile(lzo)                  TFile(gzip)          RCFile(lzo2)
L3        133684504                   19674398                 11513958            18092681
                 1'40"                              1'45"                           1'40"

L3_1    3889095541              3697681875            2637742581         3675818160
                 3'10"                               4'4"                            3'25"

L11       25878480                   21368784                 15233146             21112892
                 1'52"                             1'52"                          1'57"  

A few observations are in order:

1) L3 has the highest compression ratio, while L3_1 and L11 have much lower compression ratios;
2) gzip compresses better than LZO2, at a small performance cost;
3) RCFile should have seen much better compression, as it is a columnar store, but the actual
difference is marginal. This is probably because of L11's unique values and many of L3_1's
random values such as time stamps, plus the presence of map-typed columns. The conclusion from
this observation is that compression of temporary intermediate data is not guaranteed to save
disk space to the desired degree; it depends on the values in the temporary data being compressed.
As a result, this feature should be made configurable;
4) The performance impact in these tests seems to be negligible: within background
noise, or within a few percent of the overall run times. But this is not conclusive yet.
Larger and more realistic queries would be more suitable for the comparison;
5) RCFile, as noted above, has not shown a clear advantage in columnar compression ratio.
But this observation could be data-sensitive.
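
To make observations 1) and 2) easier to check, here is a minimal Python sketch that derives the
compressed/uncompressed ratio for each query from the sizes in the table above. The names SIZES
and compression_ratios are illustrative only and are not part of Pig or PigMix; the script assumes
that the four figures in each row are data sizes in the same unit, so the ratio is unit-independent.

```python
# Sizes copied verbatim from the results table; the unit is whatever the table reports.
SIZES = {
    "L3": {
        "uncompressed": 133684504,
        "TFile(lzo)": 19674398,
        "TFile(gzip)": 11513958,
        "RCFile(lzo2)": 18092681,
    },
    "L3_1": {
        "uncompressed": 3889095541,
        "TFile(lzo)": 3697681875,
        "TFile(gzip)": 2637742581,
        "RCFile(lzo2)": 3675818160,
    },
    "L11": {
        "uncompressed": 25878480,
        "TFile(lzo)": 21368784,
        "TFile(gzip)": 15233146,
        "RCFile(lzo2)": 21112892,
    },
}


def compression_ratios(row):
    """Return compressed size / uncompressed size for each compressed format in one row."""
    base = row["uncompressed"]
    return {fmt: size / base for fmt, size in row.items() if fmt != "uncompressed"}


# Print one line per query; a smaller ratio means better compression.
for query, row in SIZES.items():
    ratios = compression_ratios(row)
    line = ", ".join(f"{fmt}: {ratio:.3f}" for fmt, ratio in ratios.items())
    print(f"{query}: {line}")
```

A ratio close to 1.0 means compression saved almost nothing, as with lzo on L3_1; a ratio well
below 1.0 means a large saving, as with all codecs on L3.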

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
> We would like to understand how compressing map results as well as reducer output
> in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
