pig-dev mailing list archives

From "Alan Gates (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
Date Fri, 16 Jul 2010 00:24:50 GMT

    [ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888972#action_12888972 ]

Alan Gates commented on PIG-1501:

Enabling compression directly on BinStorage as it stands would be a bad idea: bzip2 is
splittable but very slow, and gzip isn't splittable.

To do this we need to look at using SequenceFiles for moving data between MR jobs.  We can
use a null key and a value of type Tuple with SequenceFileInputFormat and
SequenceFileOutputFormat.  This will let us use the block-level compression in sequence
files.  For now we can continue with the same serialization used in BinStorage, though in
the future we may want to change this as well.
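
To illustrate why block-level compression keeps the data splittable, here is a minimal
toy sketch in plain Java. It is not Pig or Hadoop code and does not reproduce the real
SequenceFile layout (which also has a header and sync markers). The class and method
names (BlockCompressionSketch, compressInBlocks, readBlock) are hypothetical, and it
assumes records are strings that contain no newline characters. The point is only that
each block is compressed on its own, so a reader can start at any block boundary without
decompressing what came before it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy model of block-level compression; NOT the actual SequenceFile format.
final class BlockCompressionSketch {

    // Groups records into blocks of recordsPerBlock and compresses each block
    // independently of the others.
    static List<byte[]> compressInBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += recordsPerBlock) {
            int end = Math.min(start + recordsPerBlock, records.size());
            String joined = String.join("\n", records.subList(start, end));
            blocks.add(deflate(joined.getBytes(StandardCharsets.UTF_8)));
        }
        return blocks;
    }

    // Decompresses a single block without needing any other block.
    static List<String> readBlock(byte[] block) {
        String joined = new String(inflate(block), StandardCharsets.UTF_8);
        // Limit -1 keeps trailing empty records.
        return Arrays.asList(joined.split("\n", -1));
    }

    private static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    private static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported block");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

With five records and two records per block, this yields three blocks, and calling
readBlock on the second block returns only the third and fourth records. A single gzip
stream over the whole file gives no such entry points, which is why it cannot be split
across map tasks.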

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
> We would like to understand how compressing map results as well as reducer output
> in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.

