hadoop-pig-dev mailing list archives

From "Richard Ding (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal
Date Wed, 16 Dec 2009 00:58:18 GMT

    [ https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791101#action_12791101 ]

Richard Ding commented on PIG-1110:

bq. 1. If you worry about the API compatibility of PigStorage() since PigStorage() is the
default LoadFunc of Pig, there's another option that we can provide another LoadFunc having
the ability of compression, I mean we can create a new LoadFunc such as Bz2PigStorage().

I like this idea better. Bz2PigStorage would extend PigStorage and simply set the Hadoop compressor
in its constructor. If plain PigStorage is used, the file extension determines the codec.
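
A rough sketch of that split, for illustration only. The classes below are stand-ins, not the
real Pig classes; the names PigStorageSketch, Bz2PigStorageSketch and codecFor are hypothetical,
and the codec is represented as a plain string rather than a Hadoop codec object.

```java
// Stand-ins for illustration only; these are NOT the real Pig classes.
class PigStorageSketch {
    // null means "decide from the file extension"
    private final String forcedCodec;

    PigStorageSketch() {
        this(null);
    }

    // Subclasses can pin a codec in their constructor.
    protected PigStorageSketch(String forcedCodec) {
        this.forcedCodec = forcedCodec;
    }

    // Returns the codec name to use for the given file name.
    String codecFor(String fileName) {
        if (forcedCodec != null) {
            return forcedCodec;
        }
        if (fileName.endsWith(".gz")) {
            return "gzip";
        }
        if (fileName.endsWith(".bz2")) {
            return "bzip2";
        }
        return "none"; // unknown suffix: treated as uncompressed
    }
}

// The subclass only fixes the codec; everything else is inherited.
class Bz2PigStorageSketch extends PigStorageSketch {
    Bz2PigStorageSketch() {
        super("bzip2");
    }
}
```

With this shape, the default storage keeps its existing constructor signature, and only users
who opt into the subclass get forced bzip2 handling.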

bq. 2. Actually the file name in Store statement is the folder name not the file name, we
will get part-00000.bz2 under this folder. The part-00000.bz2 is the real file which is consumed
by hadoop. Hadoop will check the file name rather than the folder name to determine the compression.

You're right. But if you copy a .bz file from the local file system to HDFS, it won't be
recognized as a bzip file by Hadoop's TextInputFormat. The problem is that Hadoop doesn't read
the file header to determine the file type; it relies on the file extension.
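
To make the difference concrete, here is a small sketch contrasting the two approaches. It is
not Hadoop code: CompressionSniffer, byExtension and byHeader are hypothetical names, and the
suffix list is an assumption. The magic bytes are the standard ones (0x1f 0x8b for gzip, "BZh"
for bzip2).

```java
// Hypothetical helper for illustration; not part of Hadoop or Pig.
class CompressionSniffer {

    // Extension-based detection: looks only at the file name.
    static String byExtension(String fileName) {
        if (fileName.endsWith(".gz")) {
            return "gzip";
        }
        if (fileName.endsWith(".bz2")) {
            return "bzip2";
        }
        return "none";
    }

    // Header-based detection: looks only at the first bytes of the content.
    static String byHeader(byte[] head) {
        if (head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b) {
            return "gzip";
        }
        if (head.length >= 3
                && head[0] == 'B' && head[1] == 'Z' && head[2] == 'h') {
            return "bzip2";
        }
        return "none";
    }

    public static void main(String[] args) {
        // A bzip2 stream stored under a ".bz" name.
        byte[] head = {'B', 'Z', 'h', '9'};
        System.out.println("by extension: " + byExtension("data.bz"));
        System.out.println("by header:    " + byHeader(head));
    }
}
```

For a bzip2 stream saved as data.bz, the name-based check finds no codec while the header check
does, which is the mismatch described above.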

> Handle compressed file formats -- Gz, BZip with the new proposal
> ----------------------------------------------------------------
>                 Key: PIG-1110
>                 URL: https://issues.apache.org/jira/browse/PIG-1110
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>         Attachments: PIG-1110.patch, PIG_1110_Jeff.patch

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.