hadoop-pig-dev mailing list archives

From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files
Date Mon, 08 Sep 2008 14:04:44 GMT

    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629166#action_12629166 ]

Tom White commented on PIG-42:

It would be nice if the format could be generated using standard tools. By changing the flag
byte in the gzip header so that the signature lives in the file name field (which the gzip
tool can set) rather than the comment field (which it cannot), we can generate compatible
files using the following:

touch -mt 197007130719.25 Split
gzip -c Split file1 Split file2 > file.gz

The first split marker (the compressed, empty Split file) then has the following hexdump:
hexdump -n 26 -C file.gz
00000000  1f 8b 08 08 6d ca fe 00  00 03 53 70 6c 69 74 00  |....m.....Split.|
00000010  03 00 00 00 00 00 00 00  00 00                    |..........|
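
To make the layout of those 26 bytes easier to follow, here is a minimal Python sketch that rebuilds the marker field by field and can be compared against the dump. The function name build_split_marker and the MTIME constant are illustrative only and are not part of the attached patch. The mtime bytes 6d ca fe 00 decode to 16697965 seconds since the epoch (1970-07-13 06:19:25 UTC), and since touch interprets its timestamp in the local time zone, the same command may produce different bytes on a machine in a different zone.

```python
import gzip
import struct

# Little-endian value of the bytes 6d ca fe 00 seen in the hexdump.
MTIME = 16697965


def build_split_marker(name: bytes = b"Split",
                       mtime: int = MTIME,
                       os_byte: int = 0x03) -> bytes:
    """Assemble an empty gzip member whose header carries a file name."""
    header = bytes([0x1F, 0x8B])        # gzip magic number
    header += bytes([0x08])             # CM: deflate
    header += bytes([0x08])             # FLG: FNAME bit set
    header += struct.pack("<I", mtime)  # MTIME, little-endian
    header += bytes([0x00, os_byte])    # XFL, OS (03 = Unix)
    header += name + b"\x00"            # zero-terminated file name
    body = b"\x03\x00"                  # deflate stream for empty input
    trailer = struct.pack("<II", 0, 0)  # CRC32 and ISIZE of empty input
    return header + body + trailer


marker = build_split_marker()
print(marker.hex(" "))
# The marker is itself a valid gzip member that decompresses to nothing.
print(gzip.decompress(marker))
```

Because the marker decompresses to zero bytes, a reader that simply treats the file as one concatenated gzip stream should see no trace of it.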

Note that the OS field is 03 (Unix) rather than FF (unknown), but that should be OK as the
code doesn't use it when searching for the signature.
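
As an illustration of a search that ignores the OS byte, here is a rough Python sketch. It is not the code from gzip.patch; find_split_markers and the os_offset parameter are hypothetical names, and the marker bytes are copied from the hexdump above. It matches everything before and after byte 9 (the OS field) and accepts any value in that position.

```python
from typing import List

# Marker bytes as shown in the hexdump (OS byte = 03 at offset 9).
MARKER = bytes.fromhex("1f8b08086dcafe00000353706c697400"
                       "03000000000000000000")


def find_split_markers(data: bytes, marker: bytes = MARKER,
                       os_offset: int = 9) -> List[int]:
    """Return offsets where the marker occurs, ignoring the OS byte."""
    prefix = marker[:os_offset]       # magic, CM, FLG, MTIME, XFL
    suffix = marker[os_offset + 1:]   # file name, empty body, trailer
    offsets = []
    pos = data.find(prefix)
    while pos != -1:
        tail_start = pos + os_offset + 1
        if data[tail_start:tail_start + len(suffix)] == suffix:
            offsets.append(pos)
        pos = data.find(prefix, pos + 1)
    return offsets


# A marker written with OS = FF is found alongside one written with OS = 03.
unknown_os = MARKER[:9] + b"\xff" + MARKER[10:]
data = MARKER + b"AAAA" + unknown_os + b"BBBB"
offsets = find_split_markers(data)
print(offsets)
```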

> Pig should be able to split Gzip files like it can split Bzip files
> -------------------------------------------------------------------
>                 Key: PIG-42
>                 URL: https://issues.apache.org/jira/browse/PIG-42
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Benjamin Reed
>            Assignee: Benjamin Reed
>         Attachments: gzip.patch
> It would be nice to be able to split gzip files like we can split bzip files. Unfortunately,
> we don't have a sync point for the split in the gzip format.
> The gzip file format supports the notion of concatenated gzipped files. When gzipped files
> are concatenated together they are treated as a single file. So to make a gzipped file splittable
> we can use an empty compressed file with some salt in the headers as a sync signature. Then
> we can make the gzip file splittable by using this sync signature between compressed segments
> of the file.
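
The concatenation property described in the issue can be demonstrated with a short Python sketch. This only illustrates the idea and is not how Pig implements it; the segment contents are made up, and the marker bytes are taken from the hexdump in the comment above. The layout mirrors `gzip -c Split file1 Split file2`: a marker, then a compressed segment, repeated.

```python
import gzip

# Empty gzip member named "Split", used as the sync signature.
MARKER = bytes.fromhex("1f8b08086dcafe00000353706c697400"
                       "03000000000000000000")

segments = [b"first segment\n", b"second segment\n"]
members = [gzip.compress(s) for s in segments]

# marker, member, marker, member
blob = b"".join(MARKER + m for m in members)

# Read as one file: the members are decompressed in sequence and the
# empty markers contribute no bytes.
whole = gzip.decompress(blob)

# Read as splits: each chunk between markers is an independent gzip
# member, so it can be decompressed without the data before it.
parts = [gzip.decompress(chunk) for chunk in blob.split(MARKER) if chunk]

print(whole)
print(parts)
```

Splitting on the exact marker bytes is a simplification; a real reader would presumably scan forward from an arbitrary byte offset to the next signature.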

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
