hadoop-pig-dev mailing list archives

From "Benjamin Reed (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files
Date Fri, 07 Dec 2007 18:01:43 GMT

    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549494 ]

Benjamin Reed commented on PIG-42:

The patch is not ready to commit yet; it is a work in progress. I talked to Utkarash
about this, and the patch is missing termination of the split: currently each split will
not terminate correctly. There is a termination hook that bzip uses that I need to hook into.

Basically here are the things I need to add to finish:

1) Terminate split processing correctly
2) Add test cases
3) Encode the block size as part of the header so that we can get almost "perfect" splits. (For
example, a file that is compressed as 128M blocks should not be split on 64M boundaries even
if the block size of the filesystem is 64M.)

I'll try to get a committable patch this weekend.

> Pig should be able to split Gzip files like it can split Bzip files
> -------------------------------------------------------------------
>                 Key: PIG-42
>                 URL: https://issues.apache.org/jira/browse/PIG-42
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Benjamin Reed
>         Attachments: gzip.patch
> It would be nice to be able to split gzip files like we can split bzip files. Unfortunately,
> we don't have a sync point for the split in the gzip format.
> The gzip file format supports the notion of concatenated gzipped files. When gzipped files
> are concatenated together they are treated as a single file. So to make a gzipped file splittable
> we can use an empty compressed file with some salt in the headers as a sync signature. Then
> we can make the gzip file splittable by using this sync signature between compressed segments
> of the file.
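
The sync-signature idea in the description can be sketched roughly as follows. This is
only an illustration in Python, not the code in gzip.patch: the salt value, the choice of
the gzip header's file name field to carry it, and the helper names (make_sync_marker,
write_splittable, read_split) are all assumptions made for the example. It shows an empty
gzip member acting as a sync marker between compressed segments, and a reader that starts
at the first marker at or after its split offset and stops once the next segment begins
at or past the end of its split.

```python
import gzip
import io

# Hypothetical salt; any value unlikely to occur in compressed data would do.
SALT = "pig-sync-5f3a9c1e"


def make_sync_marker(salt: str = SALT) -> bytes:
    """Build an empty gzip member whose header carries the salt.

    Assumption for this sketch: the salt is stored in the header's
    file name field. mtime=0 keeps the marker bytes reproducible.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(filename=salt, mode="wb", fileobj=buf, mtime=0):
        pass  # write no payload: the member decompresses to nothing
    return buf.getvalue()


def write_splittable(segments, marker: bytes) -> bytes:
    """Compress each segment as its own gzip member, followed by a marker."""
    out = bytearray()
    for seg in segments:
        out += gzip.compress(seg)
        out += marker
    return bytes(out)


def read_split(data: bytes, start: int, end: int, marker: bytes) -> bytes:
    """Decompress every segment whose first byte lies in [start, end)."""
    if start == 0:
        pos = 0
    else:
        # Look back one marker length so a marker straddling `start`
        # (or ending exactly at it) is still found.
        hit = data.find(marker, max(0, start - len(marker)))
        if hit == -1:
            return b""
        pos = hit + len(marker)
    out = bytearray()
    # Terminate the split: stop once the next segment starts at or past `end`.
    while pos < len(data) and pos < end:
        nxt = data.find(marker, pos)
        if nxt == -1:
            nxt = len(data)
        out += gzip.decompress(data[pos:nxt])
        pos = nxt + len(marker)
    return bytes(out)


if __name__ == "__main__":
    marker = make_sync_marker()
    segments = [b"alpha\n" * 100, b"beta\n" * 100, b"gamma\n" * 100]
    data = write_splittable(segments, marker)
    mid = len(data) // 2
    first = read_split(data, 0, mid, marker)
    second = read_split(data, mid, len(data), marker)
    print(first + second == b"".join(segments))
    print(gzip.decompress(data) == b"".join(segments))
```

Because the marker is itself a valid (empty) gzip member, an ordinary gzip reader still
sees the whole file as one stream. The sketch ignores the small chance that the marker
bytes occur inside compressed data, which is what the salt is meant to make unlikely.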

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.