pig-dev mailing list archives

From "Benjamin Reed (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files
Date Mon, 03 Dec 2007 16:02:43 GMT

    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547882
] 

Benjamin Reed commented on PIG-42:
----------------------------------

There are two reasons I use an empty file with a comment:

1) It allows me to test that a gzip file is in fact splittable. We need to know up front that
we can split the gzip file. If the gzip isn't split at regular intervals, it's going to waste
a lot of time! The signature is more than a marker; it is metadata indicating that the file
can be split. You will also notice that if you run 'head' on the file you can see that it is
splittable.

2) It gives you a much more reliable signature. (20 bytes instead of 4)

You can still use standard tools without using Pig:

cat signature.gz > test.gz; gzip -c test1 >> test.gz; cat signature.gz >> test.gz;
gzip -c test2 >> test.gz

You use standard gunzip to decompress. You can also easily find the split boundaries outside
of Pig by searching for the byte sequence of signature.gz.
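
As a rough illustration of the idea above (not the code in gzip.patch), the sketch below
builds an empty gzip member with a comment in its header, places it in front of each
compressed segment, and then locates the boundaries with a plain byte search. The comment
text "pig-splittable-sync" and the helper names are made up for this example; the real
signature bytes used by the patch are not shown in this message.

```python
import gzip
import struct
import zlib

# Hypothetical marker text, chosen only for this example.
SYNC_COMMENT = b"pig-splittable-sync"


def make_signature(comment: bytes = SYNC_COMMENT) -> bytes:
    """Return an empty gzip member whose header carries `comment` (RFC 1952 FCOMMENT)."""
    header = struct.pack(
        "<BBBBIBB",
        0x1F, 0x8B,  # gzip magic bytes
        8,           # CM: deflate
        0x10,        # FLG: FCOMMENT set
        0,           # MTIME: zero, so the signature bytes are reproducible
        0,           # XFL
        255,         # OS: unknown
    )
    deflater = zlib.compressobj(wbits=-zlib.MAX_WBITS)  # raw deflate, no zlib wrapper
    body = deflater.compress(b"") + deflater.flush()
    trailer = struct.pack("<II", zlib.crc32(b""), 0)  # CRC32 and ISIZE of empty data
    return header + comment + b"\x00" + body + trailer


def build_splittable(chunks, sig: bytes) -> bytes:
    """Concatenate `sig + gzip(chunk)` for every chunk, like the cat/gzip commands."""
    return b"".join(sig + gzip.compress(chunk) for chunk in chunks)


def find_boundaries(data: bytes, sig: bytes) -> list:
    """Return the byte offsets at which the signature member starts."""
    offsets = []
    pos = data.find(sig)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(sig, pos + len(sig))
    return offsets


if __name__ == "__main__":
    sig = make_signature()
    data = build_splittable([b"test1\n", b"test2\n"], sig)
    # A standard gzip reader skips the empty members and sees one stream.
    print(gzip.decompress(data))
    print(find_boundaries(data, sig))
```

A byte search can in principle hit the same sequence inside compressed data by chance,
which is one reason a longer signature is more reliable than a short one.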

This also allows you to better control the grouping. If your gzip file is bigger than 4G,
it will be a concatenation, so there may be times when you want to process concatenated gzip
files together without splitting. Using the empty signature file allows you to do that.
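
To make the grouping point concrete, here is a small sketch (an illustration under
assumptions, not the patch itself). Gzip members that are concatenated without a
signature between them stay in the same segment, and each segment can be decompressed
on its own. The SIGNATURE constant is a hand-written empty gzip member with a made-up
comment, and split_segments is a hypothetical helper.

```python
import gzip

# Hypothetical signature: an empty gzip member with the FCOMMENT flag set.
SIGNATURE = (
    b"\x1f\x8b\x08\x10"        # magic, CM=deflate, FLG=FCOMMENT
    + bytes(4)                 # MTIME = 0
    + b"\x00\xff"              # XFL, OS=unknown
    + b"hypothetical-sync\x00" # zero-terminated comment (made up for this example)
    + b"\x03\x00"              # deflate stream for empty input
    + bytes(8)                 # CRC32 = 0, ISIZE = 0
)


def split_segments(data: bytes, sig: bytes = SIGNATURE) -> list:
    """Cut the file at every signature and drop the empty pieces."""
    return [part for part in data.split(sig) if part]


def decompress_segments(data: bytes, sig: bytes = SIGNATURE) -> list:
    """Decompress each segment independently, as separate splits would."""
    return [gzip.decompress(segment) for segment in split_segments(data, sig)]


if __name__ == "__main__":
    # "a" and "b" are concatenated with no signature between them, so they are
    # always processed together; "c" sits in its own segment.
    data = (
        SIGNATURE + gzip.compress(b"a") + gzip.compress(b"b")
        + SIGNATURE + gzip.compress(b"c")
    )
    print(decompress_segments(data))
```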

Now that I think about it more, it might also be good to reserve some bytes in the signature.gz
to store a block size. That way we can do intelligent splits when the fs blocksize doesn't
correspond to the gzip blocksize or the number of requested splits is very high.
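
One way to picture the mismatch described above is the sketch below. It assumes the
signature offsets are already known and assigns each compressed segment to the fixed-size
byte range in which the segment starts. This is only an assumption about how splits could
be planned, not the logic in gzip.patch, and plan_splits is a hypothetical name.

```python
from bisect import bisect_left


def plan_splits(sync_offsets, file_len, split_size):
    """Map fixed-size byte ranges to the segments that start inside them.

    sync_offsets must be sorted. Returns (start, end) byte ranges, one per
    non-empty split, each covering whole compressed segments.
    """
    plans = []
    for start in range(0, file_len, split_size):
        end = min(start + split_size, file_len)
        lo = bisect_left(sync_offsets, start)
        hi = bisect_left(sync_offsets, end)
        if lo == hi:
            continue  # no segment begins in this range, so the split has no work
        seg_end = sync_offsets[hi] if hi < len(sync_offsets) else file_len
        plans.append((sync_offsets[lo], seg_end))
    return plans


if __name__ == "__main__":
    offsets = [0, 100, 250, 400]
    print(plan_splits(offsets, 500, 200))
    print(plan_splits(offsets, 500, 50))
```

When the requested split size is much smaller than the spacing between signatures, many
ranges contain no segment start and are skipped. A block size stored in the signature
would let the planner see that spacing without scanning the whole file.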

> Pig should be able to split Gzip files like it can split Bzip files
> -------------------------------------------------------------------
>
>                 Key: PIG-42
>                 URL: https://issues.apache.org/jira/browse/PIG-42
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Benjamin Reed
>         Attachments: gzip.patch
>
>
> It would be nice to be able to split gzip files like we can split bzip files. Unfortunately,
> we don't have a sync point for the split in the gzip format.
> The gzip file format supports concatenated gzipped files. When gzipped files
> are concatenated together they are treated as a single file. So to make a gzipped file splittable
> we can use an empty compressed file with some salt in the headers as a sync signature. Then
> we can make the gzip file splittable by using this sync signature between compressed segments
> of the file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

