hadoop-pig-dev mailing list archives

From "Sam Pullara (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files
Date Sat, 01 Dec 2007 23:01:43 GMT

    [ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547528 ]

Sam Pullara commented on PIG-42:
--------------------------------

Is there a reason you chose empty files rather than the gzip ID as the sync marker?  It seems
it would be better if people could easily generate these files themselves, without using Pig
at all.  Each concatenated gzip file will start with "1F 8B 08 08" [1] (the gzip ID bytes 1F 8B,
compression method 08 for deflate, and flags 08 for a stored file name) if you create them
this way:

gzip -c test1 test2 > test.gz     [2]

In the few cases where that byte sequence turns up inside compressed data rather than at a
real boundary, you will get an exception from your gzip stream and can try again at the next
candidate boundary.
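
The scan-and-retry idea above can be sketched roughly as follows. This is only an illustration in Python, not the code in gzip.patch; the helper names (make_member, candidate_offsets, read_from_split) are hypothetical, and it assumes every member carries a stored file name so that its header begins with 1F 8B 08 08.

```python
import gzip
import io
import zlib

# Header prefix from the comment: ID1=1F, ID2=8B, CM=08 (deflate), FLG=08 (FNAME).
MAGIC = b"\x1f\x8b\x08\x08"


def make_member(name: str, payload: bytes) -> bytes:
    """Build one gzip member with a stored file name (hypothetical helper)."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf, mtime=0) as f:
        f.write(payload)
    return buf.getvalue()


def candidate_offsets(data: bytes, start: int = 0):
    """Yield every offset >= start where the header prefix appears."""
    pos = data.find(MAGIC, start)
    while pos != -1:
        yield pos
        pos = data.find(MAGIC, pos + 1)


def read_from_split(data: bytes, start: int) -> bytes:
    """Decompress from the first usable member boundary at or after start."""
    for off in candidate_offsets(data, start):
        try:
            # gzip.decompress reads all concatenated members from this point on.
            return gzip.decompress(data[off:])
        except (OSError, EOFError, zlib.error):
            # Assumed to be a false match inside compressed data: try the next one.
            continue
    return b""
```

A reader assigned a split that begins mid-file would call read_from_split(data, split_start) and pick up at the next member header. A real implementation would also need to stop at the end of its split instead of reading to the end of the file, which this sketch does not attempt.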

[1] http://www.gzip.org/zlib/rfc-gzip.html
[2] man gzip



> Pig should be able to split Gzip files like it can split Bzip files
> -------------------------------------------------------------------
>
>                 Key: PIG-42
>                 URL: https://issues.apache.org/jira/browse/PIG-42
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Benjamin Reed
>         Attachments: gzip.patch
>
>
> It would be nice to be able to split gzip files like we can split bzip files. Unfortunately,
> we don't have a sync point for the split in the gzip format.
> The gzip file format supports concatenated gzipped files. When gzipped files
> are concatenated together they are treated as a single file. So to make a gzipped file splittable
> we can use an empty compressed file with some salt in the headers as a sync signature. Then
> we can make the gzip file splittable by using this sync signature between compressed segments
> of the file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

