hadoop-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Check compression codec of an HDFS file
Date Thu, 05 Dec 2013 10:22:20 GMT
If you're looking for inspection based on the file header/contents, you
could download the file and run the Linux utility 'file' on it, which
should tell you the format.

I don't know about Snappy (AFAIK, we don't have a snappy
frame/container format support in Hadoop yet, although upstream Snappy
issue 34 seems resolved now), but Gzip files can be identified simply
by their header bytes for the magic sequence.
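
As a rough illustration of the header-byte check for Gzip, here is a
minimal Java sketch. It assumes only that a gzip stream starts with the
two magic bytes 0x1f 0x8b (per RFC 1952); the class and method names
(GzipMagic, isGzip) are made up for this example and are not part of
Hadoop.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not a Hadoop class.
final class GzipMagic {
    // Every gzip member begins with the bytes 0x1f 0x8b (RFC 1952).
    static boolean isGzip(byte[] header) {
        return header.length >= 2
            && (header[0] & 0xff) == 0x1f
            && (header[1] & 0xff) == 0x8b;
    }

    // Reads up to two bytes from the stream and checks them.
    // The stream is left positioned after the bytes that were read.
    static boolean isGzip(InputStream in) throws IOException {
        byte[] header = new byte[2];
        int n = 0;
        while (n < header.length) {
            int r = in.read(header, n, header.length - n);
            if (r < 0) {
                break; // stream ended before two bytes were available
            }
            n += r;
        }
        return n == header.length && isGzip(header);
    }
}
```

The InputStream could be whatever FileSystem.open(path) returns for the
HDFS file, so nothing needs to be downloaded beyond the first two bytes.
A negative result only means "not gzip"; it says nothing about Snappy.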

If it's sequence files you are looking to analyse, a simple way is to
read the first few hundred bytes of the file, which should contain the
codec class name. Programmatically you can use
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/io/SequenceFile.Reader.html#getCompressionCodec()
for sequence files.
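
The "read the first few hundred bytes" approach can be sketched in
plain Java without Hadoop on the classpath. This is only a heuristic
under two assumptions: the file starts with the "SEQ" magic of a
sequence file, and the codec class lives in the
org.apache.hadoop.io.compress package (a custom codec elsewhere would
not be found). The class and method names (SeqFileCodecSniffer,
sniffCodec) are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Hadoop class.
final class SeqFileCodecSniffer {
    // Assumption: the codec is one of the classes in Hadoop's
    // org.apache.hadoop.io.compress package.
    private static final Pattern CODEC = Pattern.compile(
        "org\\.apache\\.hadoop\\.io\\.compress\\.[A-Za-z0-9_$]+");

    // 'head' is the first few hundred bytes of the file.
    static Optional<String> sniffCodec(byte[] head) {
        // Sequence files start with the ASCII bytes 'S', 'E', 'Q'.
        if (head.length < 3
                || head[0] != 'S' || head[1] != 'E' || head[2] != 'Q') {
            return Optional.empty();
        }
        // ISO-8859-1 maps each byte to one char, so binary bytes in the
        // header cannot break the decoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = CODEC.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }
}
```

An empty result means either "not a sequence file" or "no codec name
found in these bytes" (for example, an uncompressed sequence file).
When Hadoop is available, SequenceFile.Reader#getCompressionCodec() is
the more reliable route.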

On Thu, Dec 5, 2013 at 5:10 AM, alex bohr <alexjbohr@gmail.com> wrote:
> What's the best way to check the compression codec that an HDFS file was
> written with?
>
> We use both Gzip and Snappy compression so I want a way to determine how a
> specific file is compressed.
>
> The closest I found is the getCodec but that relies on the file name suffix
> ... which don't exist since Reducers typically don't add a suffix to the
> filenames they create.
>
> Thanks



-- 
Harsh J
