hadoop-common-user mailing list archives

From Ian Soboroff <ian.sobor...@nist.gov>
Subject Re: *.gz input files
Date Thu, 04 Jun 2009 14:57:08 GMT

If your case is like mine, where you have lots of .gz files and you
don't want splits in the middle of those files, you can use the code I
just sent in the thread about traversing subdirectories.  In brief, your
RecordReader could do something like:

    public static class MyRecordReader
        implements RecordReader<DocLocation, Text> {
        private CompressionCodecFactory compressionCodecs = null;
        private long start;
        private long end;
        private long pos;
        private Path file;
        private LineRecordReader.LineReader in;

        public MyRecordReader(JobConf job, FileSplit split)
            throws IOException {
            file = split.getPath();
            start = 0;
            end = split.getLength();
            compressionCodecs = new CompressionCodecFactory(job);
            CompressionCodec codec = compressionCodecs.getCodec(file);

            FileSystem fs = file.getFileSystem(job);
            FSDataInputStream fileIn = fs.open(file);

            if (codec != null) {
                // Compressed input (.gz etc.): decompress before line-reading
                in = new LineRecordReader.LineReader(codec.createInputStream(fileIn), job);
            } else {
                in = new LineRecordReader.LineReader(fileIn, job);
            }
            pos = 0;
        }
        // next(), getPos(), getProgress(), close() omitted here
    }
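The decompress-then-line-read pattern above doesn't depend on Hadoop at all; gzip streams simply have to be read from the beginning, which is also why .gz files aren't splittable. As a self-contained illustration in plain Java (the class name GzipLines is mine, not from the thread):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.zip.*;

public class GzipLines {
    // Read every line from a gzipped stream. The stream must be
    // decompressed from its start -- there is no way to seek into
    // the middle of a gzip member, hence one mapper per .gz file.
    public static List<String> readLines(InputStream raw) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Round-trip two lines through gzip entirely in memory
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(buf), StandardCharsets.UTF_8)) {
            w.write("hello\nworld\n");
        }
        List<String> lines = readLines(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(lines);  // prints [hello, world]
    }
}
```

Hadoop's LineRecordReader.LineReader does essentially what the BufferedReader does here, just over the codec-wrapped FSDataInputStream.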

Alex Loddengaard <alex@cloudera.com> writes:

> Hi Adam,
> Gzipped files don't play that nicely with Hadoop, because they aren't
> splittable.  Can you use bzip2 instead?  bzip2 files play more nicely with
> Hadoop, because they're splittable.  If you're stuck with gzip, then take a
> look here: <http://issues.apache.org/jira/browse/HADOOP-437>.  I don't know
> if you'll have to set the same JobConf parameter in newer versions of
> Hadoop, but it's worth trying out.
> Hope this helps.
> Alex
> On Wed, Jun 3, 2009 at 11:50 AM, Adam Silberstein <silberst@yahoo-inc.com> wrote:
>> Hi,
>> I have some hadoop code that works properly when the input files are not
>> compressed, but it is not working for the gzipped versions of those
>> files.  My files are named with *.gz, but the format is not being
>> recognized.  I'm under the impression I don't need to set any JobConf
>> parameters to indicate compressed input.
>> I'm actually taking a directory name as input, and modeled that aspect
>> of my application after the MultiFileWordCount.java example in
>> org.apache.hadoop.examples.  Not sure if this is part of the problem.
>> Thanks,
>> Adam
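For reference, the JobConf parameter Alex mentions in connection with HADOOP-437 is the io.compression.codecs property, which lists the codec classes the CompressionCodecFactory will try when matching file extensions. A sketch of setting it explicitly on the old-API JobConf (MyJob is a placeholder class, and newer Hadoop releases already register GzipCodec by default, so this may be unnecessary on your version):

```java
// Sketch, not verified on every release: explicitly register codecs
// so *.gz files are recognized by CompressionCodecFactory.
JobConf conf = new JobConf(MyJob.class);  // MyJob is a placeholder
conf.set("io.compression.codecs",
    "org.apache.hadoop.io.compress.GzipCodec,"
    + "org.apache.hadoop.io.compress.DefaultCodec");
```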
