hadoop-mapreduce-user mailing list archives

From Jerry Lam <chiling...@gmail.com>
Subject Concatenate multiple sequence files into 1 big sequence file
Date Tue, 10 Sep 2013 15:07:10 GMT
Hi Hadoop users,

I have been trying to concatenate multiple sequence files into one.
Since the total size of the sequence files is quite big (1 TB), I don't
want to use MapReduce, because it would require 1 TB of space on the
reducer host to hold the temporary data.

I ended up doing what has been suggested in this thread:
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201308.mbox/%3CCAOcnVr2CuBdNkXutyydGjw2td19HHYiMwo4=JUa=SrXi51717w@mail.gmail.com%3E

It works very well. I wonder if there is a faster way to append to a
sequence file.

Currently, the code looks like this (omitting the opening and closing of
the sequence files, exception handling, etc.):

        // each seq is an input sequence file; seqs is assumed here to be
        // an Iterable<FileStatus> (e.g., from FileSystem#listStatus)
        // writer is a SequenceFile.Writer opened on the output file
        for (FileStatus seq : seqs) {
            SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, Reader.file(seq.getPath()));
            while (reader.next(readerKey, readerValue)) {
                writer.append(readerKey, readerValue);
            }
            reader.close();
        }

Is there a better way to do this? I think it is wasteful to deserialize
and serialize the key and value in the while loop, because the program
simply appends records to the output sequence file. Also, I don't seem
to be able to read and write fast enough (about 6 MB/s).
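
One idea I have been looking at is SequenceFile's raw-record API, which
copies the serialized bytes without deserializing them. This is only a
sketch, and it assumes every input file shares the output writer's key
and value classes and compression settings (appendRaw does not convert
between formats):

        // DataOutputBuffer and SequenceFile.ValueBytes are in
        // org.apache.hadoop.io; reader and writer are the same as above.
        DataOutputBuffer rawKey = new DataOutputBuffer();
        SequenceFile.ValueBytes rawValue = reader.createValueBytes();
        while (true) {
            rawKey.reset();  // nextRawKey appends into the buffer, so clear it first
            if (reader.nextRawKey(rawKey) == -1) break;  // -1 signals end of file
            reader.nextRawValue(rawValue);
            writer.appendRaw(rawKey.getData(), 0, rawKey.getLength(), rawValue);
        }

I have not benchmarked this, so I don't know whether it would help with
the ~6 MB/s throughput.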

Any advice is appreciated,


Jerry
