hadoop-common-user mailing list archives

From Adam Retter <adam.ret...@googlemail.com>
Subject Memory problems with BytesWritable and huge binary files
Date Fri, 24 Jan 2014 18:05:52 GMT
Hi there,

We have several diverse, large datasets to process (one set may be as
much as 27 TB), and all of the files in these datasets are binary
files. We need to be able to pass each binary file to several tools
running in the MapReduce framework.
We already have a working pipeline of MapReduce tasks that receives
each binary file (as BytesWritable) and processes it, but so far we
have tested it only with very small test datasets.

For any particular dataset, the size of the files involved varies
wildly, with each file being anywhere between about 2 KB and 4 GB. With
that in mind we have tried to follow the advice to read the files into
a Sequence File in HDFS. To create the Sequence File we have a
MapReduce job that uses a SequenceFileOutputFormat[Text, BytesWritable].

We cannot split these files into chunks; they must be processed by our
tools in our mappers and reducers as complete files. The problem we
have is that BytesWritable appears to load the entire content of a
file into memory. Now that we are trying to process our
production-size datasets, once a couple of large files are being
processed at the same time, the JVM throws the dreaded OutOfMemoryError.

What we need is some way to process these binary files by reading and
writing their contents as streams to and from the Sequence File, or
really any other mechanism that does not involve loading the entire
file into RAM! Our own tools that we use in the mappers and reducers
in fact expect to work with java.io.InputStream. We have tried quite a
few things now, including writing some custom Writable
implementations, but we then end up buffering data in temporary files,
which is not exactly ideal when the data already exists in the
sequence files in HDFS.
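
To illustrate the kind of access we are after, here is a minimal
sketch that uses only java.io and does not touch the Hadoop API. The
names StreamingSketch, BUFFER_SIZE and copy are hypothetical, and the
64 KB buffer is just an assumed value. The point is only that a
consumer of an InputStream can hold a small fixed buffer, rather than
the whole file, in memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch: these names are illustrative, not part of Hadoop.
final class StreamingSketch {

    // Assumed buffer size; memory use stays at this size however large the input is.
    static final int BUFFER_SIZE = 64 * 1024;

    // Copies everything from 'in' to 'out' through one fixed-size buffer
    // and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int read;
        // read() returns -1 at end of stream; otherwise the count of bytes placed in buffer.
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

Something shaped like this, but with the InputStream backed by a
record in the Sequence File, is what our tools would ideally be handed.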

Is there any hope?

Thanks, Adam.

Adam Retter

skype: adam.retter
tweet: adamretter
