hadoop-mapreduce-user mailing list archives

From Adam Retter <adam.ret...@googlemail.com>
Subject Re: Memory problems with BytesWritable and huge binary files
Date Fri, 24 Jan 2014 20:44:34 GMT
> Is your data in any given file a bunch of key-value pairs?

No. The content of each file itself is the value we are interested in,
and I guess that its filename is the key.

> If that isn't the
> case, I'm wondering how writing a single large key-value into a sequence
> file helps. It won't. Maybe you can give an example of your input data?

Well, from the O'Reilly Hadoop book, I rather got the impression that
HDFS does not like small files due to its 64MB block size, and that it
is instead recommended to place small files into a SequenceFile. Is
that not the case?
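
To make the "filename as key, file content as value" idea concrete, here
is a minimal sketch in plain Java. It is not Hadoop's SequenceFile format
or API; the class SmallFilePacker and its pack/unpack methods are
hypothetical names, and the length-prefixed record layout is an assumption
chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a container of small files.
// This is NOT the SequenceFile on-disk format, only the same key/value idea.
class SmallFilePacker {

    // Append each (filename, content) pair as one length-prefixed record.
    static byte[] pack(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                byte[] key = file.getKey().getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length);             // key length
                out.write(key);                       // key: the filename
                out.writeInt(file.getValue().length); // value length
                out.write(file.getValue());           // value: the whole file
            }
        }
        return buffer.toByteArray();
    }

    // Read the records back one at a time, in the order they were written.
    static Map<String, byte[]> unpack(byte[] container) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(container))) {
            while (in.available() > 0) {
                byte[] key = new byte[in.readInt()];
                in.readFully(key);
                // Each value is materialised in memory as a single array.
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                files.put(new String(key, StandardCharsets.UTF_8), value);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("report.doc", new byte[] {1, 2, 3});
        files.put("clip.wav", new byte[] {4, 5});
        for (Map.Entry<String, byte[]> e : unpack(pack(files)).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().length + " bytes");
        }
    }
}
```

Note that in this sketch every value is read into one byte array, so a
very large file still has to fit in memory as a whole when it is read back.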

Our input data varies across 130 different file types: it could be
Microsoft Office documents, video recordings, audio, CAD diagrams,
etc.

> If indeed they are a bunch of smaller sized key-value pairs, you can write
> your own custom InputFormat that reads the data from your input files one
> k-v pair after another, and feed it to your MR job. There isn't any need for
> converting them to sequence-files at that point.

As I mentioned in my initial email, each file cannot be split up!

> Thanks
> +Vinod
> Hortonworks Inc.
> http://hortonworks.com/

Adam Retter

skype: adam.retter
tweet: adamretter
