hadoop-mapreduce-user mailing list archives

From Michael Katzenellenbogen <mich...@cloudera.com>
Subject Re: Confused about splitting
Date Sun, 10 Feb 2013 15:58:28 GMT
Have your record reader seek (and discard data) until it finds the
beginning of a frame, and then always keep reading until the end/length of
the current frame/record. This ensures that even if a PB record is split
across blocks, you will read it in full.

For example, suppose you end up with 2 mappers and 3 PB records across 2
HDFS splits. Mapper 1 will read one record immediately (it sees the
frame marker right away), and continue reading a second record up until its
end/length -- even into the next block. Mapper 2 will begin reading its
block and look for a frame marker, discarding data up to the marker,
effectively reading in one record once it finds the frame marker.
Between both mappers you've now read all 3 records.
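In code, that strategy might look like the following minimal sketch. It is illustrative only: the class and method names and the flag byte 0x7E are assumptions (not the poster's actual format), and escape/CRC handling is omitted for brevity. The key property it relies on is the one the framing provides: the flag byte never appears inside a frame, so a reader can safely scan for it from any byte offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the seek-and-discard record reader described above.
public class SplitScanner {
    static final byte FLAG = 0x7E; // assumed flag byte; escaping guarantees it
                                   // never occurs inside a frame body

    /**
     * Return every frame whose opening flag lies in [splitStart, splitEnd).
     * Data before the first flag is discarded (it belongs to a frame that the
     * previous mapper reads in full), and the last frame may extend past
     * splitEnd -- mirroring how the previous split's reader runs past its own
     * boundary.
     */
    public static List<byte[]> readSplit(byte[] file, int splitStart, int splitEnd) {
        List<byte[]> frames = new ArrayList<>();
        int pos = splitStart;
        // Seek-and-discard: skip partial data until the first flag byte.
        while (pos < file.length && file[pos] != FLAG) pos++;
        while (pos < file.length) {
            int open = pos++;                 // position of a flag byte
            int bodyStart = pos;
            while (pos < file.length && file[pos] != FLAG) pos++;
            if (pos >= file.length) break;    // truncated frame: no closing flag
            if (pos > bodyStart) {            // nonempty run between flags = one frame
                if (open >= splitEnd) break;  // next mapper's record; stop here
                frames.add(Arrays.copyOfRange(file, bodyStart, pos));
            }
            // The flag at `pos` closes this frame but may sit directly against
            // the next frame's opening flag, so leave it for the next iteration
            // rather than consuming it.
        }
        return frames;
    }
}
```

With this convention, a record whose frame straddles the split boundary is read exactly once -- by the mapper whose split contains its opening flag.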


On Feb 10, 2013, at 10:36 AM, Christopher Piggott <cpiggott@gmail.com> wrote:

I'm a little confused about splitting and readers.

The data in my application is stored in files of Google protocol buffers.
 There are multiple protocol buffers per file.  There are a number of
simple ways to put multiple protobufs in a single file, usually involving
writing some kind of length field before each record.  We did something a
little more complicated by defining a frame similar to HDLC: frames are
enveloped by a flag, escapes are provided so the flag can't occur within a
frame, and there is a 32-bit CRC-like checksum just before the closing flag.
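A sketch of what such a framer might look like, assuming HDLC's conventional flag (0x7E), escape (0x7D), and XOR mask (0x20) -- the actual byte values and class names used here are not given in the thread and are purely illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.CRC32;

// Hypothetical sketch of HDLC-style framing: FLAG, escaped payload,
// escaped 32-bit checksum, FLAG.
public class RitFramer {
    static final byte FLAG = 0x7E; // assumed frame delimiter
    static final byte ESC  = 0x7D; // assumed escape byte
    static final byte MASK = 0x20; // assumed XOR mask for escaped bytes

    /** Wrap a serialized protobuf in a frame. */
    public static byte[] frame(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(FLAG);
        for (byte b : payload) writeEscaped(out, b);
        // 32-bit checksum over the unescaped payload, just before the
        // closing flag (the thread says "CRC-like"; CRC32 stands in here).
        CRC32 crc = new CRC32();
        crc.update(payload);
        long c = crc.getValue();
        for (int shift = 24; shift >= 0; shift -= 8)
            writeEscaped(out, (byte) (c >>> shift));
        out.write(FLAG);
        return out.toByteArray();
    }

    // Byte stuffing: FLAG and ESC occurring inside the frame are replaced
    // by ESC followed by the byte XORed with MASK, so a raw FLAG never
    // appears between the delimiters.
    private static void writeEscaped(ByteArrayOutputStream out, byte b) {
        if (b == FLAG || b == ESC) {
            out.write(ESC);
            out.write(b ^ MASK);
        } else {
            out.write(b);
        }
    }
}
```

The escaping is what makes the flag an unambiguous resynchronization point for a reader dropped at an arbitrary offset.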

The protobufs are all a type named RitRecord, and we have our own reader
that's something like this:

   public interface RitRecordReader {
      RitRecord getNext();
   }
The data collection application stores these things in ordinary flat files
(the whole thing is run through a GzipOutputFilter first, so the files are
compressed).  I'm having trouble understanding how best to apply this to
HDFS for map function consumption.  Our data collector writes 1-megabyte
files, but I can combine them for map/reduce performance.  To avoid TOO
much wasted space I was thinking about 16, 32, or 64 MB HDFS blocks (tbd).

What I don't get is this: suppose we have a long file that spans multiple
HDFS blocks.  I think I end up with problems similar to this guy:


where one of my RitRecord objects is half in one HDFS block and half in
another HDFS block.  If map tasks are assigned to nodes along HDFS block
boundaries, then I'm going to end up with a problem.  It's not yet clear to
me how to solve this.  I could make the problem LESS likely with bigger
blocks (like the default 128MB), but even then the problem doesn't
completely go away (for me, a >128MB file is unlikely but not impossible).
