apex-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Devendra Tagare <devend...@datatorrent.com>
Subject Aligning FileSplitter and BlocReader with hadoop.mapreduce InputFormats
Date Wed, 23 Mar 2016 18:11:43 GMT
Hi All,

Initiating this thread to get the community's opinion on aligning the
FileSplitter with InputSplit & the BlockReader with the RecordReader from
org.apache.hadoop.mapreduce.InputSplit &
org.apache.hadoop.mapreduce.RecordReader respectively.

Some more details and rationale on the approach,

InputFormat lets MR create Input Splits ie individual chunks of bytes.
The ability to correctly create these splits is determined by the Input
Format itself.eg SequenceFile format or Avro.

Internally these formats are organized as a sequence of blocks.Each block
can be compressed with a compression codec & it does not matter if this
codec in itself is splittable.
When they are set as an Input format, the MR framework creates input splits
based on the block boundaries given by the metadata object packed with the

Each InputFormat has a specific block definition. eg for Avro the block
definition is as below,

Avro file data block consists of:

A long indicating the count of objects in this block.
A long indicating the size in bytes of the serialized objects in the
current block, after any codec is applied
The serialized objects. If a codec is specified, this is compressed by that
The file's 16-byte sync marker.
Thus, each block's binary data can be efficiently extracted or skipped
without deserializing the contents. The combination of block size, object
counts, and sync markers enable detection of corrupt blocks and help ensure
data integrity.

Each map task gets an entire block to read.RecordReader is used to read the
individual records for the block and generates key,val pairs.
The records could be fixed length or use a schema as in the case of parquet
or Avro.

We can extend the BlockReader to work with RecordReader based on the sync
markers to correctly identify & parse the individual records.

Please send across your thoughts on the same.


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message