hive-user mailing list archives

From "Grant Overby (groverby)" <>
Subject Re: ORC file across multiple HDFS blocks
Date Tue, 28 Apr 2015 15:34:00 GMT
Expanding on Alan’s post:

Files are intended to span many blocks, and a single file may be read by many mappers. For
that to happen, the file goes through a process called input splitting, which divides the
input along HDFS block boundaries.
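
To make the splitting concrete, here is a minimal sketch of cutting a file into block-aligned
byte ranges. The class and method names are made up for illustration, and this is not Hadoop's
actual split logic, which applies further rules (configurable split sizes, for instance). It
only shows the idea of one range per block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Hadoop's InputFormat code.
final class SplitSketch {
    /** Returns {offset, length} pairs covering the file, one per block-sized range. */
    static List<long[]> blockAlignedSplits(long fileLen, long blockSize) {
        List<long[]> splits = new ArrayList<>();
        for (long off = 0; off < fileLen; off += blockSize) {
            // The last range may be shorter than a full block.
            splits.add(new long[] {off, Math.min(blockSize, fileLen - off)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file on 128 MB blocks yields three ranges: 128, 128 and 44 MB.
        for (long[] s : blockAlignedSplits(300 * mb, 128 * mb)) {
            System.out.println("offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

Each range can then be handed to a separate mapper, ideally one running on a node that holds
that block.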

If a unit of data within a file crosses an HDFS block boundary, part of that unit must be
sent from the node holding one block (and its mapper) to the node holding the block (and
mapper) for the other part. Take a CSV file, for example: the unit of data is a line, and
transferring a portion of a line between boxes is no big deal.
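
As a rough sketch of the CSV case, the code below assigns each line to the split that contains
its first byte. A split that starts mid-line skips the partial line, and the previous split
reads past its own end to finish it. The names and the ownership rule are assumptions chosen
for illustration; real line readers follow a similar idea but differ in the details.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Hadoop's line reader.
final class CsvSplitSketch {
    /** Returns the lines whose first byte lies in [start, end). */
    static List<String> linesForSplit(byte[] data, int start, int end) {
        int pos = start;
        // If the split begins mid-line, that line belongs to the previous split: skip it.
        if (start > 0 && data[start - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            // May run past 'end': these are the bytes fetched from the next block.
            while (lineEnd < data.length && data[lineEnd] != '\n') lineEnd++;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.US_ASCII));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "a,1\nbb,22\nccc,333\n".getBytes(StandardCharsets.US_ASCII);
        // Pretend the block boundary falls at byte 8, in the middle of "bb,22".
        System.out.println("split 0: " + String.join(" | ", linesForSplit(data, 0, 8)));
        System.out.println("split 1: " + String.join(" | ", linesForSplit(data, 8, data.length)));
    }
}
```

Here the first split returns "a,1" and "bb,22", pulling the tail of the second line from the
other block, and the second split returns only "ccc,333". Only a few bytes cross the wire.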

This changes a bit for ORC files, as the unit of data is a stripe. An ORC stripe is typically
a few hundred MB. Without some additional logic, a substantial part of data locality would
be lost; however, ORC has such additional logic. The stripe size of the ORC file should be
set a few MB below the HDFS block size, with padding enabled, to produce a 1:1 relationship
between an ORC stripe and an HDFS block. How many stripes or blocks are "in" a single file
is of no consequence so long as this 1:1 relationship is maintained.
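
To see why padding matters, here is a simplified model that lays fixed-size stripes end to
end and counts how many straddle a block boundary. It assumes every stripe is exactly the
configured size, which real stripes are not, and the names are made up for illustration. It
is not the ORC writer's padding logic.

```java
// Hypothetical illustration only; a simplified model of stripe placement.
final class StripeLayoutSketch {
    /** Counts stripes whose byte range crosses a block boundary. */
    static int straddlingStripes(int stripes, long stripeSize, long blockSize, boolean pad) {
        int straddling = 0;
        long offset = 0;
        for (int i = 0; i < stripes; i++) {
            long remainingInBlock = blockSize - (offset % blockSize);
            // With padding, a stripe that does not fit in the rest of the current
            // block starts at the next block instead.
            if (pad && stripeSize > remainingInBlock && stripeSize <= blockSize) {
                offset += remainingInBlock;
            }
            long firstBlock = offset / blockSize;
            long lastBlock = (offset + stripeSize - 1) / blockSize;
            if (firstBlock != lastBlock) straddling++;
            offset += stripeSize;
        }
        return straddling;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Four 122 MB stripes on 128 MB blocks.
        System.out.println("without padding: " + straddlingStripes(4, 122 * mb, 128 * mb, false));
        System.out.println("with padding:    " + straddlingStripes(4, 122 * mb, 128 * mb, true));
    }
}
```

In this model, three of the four stripes straddle a boundary without padding and none do with
padding, at the cost of 6 MB of padding per block.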

Below is an example config for 128 MB HDFS blocks.

            Configuration writerConf = new Configuration();
            // other config
            OrcFile.WriterOptions writerOptions = OrcFile.writerOptions(writerConf);
            // stripe size a few MB under the 128 MB block size
            writerOptions.stripeSize(122 * 1024 * 1024);
            // pad so that a stripe does not straddle a block boundary
            writerOptions.blockPadding(true);
            // other options
            Writer writer = OrcFile.createWriter(path, writerOptions);


Grant Overby
Software Engineer

From: Alan Gates
Date: Monday, April 27, 2015 at 2:05 PM
Subject: Re: ORC file across multiple HDFS blocks

to cross blocks and hence n
