hive-user mailing list archives

From: Prasanth Jayachandran <>
Subject: Re: ORC file question
Date: Mon, 10 Feb 2014 18:55:39 GMT
Hi Avrilia,

Is it a partitioned table? If so, approximately how many partitions and how many files are
there? What is the value of hive.input.format?

My suspicion is that there are ~180 files and each file is ~515MB in size. Since you
mentioned that you are using the default stripe size (256MB), the default HDFS block size for
ORC files will be chosen as 512MB. When a query is issued, the input files are split on HDFS
block boundaries. So if a file in a partition is 515MB, there will be 2 splits per file (512MB
on the HDFS block boundary + the remaining 3MB). This happens when the input format is set to
HiveInputFormat.
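To make the arithmetic concrete, here is a small sketch of how the split count follows from
the block size. This is not Hive's actual split code; the file size, file count, and block
size below are assumptions based on the numbers discussed in this thread.

// Rough sketch of HiveInputFormat-style splitting on HDFS block boundaries.
// All constants are assumptions taken from this thread, not values read from a cluster.
public class SplitEstimate {
    public static void main(String[] args) {
        long blockSize = 512L * 1024 * 1024;   // assumed HDFS block size chosen for ORC files
        long fileSize  = 515L * 1024 * 1024;   // assumed per-file size (~515MB)
        int  numFiles  = 180;                  // assumed number of files

        // One split per full block, plus one for any remainder.
        long splitsPerFile = (fileSize + blockSize - 1) / blockSize;   // = 2
        long tailBytes     = fileSize % blockSize;                     // = ~3MB remainder

        System.out.println("Splits per file: " + splitsPerFile);
        System.out.println("Tail split size: " + (tailBytes / (1024 * 1024)) + "MB");
        System.out.println("Total splits:    " + splitsPerFile * numFiles); // = 360
    }
}

Under these assumptions the 180 small "tail" splits would line up with the ~180 map tasks
that finish in about 3 seconds after reading only ~3MB each.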

Prasanth Jayachandran

On Feb 10, 2014, at 12:49 AM, Avrilia Floratou <> wrote:

> Hi all,
> I'm running a query that scans a file stored in ORC format and extracts some columns.
> My file is about 92 GB, uncompressed. I kept the default stripe size. The MapReduce job
> generates 363 map tasks.
> I have noticed that the first 180 map tasks finish in 3 secs (each), and after they complete
> the HDFS_BYTES_READ counter is equal to about 3MB. Then the remaining map tasks are the ones
> that scan the data and each one completes in about 20 sec. It seems that each of these map
> tasks gets as input 512 MB of the file. I was wondering, what exactly are the first short
> map tasks doing?
> Thanks,
> Avrilia

NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.
