hive-user mailing list archives

From Avrilia Floratou <>
Subject RCFile and Hadoop Counters
Date Tue, 31 Jan 2012 23:53:15 GMT

I have a question about the Hadoop counters when RCFile is used.
I have 16 TB of (uncompressed) data stored in compressed RCFile format. The compressed
RCFile data is approximately 3 TB.
I ran a simple scan query on this table. Each split is 256 MB (the HDFS block size).

From the counters of each individual map task I can see the following info:

HDFS_BYTES_READ : 91,235,561
Map input bytes: 268,191,006

Then I looked at the aggregate counters produced by the MR job. I see:

HDFS_BYTES_READ :  1,049,781,904,232
Map input bytes:  3,088,881,678,946

The total job time is 4980 sec. During the job I was running iostat to check the disk
bandwidth, and it showed 40 MB/sec on each of my 16 nodes. That means a total of
40 * 16 = 640 MB/sec across the cluster.

If the raw data read was 1,049,781,904,232 bytes, as the HDFS_BYTES_READ counter reports,
then at 640 MB/sec the job should have finished in about 1640 sec (1 TB / 640 MB/sec),
not 4980 sec.
What is wrong here?
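
To make the arithmetic above easy to check, here is a minimal Python sketch that puts the quoted numbers side by side. It assumes decimal megabytes (1 MB = 1,000,000 bytes) and that the 40 MB/sec iostat reading held steady on all 16 nodes for the whole job. The function names are only illustrative. It does not explain the discrepancy; it just reproduces the calculation for both counters.

```python
# Values quoted in this message (aggregate job counters and measurements).
HDFS_BYTES_READ = 1_049_781_904_232
MAP_INPUT_BYTES = 3_088_881_678_946
JOB_SECONDS = 4980
NODES = 16
IOSTAT_MB_PER_SEC_PER_NODE = 40

MB = 1_000_000  # assumption: decimal megabytes


def expected_seconds(total_bytes):
    """Seconds needed to read total_bytes at the aggregate iostat bandwidth."""
    cluster_bytes_per_sec = NODES * IOSTAT_MB_PER_SEC_PER_NODE * MB
    return total_bytes / cluster_bytes_per_sec


def implied_mb_per_sec(total_bytes):
    """Cluster-wide MB/sec implied by reading total_bytes over the whole job."""
    return total_bytes / JOB_SECONDS / MB


for name, value in [("HDFS_BYTES_READ", HDFS_BYTES_READ),
                    ("Map input bytes", MAP_INPUT_BYTES)]:
    print(f"{name}: expected {expected_seconds(value):.0f} sec at 640 MB/sec, "
          f"implied {implied_mb_per_sec(value):.1f} MB/sec over {JOB_SECONDS} sec")

# Ratio between the two aggregate counters.
print(f"HDFS_BYTES_READ / Map input bytes = {HDFS_BYTES_READ / MAP_INPUT_BYTES:.3f}")
```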

I'm wondering what the two counters HDFS_BYTES_READ and Map input bytes actually
represent when compressed RCFiles are used as the storage layer, and how they relate
to the raw bandwidth I see in iostat.
