hive-user mailing list archives

From Prasanth Jayachandran
Subject Re: Optimising mappers for number of nodes
Date Mon, 03 Feb 2014 18:59:30 GMT

hive.max.split.size can be increased to decrease the number of mappers, since larger splits
mean fewer map tasks. Reference: (slide number 38)

Also, using CombineHiveInputFormat (the default input format) will combine multiple small files
into a larger split, and hence produce fewer mappers.
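
To illustrate the split-size point above, here is a rough sketch in Python. It assumes the mapper count is approximately the input size divided by the split size, rounded up. Real split computation also depends on block boundaries, the LZO index and the input format, so treat the results as estimates only. The function name and the candidate split sizes are hypothetical; only the 12GB figure comes from the thread, and it is taken here as 12 GiB.

```python
MIB = 1024 * 1024


def estimate_mappers(input_bytes: int, split_bytes: int) -> int:
    """Approximate number of map tasks: one per split, rounded up."""
    if split_bytes <= 0:
        raise ValueError("split_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-input_bytes // split_bytes)


# The 12GB compressed LZO file from the question, assumed to be 12 GiB.
compressed_input = 12 * 1024 * MIB

# Hypothetical split sizes, in MiB, to compare.
for split_mib in (100, 256, 512, 1024):
    mappers = estimate_mappers(compressed_input, split_mib * MIB)
    print(f"split size {split_mib} MiB -> about {mappers} mappers")
```

Under this approximation, a split of roughly 100 MiB gives a mapper count close to the 120 tasks reported in the question, while larger splits bring it down to a few tasks per node on a 4 node cluster.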

Prasanth Jayachandran

On Feb 3, 2014, at 10:20 AM, KingDavies wrote:

> Our platform has a 40GB raw data file that was compressed with LZO (12GB compressed) to
> reduce network IO with S3.
> Without indexing, the file is unsplittable, resulting in 1 map task and poor cluster utilisation.
> After indexing the file to make it splittable, the Hive query produces 120 map tasks.
> However, with the 120 tasks distributed over a small 4 node cluster, it takes longer to
> process the data than when it wasn’t splittable and processing was done by a single node
> (1h20mins vs 17mins). This was with a fairly simple select from where query, without
> distinct, group by or order.
> I’d like to utilise all nodes in the cluster to reduce query time. What’s the best
> way to have the data crunched in parallel but with fewer mappers?
