hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Raju Chinthala <>
Subject Auto scaling on ec2
Date Thu, 10 Jul 2014 10:06:59 GMT
Scaling a Hadoop cluster with Hive has the following issues

1. Adding a computing node(Scaling up) when load on the cluster is high
decreases the execution time of the queries but its there is still a huge
time lag as the new node works on data from other nodes.

2. The process of removing a node from the cluster(Scaling down) when load
on the cluster is low, is also time consuming.

To reduce the time to scale the Hadoop cluster, we came up with the
following solution.

Prior to adding a new node, move the data from the existing nodes to the
new node. This balances the cluster and if a new task comes, the newly
added node can take it up as it already has the data (data locality).

When decommissioning a node, move the data available on that node to the
other nodes in the cluster., then decommission it.
We tested this with hive on hadoop on 5 node cluster,

*Time taken for Hive query,*

*4node cluster*

*Existing procedure(added new node) 5node cluster*

*New procedure(added new node) 5node cluster*




check the results and the approach here

Any drawbacks/suggestions on this approach, we would like to hear from you..

Thanks & Regards,
Raju Chinthala

View raw message