hadoop-common-user mailing list archives

From Amandeep Khurana <ama...@gmail.com>
Subject Re: Issues with performance on Hadoop/Hive
Date Wed, 02 Sep 2009 07:51:33 GMT
Answers inline

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Tue, Sep 1, 2009 at 10:08 PM, Ramiya V <Ramiya_V@persistent.co.in> wrote:

> Hi,
> I have set up a Hadoop cluster of 4 physical nodes, each with 2GB of RAM.
> I am currently using the Hive sub-project to run queries on
> 45GB of data. I have a few questions that I need answered:
> 1) The performance I am getting with the above setup is quite bad. It
> takes approximately 39 minutes for a simple select query (with a where
> clause). I have set mapred.map.tasks=13 and mapred.reduce.tasks=7. Is this
> setting good enough for the above setup? Are there any significant
> configuration parameters I need to set to get better performance on Hive?
Check the resource utilization. I think you shouldn't be running more than
3 mappers + 1 reducer on each node at any time (given the hardware you are
using). But that mostly depends on the amount of work being done in the
mappers and reducers.
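
To put numbers on that suggestion, here is a rough sketch (plain Python, nothing Hadoop-specific) of how many tasks the cluster could run at once with 3 map slots and 1 reduce slot per node, and how many "waves" the requested 13 maps and 7 reduces would need. The function names are made up for illustration. The per-node slot limits are, as far as I know, set through mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in this generation of Hadoop, so check that against your version.

```python
import math

def cluster_slots(nodes, slots_per_node):
    """Total task slots available across the cluster (hypothetical helper)."""
    return nodes * slots_per_node

def waves(tasks, slots):
    """Number of rounds needed to run `tasks` tasks on `slots` concurrent slots."""
    return math.ceil(tasks / slots)

NODES = 4                   # from the original question
MAP_SLOTS_PER_NODE = 3      # suggested upper bound, not a measured optimum
REDUCE_SLOTS_PER_NODE = 1   # suggested upper bound, not a measured optimum

map_slots = cluster_slots(NODES, MAP_SLOTS_PER_NODE)
reduce_slots = cluster_slots(NODES, REDUCE_SLOTS_PER_NODE)

# Task counts requested in the question (mapred.map.tasks / mapred.reduce.tasks)
map_waves = waves(13, map_slots)
reduce_waves = waves(7, reduce_slots)

print("map slots:", map_slots, "map waves:", map_waves)
print("reduce slots:", reduce_slots, "reduce waves:", reduce_waves)
```

With these assumptions the 7 reducers cannot all run at the same time on 4 reduce slots, so a reduce count that is a multiple of (or just under) the slot count may be a better starting point.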

> 2) Does anybody know how exactly the data on HDFS is distributed across
> nodes in a cluster? Also when we load the tables in Hive (by firing Load
> command on master node), how and where is the data placed on HDFS in a
> cluster?

Files are divided into blocks and the blocks are stored on the datanodes.
Each block is 64MB by default. I'm not sure how the blocks are distributed
among the datanodes.
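
The block size also matters for question 1, because the number of map tasks normally follows the number of input splits rather than mapred.map.tasks, which is only a hint. The sketch below assumes the old-API FileInputFormat rule of max(minSize, min(goalSize, blockSize)) and treats the 45GB as one large file. Real tables made of many files get at least one split per file, and the small slack Hadoop allows on the last split is ignored here. The function names are hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def split_size(total_bytes, requested_maps, block_size, min_size=1):
    """Approximate split size: max(min_size, min(goal_size, block_size))."""
    goal_size = total_bytes // max(requested_maps, 1)
    return max(min_size, min(goal_size, block_size))

def num_splits(total_bytes, size):
    """Ceiling division: how many splits of `size` bytes cover the input."""
    return -(-total_bytes // size)

total = 45 * GB     # data size from the original question
block = 64 * MB     # default block size mentioned above

size = split_size(total, requested_maps=13, block_size=block)
count = num_splits(total, size)

print("split size (MB):", size // MB)
print("approximate number of map tasks:", count)
```

Under these assumptions the job runs on the order of several hundred map tasks, not 13, so the per-node slot limits have far more effect on run time than the mapred.map.tasks setting.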

> 3) How and when does the data replication for HDFS take place in a cluster?
> Currently I have set the dfs replication factor=1. How does this affect the
> performance?

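Once you put the data into HDFS, it starts replicating the blocks.
However, the put succeeds as soon as one copy of each block has been
written. With a replication factor of 1 there is only that single copy, so
nothing is replicated and losing a datanode means losing its blocks.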

> 4) Does adding a Virtual Machine to a physical machine cluster bring about
> significant degradation in the performance?

I don't have numbers for this, but it does impact performance. Moreover,
your hardware resources are low, and there is really no value added by
running virtual machines on top of them.

> Please let me know asap.
> Thanks,
> Ramya
