hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: Issues with performance on Hadoop/Hive
Date Fri, 04 Sep 2009 12:42:51 GMT

On Sep 3, 2009, at 11:53 PM, Ramiya V wrote:

> Hi,
>
> Thanks Amandeep and Ashish!
>
> @Ashish: I have set the hive.metastore.warehouse.dir parameter to
> /home/hive/warehouse. This warehouse directory is on the local
> filesystem. So will the tables now get stored on the local
> filesystem or on HDFS? I mean, do we need to specify explicitly if we
> want the path to refer to a location on the local filesystem?
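
As far as I know, a path with no scheme is resolved against the default filesystem (fs.default.name), so a bare /home/hive/warehouse would normally end up on HDFS when the default filesystem is HDFS, and an explicit file:// or hdfs:// prefix removes the ambiguity. A rough Python sketch of that rule follows; the qualify helper and the hdfs://master:9000 address are made up for illustration and this is not Hive's actual code:

```python
from urllib.parse import urlparse

def qualify(path, default_fs):
    """Keep an explicit scheme; otherwise prepend the default filesystem.

    Hypothetical helper that only mimics how a scheme-less path is
    usually interpreted. It is not taken from Hadoop or Hive.
    """
    if urlparse(path).scheme:
        return path
    return default_fs.rstrip("/") + path

# Assumed default filesystem; substitute your own fs.default.name value.
DEFAULT_FS = "hdfs://master:9000"

print(qualify("/home/hive/warehouse", DEFAULT_FS))
print(qualify("file:///home/hive/warehouse", DEFAULT_FS))
```

The first call yields an hdfs:// location and the second stays on the local filesystem, which is the distinction the question is about.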
>
> @Amandeep: Actually I need to know how the data gets distributed
> across the cluster. The master machine has 70GB of free space and the 3
> slaves have 140GB, 50GB and 140GB of free space on Ubuntu. So when I
> load a table into Hive with 45GB of data, how will it get distributed
> across the 4 nodes? I mean, since the master has 70GB of free space,
> will it store all the data on the master itself? (This is what I
> observed: when I loaded the table, 45GB of data was stored on the
> master and some blocks were replicated on the other 3 slaves.) I have
> set the dfs.replication factor to 2. I wanted to know how exactly
> Hadoop uses its intelligence to utilize the free space effectively
> when storing data on HDFS in a cluster.

Hey Ramya,

I believe this is thoroughly described in the system architecture  
document.  Let us know if there's something you feel is missing:

http://hadoop.apache.org/common/docs/current/hdfs_design.html
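
To connect that to what you observed: my understanding is that the default placement policy tends to put the first replica of each block on the DataNode where the writing client runs, and the remaining replicas on other nodes. Loading from the master would then leave a full copy there. Below is a toy Python simulation of that idea only. It assumes a single rack, 64MB blocks and one 45GB file, ignores free space and rack awareness, and place_blocks is a made-up name, not HDFS code:

```python
import random
from collections import Counter

def place_blocks(num_blocks, nodes, writer, replication=2, seed=0):
    """Toy model: first replica on the writer's node, the rest elsewhere."""
    rng = random.Random(seed)
    usage = Counter({n: 0 for n in nodes})
    for _ in range(num_blocks):
        # First replica goes to the writer's node when it is a DataNode.
        first = writer if writer in nodes else rng.choice(nodes)
        others = [n for n in nodes if n != first]
        # Remaining replicas go to distinct randomly chosen other nodes.
        targets = [first] + rng.sample(others, replication - 1)
        for n in targets:
            usage[n] += 1
    return usage

nodes = ["master", "slave1", "slave2", "slave3"]
# 45GB / 64MB = 720 blocks, written by a client running on the master.
usage = place_blocks(720, nodes, writer="master")
for node in nodes:
    print(node, usage[node] * 64 / 1024, "GB")
```

In this model the master always ends up with all 720 blocks (45GB), while the second replicas are spread over the three slaves.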

Brian

>
>
> -Ramya
> ________________________________________
> From: Ashish Thusoo [athusoo@facebook.com]
> Sent: Wednesday, September 02, 2009 11:02 PM
> To: common-user@hadoop.apache.org
> Subject: RE: Issues with performance on Hadoop/Hive
>
> Hi Ramya,
>
> If you are using the hive-default.xml and have not overwritten the  
> hive.metastore.warehouse.dir parameter then by default the tables  
> will get placed in
>
> /user/hive/warehouse in hdfs.
>
> Ashish
>
> -----Original Message-----
> From: Amandeep Khurana [mailto:amansk@gmail.com]
> Sent: Wednesday, September 02, 2009 12:52 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Issues with performance on Hadoop/Hive
>
> Answers inline
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Tue, Sep 1, 2009 at 10:08 PM, Ramiya V  
> <Ramiya_V@persistent.co.in> wrote:
>
>> Hi,
>>
>> I have set up a Hadoop cluster of 4 (physical) nodes. Configuration:
>> 2GB RAM on each machine. Currently I am using the sub-project Hive for
>> firing queries on 45GB of data. I have certain queries that need to be
>> resolved:-
>>
>> 1) The performance that I am getting with the above setup is quite
>> bad. It takes approximately 39 minutes for a simple select query (with
>> a where clause). I have set mapred.map.tasks=13 and
>> mapred.reduce.tasks=7.
>> Is this setting good enough for the above setup? Are there any
>> significant configuration parameters I need to set to get
>> better performance on Hive?
>>
>>
> Check on the resource utilization. I think you shouldn't be running
> more than 3 mappers + 1 reducer on each node at any time (given the
> hardware you are using). But then that mostly depends on the amount
> of work being done in the mappers and reducers.
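
For a rough feel of what that limit means: the per-node concurrency is capped by mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum, while mapred.map.tasks is only a hint, since the number of map tasks follows the input splits. A back-of-the-envelope Python sketch, assuming one split per default 64MB block, splittable input, and all 4 nodes running a TaskTracker (map_waves is a made-up helper):

```python
import math

def map_waves(input_bytes, split_bytes, nodes, map_slots_per_node):
    """Estimate map task count, total map slots and sequential 'waves'."""
    tasks = math.ceil(input_bytes / split_bytes)
    slots = nodes * map_slots_per_node
    return tasks, slots, math.ceil(tasks / slots)

GB, MB = 1024**3, 1024**2
# 45GB of input, 64MB splits, 4 nodes with 3 map slots each (assumptions).
tasks, slots, waves = map_waves(45 * GB, 64 * MB, nodes=4, map_slots_per_node=3)
print(tasks, slots, waves)  # 720 12 60
```

So under these assumptions a full scan runs about 720 map tasks in roughly 60 waves, whatever mapred.map.tasks is set to.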
>
>> 2) Does anybody know how exactly the data on HDFS is distributed
>> across nodes in a cluster? Also, when we load the tables in Hive (by
>> firing the Load command on the master node), how and where is the data
>> placed on HDFS in a cluster?
>>
>>
>
> Files are divided into blocks and the blocks are stored on the
> Datanodes. Each block is 64MB by default. I'm not sure how the blocks
> are distributed among the datanodes.
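
To put numbers on the block size, here is a small Python sketch of the arithmetic. It assumes the 45GB table is a single file of exactly 45GB and uses the replication factor of 2 mentioned in the follow-up; a real table made of several files would have a partial last block per file, and hdfs_footprint is a made-up helper:

```python
import math

def hdfs_footprint(file_bytes, block_bytes=64 * 1024**2, replication=2):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    blocks = math.ceil(file_bytes / block_bytes)
    replicas = blocks * replication
    raw_bytes = file_bytes * replication
    return blocks, replicas, raw_bytes

GB = 1024**3
blocks, replicas, raw = hdfs_footprint(45 * GB)
# Free space per node as reported in this thread.
free_gb = {"master": 70, "slave1": 140, "slave2": 50, "slave3": 140}
print(blocks, replicas, raw / GB, sum(free_gb.values()))  # 720 1440 90.0 400
```

That is about 90GB of raw storage for the table, against roughly 400GB of free space across the four nodes.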
>
>> 3) How and when does the data replication for HDFS take place in a
>> cluster? Currently I have set the dfs replication factor to 1. How
>> does this affect the performance?
>>
>
> Once you put the data into HDFS, it starts replicating the blocks.
> However, the put is successful as soon as one copy of each block gets
> created.
>
>
>>
>> 4) Does adding a virtual machine to a cluster of physical machines
>> bring about significant degradation in the performance?
>>
>
> I don't have numbers for this, but it does impact the performance.
> Moreover, your hardware resources are low and there is really no
> value added by running virtual machines on top of them.
>
>
>> Please let me know asap.
>>
>> Thanks,
>> Ramya
>>
>>
>>
>>
>> DISCLAIMER
>> ==========
>> This e-mail may contain privileged and confidential information which
>> is the property of Persistent Systems Ltd. It is intended only for  
>> the
>> use of the individual or entity to which it is addressed. If you are
>> not the intended recipient, you are not authorized to read, retain,
>> copy, print, distribute or use this message. If you have received  
>> this
>> communication in error, please notify the sender and delete all  
>> copies of this message.
>> Persistent Systems Ltd. does not accept any liability for virus
>> infected mails.
>>
>

