hadoop-common-user mailing list archives

From Ashish Thusoo <athu...@facebook.com>
Subject RE: Issues with performance on Hadoop/Hive
Date Fri, 04 Sep 2009 17:49:31 GMT
Hi Ramya,

Yes, you have to give the HDFS path explicitly, so

hdfs://<namenode>:<port>/home/hive/warehouse

should work if you want to keep the same path in HDFS.

Ashish
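
A minimal sketch of building the fully qualified URI described above. The helper name qualify_warehouse_path and the NameNode host and port are hypothetical placeholders, not values from this thread; substitute the address of your own NameNode.

```python
from urllib.parse import urlparse


def qualify_warehouse_path(path, namenode_host, namenode_port):
    """Return a fully qualified hdfs:// URI for a warehouse directory.

    A path that already carries a scheme (hdfs://, file://) is returned
    unchanged, so an explicit choice of filesystem is never overridden.
    """
    if urlparse(path).scheme:
        return path
    if not path.startswith("/"):
        raise ValueError("warehouse path must be absolute: %r" % path)
    return "hdfs://%s:%d%s" % (namenode_host, namenode_port, path)


# Hypothetical NameNode address, for illustration only.
uri = qualify_warehouse_path("/home/hive/warehouse", "namenode.example.com", 9000)
print(uri)
```

The resulting string is the kind of value that would be set for hive.metastore.warehouse.dir, typically in a site-specific override file such as hive-site.xml rather than in hive-default.xml.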

-----Original Message-----
From: Brian Bockelman [mailto:bbockelm@cse.unl.edu] 
Sent: Friday, September 04, 2009 5:43 AM
To: common-user@hadoop.apache.org
Subject: Re: Issues with performance on Hadoop/Hive


On Sep 3, 2009, at 11:53 PM, Ramiya V wrote:

> Hi,
>
> Thanks Amandeep and Ashish!
>
> @Ashish: I have set the hive.metastore.warehouse.dir parameter to
> /home/hive/warehouse. This warehouse directory is on the local
> filesystem. So will the tables now get stored on the local filesystem
> or on HDFS? I mean, do we need to specify explicitly whether we want the
> path to refer to a location on the local filesystem?
>
> @Amandeep: Actually, I need to know how the data gets distributed
> across the cluster. The master machine has 70GB of free space and the 3
> slaves have 140GB, 50GB, and 140GB of free space, all on Ubuntu. So when
> I load a table of 45GB into Hive, how will it get distributed across
> the 4 nodes? Since the master has 70GB free, will it store all the data
> on the master itself? (This is what I observed: when I loaded the
> table, 45GB of data was stored on the master and some blocks were
> replicated on the other 3 slaves.) I have set dfs.replication to 2. I
> wanted to know how exactly Hadoop decides where to place data so that
> the free space across the cluster is used effectively.

Hey Ramya,

I believe this is thoroughly described in the system architecture document.  Let us know if
there's something you feel is missing:

http://hadoop.apache.org/common/docs/current/hdfs_design.html

Brian
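
A rough simulation of the replica placement described in the design document may help explain the observation above. It is a simplified sketch, not the real NameNode code: it assumes a single rack, assumes the master also runs a datanode (as the observed 45GB on the master suggests), and ignores the rack-awareness, free-space, and load checks that the real policy applies. The hostnames and the function place_blocks are hypothetical.

```python
import math
import random

BLOCK_MB = 64            # default block size mentioned in the thread
DATA_MB = 45 * 1024      # the 45GB table
REPLICATION = 2          # dfs.replication value from the thread


def place_blocks(nodes, writer, n_blocks, replication, rng):
    """Simplified single-rack model of HDFS replica placement.

    The first replica goes to the writer's node when that node is a
    datanode; the remaining replicas go to distinct, randomly chosen
    other nodes. Returns the number of block replicas held per node.
    """
    stored = {node: 0 for node in nodes}
    for _ in range(n_blocks):
        first = writer if writer in stored else rng.choice(nodes)
        others = [n for n in nodes if n != first]
        targets = [first] + rng.sample(others, replication - 1)
        for node in targets:
            stored[node] += 1
    return stored


nodes = ["master", "slave1", "slave2", "slave3"]   # hypothetical hostnames
n_blocks = math.ceil(DATA_MB / BLOCK_MB)           # 720 blocks of 64MB
stored = place_blocks(nodes, "master", n_blocks, REPLICATION, random.Random(0))
for node in nodes:
    print(node, stored[node] * BLOCK_MB / 1024, "GB")
```

In this model the master ends up with one replica of every block (all 45GB) because the load was issued there, while the second replicas are spread over the three slaves. Running the load from a machine that is not a datanode would spread the first replicas as well.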

>
>
> -Ramya
> ________________________________________
> From: Ashish Thusoo [athusoo@facebook.com]
> Sent: Wednesday, September 02, 2009 11:02 PM
> To: common-user@hadoop.apache.org
> Subject: RE: Issues with performance on Hadoop/Hive
>
> Hi Ramya,
>
> If you are using the hive-default.xml and have not overwritten the 
> hive.metastore.warehouse.dir parameter then by default the tables will 
> get placed in
>
> /user/hive/warehouse in hdfs.
>
> Ashish
>
> -----Original Message-----
> From: Amandeep Khurana [mailto:amansk@gmail.com]
> Sent: Wednesday, September 02, 2009 12:52 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Issues with performance on Hadoop/Hive
>
> Answers inline
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Tue, Sep 1, 2009 at 10:08 PM, Ramiya V <Ramiya_V@persistent.co.in> 
> wrote:
>
>> Hi,
>>
>> I have set up a Hadoop cluster of 4 physical nodes, each with 2GB of
>> RAM. I am currently using the Hive sub-project to run queries on 45GB
>> of data. I have a few questions:
>>
>> 1) The performance I am getting with the above setup is quite bad. It
>> takes approximately 39 minutes for a simple select query (with a where
>> clause). I have set mapred.map.tasks=13 and mapred.reduce.tasks=7.
>> Is this setting good enough for the above setup? Are there any
>> significant configuration parameters I need to set to get better
>> performance on Hive?
>>
>>
> Check the resource utilization. I think you shouldn't be running more
> than 3 mappers + 1 reducer on each node at any time (given the hardware
> you are using). But that mostly depends on the amount of work being
> done in the mappers and reducers.
>
>> 2) Does anybody know how exactly the data on HDFS is distributed
>> across nodes in a cluster? Also, when we load the tables in Hive (by
>> firing the Load command on the master node), how and where is the data
>> placed on HDFS in a cluster?
>>
>
> Files are divided into blocks and the blocks are stored on the
> Datanodes. Each block is 64MB by default. I'm not sure how the blocks
> are distributed among the datanodes.
>
>> 3) How and when does the data replication for HDFS take place in a
>> cluster? Currently I have set the dfs replication factor=1. How does
>> this affect the performance?
>>
>
> Once you put the data into the hdfs, it starts replicating the blocks.
> However, the put is successful as soon as one block gets created...
>
>
>>
>> 4) Does adding a Virtual Machine to a physical machine cluster bring
>> about significant degradation in the performance?
>>
>
> I don't have numbers for this, but it does impact performance.
> Moreover, your hardware resources are low, and there is really no value
> added by running virtual machines on top of them.
>
>
>> Please let me know asap.
>>
>> Thanks,
>> Ramya
>>
>>
>>
>>
>> DISCLAIMER
>> ==========
>> This e-mail may contain privileged and confidential information which 
>> is the property of Persistent Systems Ltd. It is intended only for 
>> the use of the individual or entity to which it is addressed. If you 
>> are not the intended recipient, you are not authorized to read, 
>> retain, copy, print, distribute or use this message. If you have 
>> received this communication in error, please notify the sender and 
>> delete all copies of this message.
>> Persistent Systems Ltd. does not accept any liability for virus 
>> infected mails.
>>
>

