hadoop-mapreduce-user mailing list archives

From Ognen Duzlevski <og...@nengoiksvelzud.com>
Subject Re: Configuring hadoop 2.2.0
Date Wed, 29 Jan 2014 14:11:23 GMT
By the way, I discovered the start-balancer.sh script that comes with HDFS
- after running it with -threshold 5, I get the following output in the
logs:

2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over-utilized: [Source[
10.10.0.200:50010, utilization=76.45474474120932]]
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: 0 underutilized: []
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 936.81 GB to
make the cluster balanced.
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Decided to move 10 GB
bytes from 10.10.0.200:50010 to 10.10.0.203:50010
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 10 GB in this
iteration

Maybe this sheds more light on what I am talking about? In any case, why do
I need to run the balancer manually? Or do I?
Ognen


On Wed, Jan 29, 2014 at 8:05 AM, Ognen Duzlevski
<ognen@nengoiksvelzud.com> wrote:

> Hello (and thanks for replying!) :)
>
> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <java8964@hotmail.com> wrote:
>
>> Hi, Ognen:
>>
>> I noticed you were asking this question before under a different subject
>> line. I think you need to tell us where you see the unbalanced space: is
>> it on HDFS or on the local disk?
>>
>> 1) HDFS is independent of MR; they are not related to each other.
>>
>
> OK good to know.
>
>
>> 2) Without MR1 or MR2 (YARN), HDFS works on its own, which means all
>> HDFS commands and APIs will just work.
>>
>
> Good to know. Does this also mean that when I put or distcp a file to
> hdfs://namenode:54310/path/file it will "decide" how to split the file
> across all the datanodes so that the nodes are utilized equally in terms
> of space?
>
>
>> 3) But when you copy files into HDFS using distcp, you need the MR
>> component (it doesn't matter whether it is MR1 or MR2), because distcp
>> uses MapReduce to copy files in a massively parallel fashion.
>>
>
> Understood.
>
>
>> 4) Your original problem is that when you ran the distcp command, you
>> hadn't started the MR component in your cluster, so distcp in fact copied
>> your files to the LOCAL file system, based on someone else's reply to
>> your original question. I haven't tested this myself, but I am inclined
>> to believe it.
>>
>
> Sure. But even if distcp is running in one thread, its destination is
> hdfs://namenode:54310/path/file - should this not ensure an equal "split"
> of files across the whole HDFS cluster? Or am I delusional? :)
>
>
>> 5) If the above is true, then on the node where you ran the distcp
>> command you should find these files in the local file system, in the
>> path you specified. You should check and verify that.
>>
>
> OK - so the command is this:
>
> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file
> hdfs://10.10.0.198:54310/test/file
>
> where 10.10.0.198 is the HDFS Name node. I am running this on 10.10.0.200,
> which is one of the Data nodes, and I make no mention of the local data
> node storage in this command. My expectation is that the files obtained
> this way from S3 will end up distributed somewhat evenly across all 16
> Data nodes in this HDFS cluster. Am I wrong to expect this?
>
>> 6) After you start yarn/resource manager, you see the imbalance after
>> you distcp files again. Where is this imbalance: in HDFS or in the local
>> file system? List the commands and their outputs here, so we can
>> understand your problem more clearly, instead of sometimes being misled
>> by the description.
>>
>
> The imbalance is as follows: the machine I run the distcp command on (one
> of the Data nodes) ends up with 70+% of the space it contributes to the
> HDFS cluster occupied by these files, while the rest of the data nodes in
> the cluster only get about 10% of their contributed space occupied. Since
> HDFS is a distributed, parallel file system, I would expect the occupied
> file space to be spread evenly, or at least somewhat evenly, across all
> the data nodes.
>
> Thanks!
> Ognen
>
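
A likely explanation for the skew described above: under HDFS's default block placement, a client that is itself a datanode keeps the first replica of every block it writes on the local node, so running distcp's copy tasks on a single datanode fills that node's disks first. Below is a rough sketch of that placement behavior; rack-awareness is omitted, and the node names and replication factor are illustrative assumptions, not values from this cluster.

```python
import random

def place_replicas(writer, datanodes, replication=3):
    """Pick datanodes for one block's replicas (simplified, no rack logic).

    If the writing client runs on a datanode, the first replica stays
    local; the remaining replicas go to other, randomly chosen nodes.
    """
    others = [n for n in datanodes if n != writer]
    first = writer if writer in datanodes else random.choice(datanodes)
    rest = random.sample(others, min(replication - 1, len(others)))
    return [first] + rest

# Hypothetical 16-node cluster; distcp runs on 10.10.0.200 (a datanode):
nodes = [f"10.10.0.{200 + i}" for i in range(16)]
blocks = [place_replicas("10.10.0.200", nodes) for _ in range(100)]

# Every single block keeps a replica on the writing node:
local = sum("10.10.0.200" in b for b in blocks)
print(local)  # -> 100
```

This is why the writing node ends up far fuller than its peers, and why the balancer (or running the copy from a non-datanode client) is needed to even things out afterwards.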
