hadoop-hdfs-user mailing list archives

From Ognen Duzlevski <og...@nengoiksvelzud.com>
Subject Re: Configuring hadoop 2.2.0
Date Wed, 29 Jan 2014 14:21:21 GMT
Also, does anyone know how I can "force" the rebalancer to move more data
in one run? At the current settings, it will take about a week to rebalance
the nodes ;)
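For what it's worth, the balancer's speed is limited by two knobs: the per-datanode bandwidth cap `dfs.datanode.balance.bandwidthPerSec` (which defaults to 1 MB/s in Hadoop 2.x) and the `-threshold` option of the balancer itself. A sketch of raising both limits for a faster run, with illustrative values:

```shell
# Raise the per-datanode balancing bandwidth at runtime
# (default is 1 MB/s, i.e. 1048576 bytes/s, in Hadoop 2.x):
hdfs dfsadmin -setBalancerBandwidth 104857600   # 100 MB/s, illustrative

# Run the balancer with a tighter utilization threshold
# (allowed percentage deviation from the cluster average; default is 10):
hdfs balancer -threshold 5
```

The bandwidth set this way applies until the datanodes restart; to make it permanent, set `dfs.datanode.balance.bandwidthPerSec` in hdfs-site.xml instead.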

Ognen


On Wed, Jan 29, 2014 at 8:12 AM, Ognen Duzlevski
<ognen@nengoiksvelzud.com> wrote:

> Ahh, OK :)
>
> However, this seems kind of silly - it may be stored on the datanode, but I
> find the need to "force" the balancing manually somewhat strange. I mean,
> why use hdfs://namenode:port/path/file if the copies end up being stored
> locally anyway? ;)
>
> Ognen
>
>
> On Wed, Jan 29, 2014 at 8:10 AM, Selçuk Şenkul <ssenkul1@gmail.com> wrote:
>
>> Try running the command from the namenode, or from another node that is
>> not a datanode; the files should then distribute. As far as I know, if you
>> copy a file to HDFS from a datanode, the first copy is stored on that
>> datanode.
>>
>> On Wed, Jan 29, 2014 at 4:05 PM, Ognen Duzlevski <
>> ognen@nengoiksvelzud.com> wrote:
>>
>>> Hello (and thanks for replying!) :)
>>>
>>> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <java8964@hotmail.com> wrote:
>>>
>>>> Hi, Ognen:
>>>>
>>>> I noticed you were asking this question before under a different
>>>> subject line. I think you need to tell us where you see the unbalanced
>>>> space: is it on HDFS or on the local disk?
>>>>
>>>> 1) HDFS is independent of MapReduce; they are not related to each other.
>>>>
>>>
>>> OK good to know.
>>>
>>>
>>>> 2) Without MR1 or MR2 (YARN), HDFS works on its own, which means all
>>>> HDFS commands and APIs will just work.
>>>>
>>>
>>> Good to know. Does this also mean that when I put or distcp a file to
>>> hdfs://namenode:54310/path/file it will "decide" how to split the file
>>> across all the datanodes so that the nodes are utilized equally in terms
>>> of space?
>>>
>>>
>>>> 3) But when you try to copy files into HDFS using distcp, you need the
>>>> MR component (it doesn't matter whether it is MR1 or MR2), as distcp
>>>> uses MapReduce to copy files in a massively parallel fashion.
>>>>
>>>
>>> Understood.
>>>
>>>
>>>> 4) Your original problem is that when you ran the distcp command, you
>>>> hadn't started the MR component in your cluster, so distcp in fact copied
>>>> your files to the LOCAL file system, based on someone else's reply to
>>>> your original question. I haven't tested this myself, but I find it
>>>> believable.
>>>>
>>>
>>> Sure. But even if distcp is running in one thread, its destination is
>>> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
>>> files across the whole HDFS cluster? Or am I delusional? :)
>>>
>>>
>>>> 5) If the above is true, then on the node where you ran the distcp
>>>> command you should find these files in the local file system, under the
>>>> path you specified. You should check and verify that.
>>>>
>>>
>>> OK - so the command is this:
>>>
>>> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file
>>> hdfs://10.10.0.198:54310/test/file
>>>
>>> where 10.10.0.198 is the HDFS NameNode. I am running this on 10.10.0.200,
>>> which is one of the datanodes, and I am making no mention of the local
>>> datanode storage in this command. My expectation is that the files
>>> obtained this way from S3 will end up distributed somewhat evenly across
>>> all of the 16 datanodes in this HDFS cluster. Am I wrong to expect this?
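One way to check where the blocks of a file actually landed is fsck with block locations (the path here matches the distcp destination above):

```shell
# List every block of the file and the datanodes holding its replicas;
# if one node appears in every block's location list, the local-replica
# placement is the cause of the skew:
hdfs fsck /test/file -files -blocks -locations
```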
>>>
>>>> 6) After you started the YARN resource manager, you saw the imbalance
>>>> again after running distcp. Where is this imbalance: in HDFS or in the
>>>> local file system? List the commands and their outputs here, so we can
>>>> understand your problem more clearly instead of being misled by
>>>> descriptions alone.
>>>>
>>>
>>> The imbalance is as follows: the machine I run the distcp command on
>>> (one of the datanodes) ends up with 70+% of the space it contributes to
>>> the HDFS cluster occupied by these files, while the rest of the datanodes
>>> in the cluster have only about 10% of their contributed space occupied.
>>> Since HDFS is a distributed, parallel file system, I would expect the
>>> occupied space to be spread evenly, or at least somewhat evenly, across
>>> all the datanodes.
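The per-node skew described here can be quantified directly: `hdfs dfsadmin -report` prints each datanode's configured capacity, DFS used, and a `DFS Used%` figure, which is presumably where the 70% vs 10% numbers come from:

```shell
# Show per-datanode capacity and utilization; the "DFS Used%" lines
# make a skew like 70% on one node vs 10% elsewhere visible at a glance:
hdfs dfsadmin -report
```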
>>>
>>> Thanks!
>>> Ognen
>>>
>>
>>
>
