hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Dealing with low space cluster
Date Thu, 14 Jun 2012 14:49:05 GMT
Hi,

If you aren't using access lists (include/exclude), just place the conf
files on the new nodes (same as on the other slaves, or tweaked where
necessary) and start the daemons. The nodes will join automatically and
you will see them on the live nodes list immediately. You do not need to
run the refresh commands when not using the include/exclude lists.
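
A rough sketch of the above, using the start commands quoted below plus
`hadoop dfsadmin -report` to check that the new DataNodes show up. The
`run` wrapper, the `DRY_RUN` variable and the function names are only
illustrative helpers (not part of Hadoop); by default the script prints
the commands instead of executing them.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a real node to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On each new slave node, once the conf files are in place.
start_new_slave() {
    run hadoop-daemon.sh start datanode
    run hadoop-daemon.sh start tasktracker
}

# From any node with a client configuration: list the DataNodes
# the NameNode currently knows about.
check_live_nodes() {
    run hadoop dfsadmin -report
}

start_new_slave
check_live_nodes
```

No -refreshNodes call appears here because, as said, it is not needed
without include/exclude lists.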

For HDFS, in case the cluster already has some data, make sure to run
the data balancer afterwards. See
http://hadoop.apache.org/common/docs/stable/hdfs_user_guide.html#Rebalancer
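
As a sketch, a balancer run could look like the following. The threshold
of 10 (percent of disk capacity by which a DataNode's utilization may
differ from the cluster average) is just the usual default, picked here
as an assumption; the `run` wrapper, `DRY_RUN` and `THRESHOLD` names are
illustrative and not part of Hadoop. By default the command is only
printed.

```shell
#!/bin/sh
# Dry run by default: the command is printed, not executed.
# Set DRY_RUN=0 on a real cluster to actually start the balancer.
DRY_RUN="${DRY_RUN:-1}"

# Allowed deviation (in percent) from average cluster utilization.
# 10 is an assumed value; tune it for your cluster.
THRESHOLD="${THRESHOLD:-10}"

# Illustrative helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Move blocks around until every DataNode is within THRESHOLD percent
# of the cluster-wide utilization.
rebalance() {
    run hadoop balancer -threshold "$THRESHOLD"
}

rebalance
```

The balancer can take a while on a cluster with a lot of data, so it is
usually left running in the background after the new nodes have joined.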

On Thu, Jun 14, 2012 at 7:46 PM, Ondřej Klimpera <klimpond@fit.cvut.cz> wrote:
> Thanks, I'll try.
>
> One more question: I've got a few more nodes which can be added to the
> cluster. But how do I do that?
>
> If I understand it (according to Hadoop's wiki pages):
>
> 1. On master node - edit slaves file and add IP addresses of new nodes
> (everything clear)
> 2. log in to each newly added node and run (it's clear to me too)
>
> $ hadoop-daemon.sh start datanode
> $ hadoop-daemon.sh start tasktracker
>
> 3. Now I'm not sure: I'm not using dfs.include/mapred.include, so do I
> have to run these commands?
>
> $ hadoop dfsadmin -refreshNodes
> $ hadoop mradmin -refreshNodes
>
> If yes, must they be run on the master node, or on the new slave nodes?
>
> Ondrej
>
>
>
>
> On 06/14/2012 04:03 PM, Harsh J wrote:
>>
>> Ondřej,
>>
>> That isn't currently possible with local storage FS. Your 1 TB NFS
>> point can help but I suspect it may act as a slow-down point if nodes
>> use it in parallel. Perhaps mount it only on 3-4 machines (or less),
>> instead of all, to avoid that?
>>
>> On Thu, Jun 14, 2012 at 7:28 PM, Ondřej Klimpera <klimpond@fit.cvut.cz> wrote:
>>>
>>> Hello,
>>>
>>> you're right. That's exactly what I meant. And your answer is exactly
>>> what I thought. I was just wondering if Hadoop can distribute the data
>>> to other nodes' local storage if a node's own local space is full.
>>>
>>> Thanks
>>>
>>>
>>> On 06/14/2012 03:38 PM, Harsh J wrote:
>>>>
>>>> Ondřej,
>>>>
>>>> If by processing you mean trying to write out (map outputs) > 20 GB
>>>> of data per map task, that may not be possible, as the outputs need
>>>> to be materialized and the disk space is the constraint there.
>>>>
>>>> Or did I not understand you correctly (in thinking you are asking
>>>> about MapReduce)? Because otherwise you have ~50 GB of space available
>>>> for HDFS consumption (assuming replication = 3 for proper reliability).
>>>>
>>>> On Thu, Jun 14, 2012 at 1:25 PM, Ondřej Klimpera <klimpond@fit.cvut.cz> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> we're testing an application on 8 nodes, where each node has 20GB of
>>>>> local storage available. What we are trying to achieve is to get more
>>>>> than 20GB to be processed on this cluster.
>>>>>
>>>>> Is there a way to distribute the data across the cluster?
>>>>>
>>>>> There is also one shared NFS storage disk with 1TB of available space,
>>>>> which is now unused.
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> Ondrej Klimpera
>>>>
>>>>
>>>>
>>
>>
>



-- 
Harsh J
