hadoop-common-user mailing list archives

From Ognen Duzlevski <og...@nengoiksvelzud.com>
Subject Re: HDFS question
Date Tue, 28 Jan 2014 16:55:10 GMT
Ahh. No, I do not have a job tracker. OK - I guess I need to set one up :)

Thanks!
Ognen


On Tue, Jan 28, 2014 at 10:51 AM, Bryan Beaudreault <
bbeaudreault@hubspot.com> wrote:

> Do you have a jobtracker?  Without a jobtracker and tasktrackers, distcp
> runs in LocalJobRunner mode, i.e. as a single-threaded process on the
> local machine.  By default the DFSClient writes the first replica of each
> block to the local datanode, then places the second replica off-rack and
> the third on the same rack as the second.
>
> This would explain why everything seems to be going locally. It is also
> probably much slower than it could be.
>
>
> On Tue, Jan 28, 2014 at 11:42 AM, Ognen Duzlevski <
> ognen@nengoiksvelzud.com> wrote:
>
>> Hello,
>>
>> I am new to Hadoop and HDFS so maybe I am not understanding things
>> properly but I have the following issue:
>>
>> I have set up a name node and a bunch of data nodes for HDFS. Each node
>> contributes 1.6TB of space so the total space shown on the hdfs web front
>> end is about 25TB. I have set the replication to be 3.
>>
>> I am downloading large files from Amazon's S3 onto a single data node
>> using the distcp command, like this:
>>
>>  hadoop --config /etc/hadoop distcp
>> s3n://AKIAIUHOFVALO67O6FJQ:DV86+JnmNiMGZH9VpdtaZZ8ZJQKyDxy6yKtDBLPp@data-pipeline/large_data/2013-12-02.json
>> hdfs://10.10.0.198:54310/test/2013-12-03.json
>>
>> Where 10.10.0.198 is the Hadoop Name node.
>>
>> All I am getting is that the machine I am running these commands on (one
>> of the data nodes) is getting all the files - they do not seem to be
>> "spreading" around the HDFS cluster.
>>
>> Is this expected? Did I completely misunderstand the point of a parallel
>> DISTRIBUTED file system? :)
>>
>> Thanks!
>> Ognen
>>
>
>
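
The placement behavior Bryan describes can be illustrated with a small simulation. This is only a toy sketch, not Hadoop's actual code: the function name place_replicas, the two-rack topology, and the node names dn1..dn8 are all hypothetical. It assumes the usual default policy (first replica on the writer's own datanode, second on a different rack, third on the same rack as the second) and falls back to any unused node when the preferred choice is unavailable, for example on a single-rack cluster.

```python
import random
from collections import Counter


def place_replicas(writer, nodes_by_rack, replication=3, rng=None):
    """Toy model of default HDFS replica placement for a single block.

    writer        -- name of the node running the client (None if off-cluster)
    nodes_by_rack -- dict mapping rack name -> list of datanode names
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in nodes_by_rack.items() for node in nodes}
    all_nodes = list(rack_of)
    chosen = []

    def pick(preferred):
        # Choose among the preferred candidates; otherwise any unused node.
        pool = [n for n in preferred if n not in chosen]
        if not pool:
            pool = [n for n in all_nodes if n not in chosen]
        return rng.choice(pool) if pool else None

    for i in range(replication):
        if i == 0:
            # First replica: the writer's own node when it is a datanode.
            node = writer if writer in rack_of else pick(all_nodes)
        elif i == 1:
            # Second replica: a node on a different rack than the first.
            node = pick([n for n in all_nodes if rack_of[n] != rack_of[chosen[0]]])
        elif i == 2:
            # Third replica: another node on the same rack as the second.
            node = pick([n for n in all_nodes if rack_of[n] == rack_of[chosen[1]]])
        else:
            # Any further replicas: random unused nodes.
            node = pick(all_nodes)
        if node is None:
            break  # fewer datanodes than requested replicas
        chosen.append(node)
    return chosen


# Hypothetical topology: two racks of four datanodes each.
topology = {
    "rack1": ["dn1", "dn2", "dn3", "dn4"],
    "rack2": ["dn5", "dn6", "dn7", "dn8"],
}

# Write 1000 blocks from dn1, as a single local distcp process would.
rng = random.Random(0)
counts = Counter()
for _ in range(1000):
    counts.update(place_replicas("dn1", topology, rng=rng))

print(counts["dn1"])           # 1000: the writer holds one replica of every block
print(sorted(counts.items()))  # the other two replicas are spread over rack2
```

In this model the writing node ends up with a full copy of everything it writes while the other replicas are scattered, which matches the symptom in the original question. Running distcp as a real MapReduce job would instead spread the writers, and so the first replicas, across the cluster.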
