hadoop-hdfs-user mailing list archives

From Mohammad Tariq <donta...@gmail.com>
Subject Re: Hadoop noob question
Date Sun, 12 May 2013 12:37:50 GMT
No. distcp is actually a MapReduce job under the hood; the source just has to be a filesystem URI the map tasks can read (for example file:///), not necessarily something already in HDFS.
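
To make the local-to-HDFS direction concrete, a minimal sketch (the paths
and namenode address are only illustrative) would be:

hadoop distcp file:///data/staging/ hdfs://namenode:8020/user/rahul/staging/

The caveat is that the copy runs as map tasks on the cluster, so a file:///
source only helps when that path is visible to the task nodes (for example
a shared NFS mount). For data sitting on a single machine's local disk, a
plain "hadoop fs -put" or "-copyFromLocal" is usually the simpler choice.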

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Thanks to both of you!
>
> Rahul
>
>
> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>
>> you can do that using file:///
>>
>> example:
>>
>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>
>>
>>
>>
>>
>>
>>
>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> @Tariq, can you point me to some resource which shows how distcp is used
>>> to upload files from local to HDFS?
>>>
>>> Isn't distcp an MR job? Wouldn't it need the data to be already present
>>> in Hadoop's FS?
>>>
>>>  Rahul
>>>
>>>
>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>
>>>> You're welcome :)
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> Thanks Tariq!
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>
>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>
>>>>>> And the bigger the files, the less metadata there is, and hence the
>>>>>> lower the memory consumption.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>
>>>>>>> IMHO, I think the statement about the NN with regard to block
>>>>>>> metadata is more of a general statement. Even if you put lots of
>>>>>>> small files with a combined size of 10 TB, you need to have a
>>>>>>> capable NN.
>>>>>>>
>>>>>>> Can distcp be used to copy local to HDFS?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>
>>>>>>>> Absolutely right, Mohammad.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>>>>>
>>>>>>>>> Every file and block in HDFS is treated as an object, and for each
>>>>>>>>> object around 200 B of metadata gets created. So the NN should be
>>>>>>>>> powerful enough to handle that much metadata, since it is all held
>>>>>>>>> in memory. Actually, memory is the most important metric when it
>>>>>>>>> comes to the NN.
>>>>>>>>>
>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>
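As a rough back-of-the-envelope on that point (using the ~200 B/object
figure above and assuming a 128 MB block size, which is only an
illustrative default): 10 TB stored as 10,000 files of 1 GB each is
roughly 10,000 file objects + 80,000 block objects, i.e. about 90,000
objects or ~18 MB of NN heap, whereas the same 10 TB stored as 10 million
files of 1 MB each is about 20 million objects, i.e. roughly 4 GB of heap.
Same data, very different NN memory footprint.
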
>>>>>>>>> @Thoihen : As Nitin has said, with that much data you don't
>>>>>>>>> actually just do a "put". You could use something like distcp for
>>>>>>>>> parallel copying. A better approach would be to use a data
>>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already
>>>>>>>>> pointed out. Facebook uses their own data aggregation tool,
>>>>>>>>> called Scribe, for this purpose.
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> The NN would still be in the picture because it will be writing
>>>>>>>>>> a lot of metadata for each individual file. So you will need an
>>>>>>>>>> NN capable enough to store the metadata for your entire dataset.
>>>>>>>>>> Data will never go to the NN, but a lot of metadata about the
>>>>>>>>>> data will be on the NN, so it is always a good idea to have a
>>>>>>>>>> strong NN.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> @Nitin, parallel DFS writes to HDFS are great, but I could not
>>>>>>>>>>> understand the meaning of "capable NN". As I understand it, the
>>>>>>>>>>> NN is not part of the actual data write pipeline, meaning the
>>>>>>>>>>> data does not travel through the NN; the DFS client contacts the
>>>>>>>>>>> NN from time to time to get the locations of the DNs where the
>>>>>>>>>>> data blocks should be stored.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Rahul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Is it safe? There is no direct yes or no answer.
>>>>>>>>>>>>
>>>>>>>>>>>> When you say you have files worth 10 TB and you want to upload
>>>>>>>>>>>> them to HDFS, several factors come into the picture:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Is the machine in the same network as your Hadoop cluster?
>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>
>>>>>>>>>>>> And most importantly, I assume that you have a capable Hadoop
>>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>
>>>>>>>>>>>> I would definitely not write files sequentially into HDFS. I
>>>>>>>>>>>> would prefer to write files in parallel to HDFS to utilize the
>>>>>>>>>>>> DFS write features and speed up the process. You can run the
>>>>>>>>>>>> hdfs put command in a parallel manner, and in my experience it
>>>>>>>>>>>> has not failed when we write a lot of data.
>>>>>>>>>>>>
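As a rough sketch of that parallel put (directory, file pattern, target
path and the parallelism level are all illustrative):

# upload every file in a local staging directory, 8 puts at a time
cd /data/staging && ls *.gz | xargs -P 8 -I{} hadoop fs -put {} /user/rahul/incoming/

Each put is an independent HDFS client, so the data flows straight from
the client to the DataNode pipelines; the NN only hands out block
locations and records the metadata.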
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But I have one more question. Say I have 10 TB of data in the
>>>>>>>>>>>>> pipeline.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload
>>>>>>>>>>>>> these files of size 10 TB, and is there any limit to the file
>>>>>>>>>>>>> size when using the hadoop command line? Can the hadoop put
>>>>>>>>>>>>> command line work with huge data?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> First of all, most companies do not get 100 PB of data in
>>>>>>>>>>>>>> one go. It is an accumulating process, and most companies
>>>>>>>>>>>>>> have a data pipeline in place where the data is written to
>>>>>>>>>>>>>> HDFS on a regular frequency, then retained on HDFS for some
>>>>>>>>>>>>>> duration as needed, and from there it is sent to archival
>>>>>>>>>>>>>> storage or deleted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For data management products, you can look at Falcon, which
>>>>>>>>>>>>>> was open sourced by InMobi along with Hortonworks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In any case, if you want to write files to HDFS there are a
>>>>>>>>>>>>>> few options available to you:
>>>>>>>>>>>>>> 1) Write your own DFS client which writes to DFS
>>>>>>>>>>>>>> 2) Use the HDFS proxy
>>>>>>>>>>>>>> 3) There is WebHDFS (a sketch of its REST flow follows below)
>>>>>>>>>>>>>> 4) The command line hdfs tools
>>>>>>>>>>>>>> 5) Data collection tools that come with support for writing
>>>>>>>>>>>>>> to HDFS, like Flume etc.
>>>>>>>>>>>>>>
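Since WebHDFS comes up as option 3 above, here is a minimal sketch of its
REST flow for writing a file (host, port and paths are placeholders; 50070
is just the usual NameNode HTTP port, and dfs.webhdfs.enabled must be true
on the cluster):

# Step 1: ask the NameNode to create the file; it answers with a 307
# redirect whose Location header points at a DataNode.
curl -i -X PUT "http://NAMENODE_HOST:50070/webhdfs/v1/user/rahul/data.txt?op=CREATE"

# Step 2: send the actual bytes to the Location URL returned in step 1.
curl -i -X PUT -T ./data.txt "DATANODE_LOCATION_URL_FROM_STEP_1"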
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can anyone help me understand how companies like Facebook,
>>>>>>>>>>>>>>> Yahoo etc. upload bulk files, say to the tune of 100
>>>>>>>>>>>>>>> petabytes, to a Hadoop HDFS cluster for processing, and
>>>>>>>>>>>>>>> how, after processing, they download those files from HDFS
>>>>>>>>>>>>>>> to the local file system?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think they would be using the command line "hadoop
>>>>>>>>>>>>>>> fs put" to upload files, as it would take too long. Or do
>>>>>>>>>>>>>>> they divide the data into, say, 10 parts of 10 petabytes
>>>>>>>>>>>>>>> each, compress them, and use the command line "hadoop fs
>>>>>>>>>>>>>>> put"?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Or do they use some tool to upload huge files?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please help me.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
