hadoop-common-user mailing list archives

From Rahul Bhattacharjee <rahul.rec....@gmail.com>
Subject Re: Hadoop noob question
Date Sun, 12 May 2013 12:45:05 GMT
Yes, it's an MR job under the hood. My question was about your point that
using distcp you lose the benefits of Hadoop's parallel processing. I
think the MR job that distcp runs divides the files into individual map
tasks based on the total size of the transfer, so multiple mappers would
still be spawned for a large transfer and they would work in parallel.
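
For example, something like the following (hosts, paths and the map count
here are only placeholders) would run the copy as a MapReduce job with up
to 20 map tasks copying files in parallel:

hadoop distcp -m 20 hdfs://source-nn:8020/data hdfs://dest-nn:8020/backup/data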

Correct me if there is anything wrong!

Thanks,
Rahul


On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> No. distcp is actually a mapreduce job under the hood.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Thanks to both of you!
>>
>> Rahul
>>
>>
>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>
>>> you can do that using file:///
>>>
>>> example:
>>>
>>>
>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
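>>>
>>> and for the other direction (local to hdfs), something along these lines
>>> should work, provided the file:// path is readable from the nodes running
>>> the map tasks (the paths and host here are just placeholders):
>>>
>>> hadoop distcp file:///Users/myhome/Desktop/somefile hdfs://localhost:8020/user/rahul/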
>>>
>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> @Tariq can you point me to some resource which shows how distcp is used
>>>> to upload files from local to hdfs?
>>>>
>>>> Isn't distcp an MR job? Wouldn't it need the data to already be present
>>>> in Hadoop's fs?
>>>>
>>>> Rahul
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>
>>>>> You're welcome :)
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> Thanks Tariq!
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>
>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>
>>>>>>> And the bigger the files, the less metadata there is, and hence the
>>>>>>> lower the memory consumption.
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> IMHO, I think the statement about the NN with regard to block
>>>>>>>> metadata is more of a general statement. Even if you put lots of
>>>>>>>> small files with a combined size of 10 TB, you need a capable NN.
>>>>>>>>
>>>>>>>> Can distcp be used to copy local to HDFS?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Absolutely right, Mohammad.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>>>>>>
>>>>>>>>>> Every file and block in HDFS is treated as an object, and for each
>>>>>>>>>> object around 200 B of metadata gets created. So the NN should be
>>>>>>>>>> powerful enough to handle that much metadata, since it is all held
>>>>>>>>>> in memory. Actually, memory is the most important metric when it
>>>>>>>>>> comes to the NN.
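>>>>>>>>>>
>>>>>>>>>> As a rough, purely illustrative back-of-the-envelope using that
>>>>>>>>>> ~200 B figure: 10 TB stored as 100 MB files is about 100,000 files
>>>>>>>>>> and, with 64 MB blocks, roughly 300,000 objects, i.e. around 60 MB
>>>>>>>>>> of NN heap; the same 10 TB as 1 MB files is about 10 million files
>>>>>>>>>> and 20 million objects, i.e. roughly 4 GB of heap just for metadata.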
>>>>>>>>>>
>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>
>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>>>> you don't actually just do a "put". You could use something like
>>>>>>>>>> "distcp" for parallel copying. A better approach would be to use a
>>>>>>>>>> data aggregation tool like Flume or Chukwa, as Nitin has already
>>>>>>>>>> pointed out. Facebook uses their own data aggregation tool, called
>>>>>>>>>> Scribe, for this purpose.
>>>>>>>>>>
>>>>>>>>>> Warm Regards,
>>>>>>>>>> Tariq
>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The NN would still be in the picture because it will be writing a
>>>>>>>>>>> lot of metadata for each individual file, so you will need an NN
>>>>>>>>>>> capable of storing the metadata for your entire dataset. Data will
>>>>>>>>>>> never go to the NN, but a lot of metadata about the data will be
>>>>>>>>>>> on the NN, so it's always a good idea to have a strong NN.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> @Nitin, parallel DFS writes to HDFS are great, but I could not
>>>>>>>>>>>> understand the meaning of a capable NN. As far as I know, the NN
>>>>>>>>>>>> would not be a part of the actual data write pipeline, meaning the
>>>>>>>>>>>> data would not travel through the NN; the DFS client would contact
>>>>>>>>>>>> the NN from time to time to get the locations of the DNs where the
>>>>>>>>>>>> data blocks should be stored.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Rahul
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Is it safe? There is no direct yes or no answer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When you say you have files worth 10 TB and you want to upload
>>>>>>>>>>>>> them to HDFS, several factors come into the picture:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>
>>>>>>>>>>>>> And most importantly, I assume that you have a capable hadoop
>>>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would definitely not write files sequentially to HDFS. I would
>>>>>>>>>>>>> prefer to write files in parallel to HDFS to make use of the DFS
>>>>>>>>>>>>> write features and speed up the process.
>>>>>>>>>>>>> You can run hdfs put commands in parallel (see the sketch below),
>>>>>>>>>>>>> and in my experience it has not failed when we write a lot of
>>>>>>>>>>>>> data.
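>>>>>>>>>>>>>
>>>>>>>>>>>>> For example, something along these lines (the local and HDFS
>>>>>>>>>>>>> paths are just placeholders):
>>>>>>>>>>>>>
>>>>>>>>>>>>> for f in /data/incoming/part-*; do
>>>>>>>>>>>>>   hadoop fs -put "$f" /user/thoihen/staging/ &
>>>>>>>>>>>>> done
>>>>>>>>>>>>> wait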
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>>>>>>>> pipeline.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload
>>>>>>>>>>>>>> these files of size 10 TB, and is there any limit to the file
>>>>>>>>>>>>>> size when using the hadoop command line? Can the hadoop put
>>>>>>>>>>>>>> command line work with huge data?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> First of all, most companies do not get 100 PB of data in one
>>>>>>>>>>>>>>> go. It's an accumulating process, and most companies have a
>>>>>>>>>>>>>>> data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>> regular frequency, retained on hdfs for some duration as
>>>>>>>>>>>>>>> needed, and from there sent to archival storage or deleted.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For data management products, you can look at Falcon, which is
>>>>>>>>>>>>>>> open sourced by InMobi along with Hortonworks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are a
>>>>>>>>>>>>>>> few options available to you:
>>>>>>>>>>>>>>> 1) Write your own dfs client which writes to dfs
>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>> 3) there is webhdfs (a sketch of an upload via webhdfs is below)
>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs,
>>>>>>>>>>>>>>> like flume etc.
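>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For example, a webhdfs upload looks roughly like this (host,
>>>>>>>>>>>>>>> port and paths are placeholders, and webhdfs has to be enabled
>>>>>>>>>>>>>>> on the cluster); the first call returns a redirect to a
>>>>>>>>>>>>>>> datanode and the second call sends the actual data:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> curl -i -X PUT "http://<namenode>:50070/webhdfs/v1/user/thoihen/file1?op=CREATE"
>>>>>>>>>>>>>>> curl -i -X PUT -T file1 "<datanode URL from the Location header>"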
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can anyone help me understand how companies like Facebook,
>>>>>>>>>>>>>>>> Yahoo etc. upload bulk files, say to the tune of 100 petabytes,
>>>>>>>>>>>>>>>> to a Hadoop HDFS cluster for processing, and how, after
>>>>>>>>>>>>>>>> processing, they download those files from HDFS to the local
>>>>>>>>>>>>>>>> file system?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't think they would be using the command line hadoop fs
>>>>>>>>>>>>>>>> put to upload files, as it would take too long. Or do they
>>>>>>>>>>>>>>>> divide the data into, say, 10 parts of 10 petabytes each,
>>>>>>>>>>>>>>>> compress them, and use the command line hadoop fs put?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Or do they use any tool to upload huge files?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please help me.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
