hadoop-hdfs-user mailing list archives

From Mohammad Tariq <donta...@gmail.com>
Subject Re: Hadoop noob question
Date Sun, 12 May 2013 12:55:19 GMT
I had said that if you use distcp to copy data *from localFS to HDFS* then
you won't be able to exploit parallelism, as the entire file is present on a
single machine, so multiple TTs can't work on it in parallel.
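
To make that concrete, here is a minimal sketch of the kind of invocation I mean (host, paths and the -m value are only placeholders):

# the source sits on the local FS of a single machine, so the copy cannot
# fan out across the cluster the way an HDFS-to-HDFS distcp can
hadoop distcp -m 8 file:///home/tariq/staging hdfs://namenode:8020/user/tariq/staging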

Please comment if you think I am wrong somewhere.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Yes, it's an MR job under the hood. My question was that you wrote that
> using distcp you lose the benefits of parallel processing of Hadoop. I
> think the MR job of distcp divides files into individual map tasks based on
> the total size of the transfer, so multiple mappers would still be spawned
> if the size of the transfer is huge and they would work in parallel.
>
> Correct me if there is anything wrong!
>
> Thanks,
> Rahul
>
>
>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>
>> No. distcp is actually a mapreduce job under the hood.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Thanks to both of you!
>>>
>>> Rahul
>>>
>>>
>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>
>>>> you can do that using file:///
>>>>
>>>> example:
>>>>
>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>
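>>>> And the other way round (local FS to HDFS) is just the URIs swapped;
>>>> host and paths here are only placeholders:
>>>>
>>>> hadoop distcp file:///Users/myhome/Desktop/somefile hdfs://localhost:8020/somedir/
>>>>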
>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>> used to upload files from local to HDFS?
>>>>>
>>>>> Isn't distcp an MR job? Wouldn't it need the data to be already
>>>>> present in Hadoop's FS?
>>>>>
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>
>>>>>> You're welcome :)
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Tariq!
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>
>>>>>>>> And, the bigger the files, the less metadata, and hence the less memory
>>>>>>>> consumption.
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> IMHO, I think the statement about NN with regard to block metadata
>>>>>>>>> is more like a general statement. Even if you put lots of small files of
>>>>>>>>> combined size 10 TB, you need to have a capable NN.
>>>>>>>>>
>>>>>>>>> Can distcp be used to copy local-to-HDFS?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Rahul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Absolutely right, Mohammad.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>>>>>>>
>>>>>>>>>>> Every file and block in HDFS is treated as an object, and for
>>>>>>>>>>> each object around 200 B of metadata gets created. So the NN should be
>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>> in-memory. Actually, memory is the most important metric when it comes to
>>>>>>>>>>> the NN.
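>>>>>>>>>>>
>>>>>>>>>>> As a rough, purely hypothetical back-of-the-envelope (using the
>>>>>>>>>>> ~200 B figure above and assuming a 128 MB block size): 10 TB stored
>>>>>>>>>>> as 1 GB files is ~10K files + ~80K blocks, i.e. ~90K objects and only
>>>>>>>>>>> a few tens of MB of NN heap, whereas the same 10 TB stored as 1 MB
>>>>>>>>>>> files is ~10M files + ~10M blocks, i.e. ~20M objects and roughly 4 GB
>>>>>>>>>>> of heap.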
>>>>>>>>>>>
>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>
>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>>>>> you don't actually just do a "put". You could use something like "distcp"
>>>>>>>>>>> for parallel copying. A better approach would be to use a data aggregation
>>>>>>>>>>> tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses
>>>>>>>>>>> their own data aggregation tool, called Scribe, for this purpose.
>>>>>>>>>>>
>>>>>>>>>>> Warm Regards,
>>>>>>>>>>> Tariq
>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> NN would still be in the picture because it will be writing a lot
>>>>>>>>>>>> of metadata for each individual file. So you will need an NN capable
>>>>>>>>>>>> enough to store the metadata for your entire dataset. Data will never go
>>>>>>>>>>>> to the NN, but a lot of metadata about the data will be on the NN, so it's
>>>>>>>>>>>> always a good idea to have a strong NN.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> @Nitin, parallel DFS writes to HDFS are great, but I could not
>>>>>>>>>>>>> understand the meaning of a capable NN. As far as I know, the NN would not
>>>>>>>>>>>>> be a part of the actual data write pipeline, meaning that the data would
>>>>>>>>>>>>> not travel through the NN; the DFS client would contact the NN from time
>>>>>>>>>>>>> to time to get the locations of the DNs where the data blocks should be
>>>>>>>>>>>>> stored.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is it safe? .. there is no direct answer, yes or no.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When you say you have files worth 10 TB and you want to upload
>>>>>>>>>>>>>> them to HDFS, several factors come into the picture:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And most importantly, I assume that you have a capable hadoop
>>>>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would definitely not write files sequentially to HDFS. I
>>>>>>>>>>>>>> would prefer to write files in parallel to HDFS to utilize the DFS write
>>>>>>>>>>>>>> features and speed up the process.
>>>>>>>>>>>>>> You can run hdfs put commands in a parallel manner, and in my
>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
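>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just as a rough illustration of that parallel put idea (the
>>>>>>>>>>>>>> directory names and the -P4 level are made-up placeholders):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # run one put per local part directory, four at a time
>>>>>>>>>>>>>> ls -d /data/staging/part-* | xargs -P4 -I{} hadoop fs -put {} /user/thoihen/input/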
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>>>>>>>>> pipeline.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload
>>>>>>>>>>>>>>> these files of size 10 TB, and is there any limit to the file size using
>>>>>>>>>>>>>>> the hadoop command line? Can the hadoop put command line work with huge
>>>>>>>>>>>>>>> data?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First of all .. most companies do not get 100 PB of data in
>>>>>>>>>>>>>>>> one go. It's an accumulating process, and most companies do
>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>> frequency basis, then retained on hdfs for some duration as needed, and
>>>>>>>>>>>>>>>> from there it's sent to archivers or deleted.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For data management products, you can look at Falcon, which
>>>>>>>>>>>>>>>> is open sourced by InMobi along with Hortonworks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In any case, if you want to write files to hdfs there are a
>>>>>>>>>>>>>>>> few options available to you:
>>>>>>>>>>>>>>>> 1) Write your dfs client which writes to dfs
>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>> 3) there is webhdfs (see the curl sketch after this list)
>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to hdfs,
>>>>>>>>>>>>>>>> like flume etc.
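>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> A minimal webhdfs sketch for option 3 (host, port and paths
>>>>>>>>>>>>>>>> are just placeholders): creating a file is a two-step PUT,
>>>>>>>>>>>>>>>> where the namenode first redirects you to a datanode and you
>>>>>>>>>>>>>>>> then send the data there:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # step 1: ask the NN where to write; it answers with a 307 redirect
>>>>>>>>>>>>>>>> curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/file1?op=CREATE"
>>>>>>>>>>>>>>>> # step 2: PUT the local file to the Location URL returned above
>>>>>>>>>>>>>>>> curl -i -X PUT -T file1 "<Location-from-step-1>"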
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Can anyone help me understand how companies like Facebook,
>>>>>>>>>>>>>>>>> Yahoo etc. upload bulk files, say to the tune of 100 petabytes, to a
>>>>>>>>>>>>>>>>> Hadoop HDFS cluster for processing,
>>>>>>>>>>>>>>>>> and after processing how they download those files from
>>>>>>>>>>>>>>>>> HDFS to the local file system?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I don't think they would be using the command line hadoop
>>>>>>>>>>>>>>>>> fs put to upload files, as it would take too long. Or do they divide it
>>>>>>>>>>>>>>>>> into, say, 10 parts of 10 petabytes each, compress them, and use the
>>>>>>>>>>>>>>>>> command line hadoop fs put?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Or do they use some tool to upload huge files?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please help me.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>
