hadoop-common-user mailing list archives

From Rahul Bhattacharjee <rahul.rec....@gmail.com>
Subject Re: Hadoop noob question
Date Sun, 12 May 2013 12:30:50 GMT
Thanks to both of you!

Rahul


On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> You can do that using the file:/// scheme.
>
> example:
>
> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
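>
> The reverse direction works the same way; a minimal sketch (the paths are
> only placeholders, and since distcp runs as a MapReduce job the file:///
> path has to be visible to the nodes running the map tasks, e.g. a
> single-node setup or a shared mount):
>
> hadoop distcp file:///Users/myhome/Desktop/somefile hdfs://localhost:8020/user/myhome/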
>
>
>
>
> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>
>> @Tariq can you point me to some resource which shows how distcp is used
>> to upload files from local to HDFS?
>>
>> Isn't distcp an MR job? Wouldn't it need the data to already be present
>> in HDFS?
>>
>>  Rahul
>>
>>
>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>
>>> You're welcome :)
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Thanks Tariq!
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>
>>>>> @Rahul : Yes, distcp can do that.
>>>>>
>>>>> And the bigger the files, the less metadata there is, and hence the
>>>>> lower the memory consumption on the NN.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> IMHO, I think the statement about the NN with regard to block metadata
>>>>>> is more of a general statement. Even if you put lots of small files
>>>>>> with a combined size of 10 TB, you still need a capable NN.
>>>>>>
>>>>>> Can distcp be used to copy from local to HDFS?
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>
>>>>>>> Absolutely right, Mohammad.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>>>>
>>>>>>>> Every file and block in HDFS is treated as an object, and for each
>>>>>>>> object around 200 B of metadata gets created. So the NN should be
>>>>>>>> powerful enough to handle that much metadata, since it is all held
>>>>>>>> in memory. Actually, memory is the most important metric when it
>>>>>>>> comes to the NN.
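>>>>>>>>
>>>>>>>> A rough back-of-the-envelope illustration of why file size matters
>>>>>>>> (the ~200 B/object figure above and a 128 MB block size are only
>>>>>>>> assumptions; exact numbers vary by version and configuration):
>>>>>>>>
>>>>>>>> 10 TB as 128 MB files -> ~80,000 files + ~80,000 blocks
>>>>>>>>                          ~160,000 objects x 200 B ~= 32 MB of NN heap
>>>>>>>> 10 TB as 1 MB files   -> ~10,000,000 files + ~10,000,000 blocks
>>>>>>>>                          ~20,000,000 objects x 200 B ~= 4 GB of NN heap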
>>>>>>>>
>>>>>>>> Am I correct @Nitin?
>>>>>>>>
>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>>>>> don't just do a "put". You could use something like "distcp" for
>>>>>>>> parallel copying. A better approach would be to use a data
>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed
>>>>>>>> out. Facebook uses its own data aggregation tool, called Scribe, for
>>>>>>>> this purpose.
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The NN would still be in the picture because it will be writing a
>>>>>>>>> lot of metadata for each individual file, so you will need an NN
>>>>>>>>> capable of storing the metadata for your entire dataset. The data
>>>>>>>>> itself never goes to the NN, but a lot of metadata about the data
>>>>>>>>> will live on the NN, so it is always a good idea to have a strong NN.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin, parallel DFS writes to HDFS are great, but I could not
>>>>>>>>>> understand the meaning of a capable NN. As far as I know, the NN is
>>>>>>>>>> not part of the actual data write pipeline, meaning the data does
>>>>>>>>>> not travel through the NN; the DFS client contacts the NN from time
>>>>>>>>>> to time to get the locations of the DNs where the data blocks should
>>>>>>>>>> be stored.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is it safe? There is no direct yes or no answer.
>>>>>>>>>>>
>>>>>>>>>>> When you say you have files worth 10 TB and you want to upload them
>>>>>>>>>>> to HDFS, several factors come into the picture:
>>>>>>>>>>>
>>>>>>>>>>> 1) Is the machine in the same network as your Hadoop cluster?
>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>
>>>>>>>>>>> And most importantly, I assume that you have a capable Hadoop
>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>
>>>>>>>>>>> I would definitely not write the files to HDFS sequentially. I
>>>>>>>>>>> would prefer to write them in parallel to exploit the DFS write
>>>>>>>>>>> features and speed up the process. You can run the hdfs put command
>>>>>>>>>>> in parallel, and in my experience it has not failed when writing a
>>>>>>>>>>> lot of data.
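>>>>>>>>>>>
>>>>>>>>>>> A minimal sketch of such a parallel put (the paths and the
>>>>>>>>>>> parallelism level here are only placeholders):
>>>>>>>>>>>
>>>>>>>>>>> # upload every file under /data/incoming, up to 8 puts at a time
>>>>>>>>>>> ls /data/incoming | xargs -P 8 -I{} \
>>>>>>>>>>>   hadoop fs -put /data/incoming/{} /user/myhome/incoming/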
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>>>>>
>>>>>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>>>>>> pipeline.
>>>>>>>>>>>>
>>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these
>>>>>>>>>>>> files of size 10 TB, and is there any limit to the file size when
>>>>>>>>>>>> using the Hadoop command line? Can the hadoop put command work with
>>>>>>>>>>>> huge data?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> First of all, most companies do not get 100 PB of data in one go.
>>>>>>>>>>>>> It is an accumulating process, and most companies have a data
>>>>>>>>>>>>> pipeline in place where the data is written to HDFS on a regular
>>>>>>>>>>>>> frequency, retained on HDFS for some duration as needed, and from
>>>>>>>>>>>>> there sent to archival storage or deleted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For data management products, you can look at Falcon, which was
>>>>>>>>>>>>> open sourced by InMobi along with Hortonworks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In any case, if you want to write files to HDFS there are a few
>>>>>>>>>>>>> options available to you:
>>>>>>>>>>>>> 1) Write your own DFS client which writes to DFS
>>>>>>>>>>>>> 2) Use the HDFS proxy
>>>>>>>>>>>>> 3) Use WebHDFS (see the sketch just below)
>>>>>>>>>>>>> 4) Use the HDFS command line
>>>>>>>>>>>>> 5) Use data collection tools that come with support for writing
>>>>>>>>>>>>>    to HDFS, like Flume etc.
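>>>>>>>>>>>>>
>>>>>>>>>>>>> A minimal WebHDFS sketch (hostname, port and paths are only
>>>>>>>>>>>>> placeholders; 50070 is the default NN HTTP port, and the first
>>>>>>>>>>>>> call returns a redirect to a datanode that the second call then
>>>>>>>>>>>>> writes to):
>>>>>>>>>>>>>
>>>>>>>>>>>>> # step 1: ask the NN where to write; the Location header points
>>>>>>>>>>>>> # at a DN
>>>>>>>>>>>>> curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/myhome/file.txt?op=CREATE"
>>>>>>>>>>>>> # step 2: send the actual data to the DN URL returned above
>>>>>>>>>>>>> curl -i -X PUT -T file.txt "<Location URL from step 1>"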
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can anyone help me understand how companies like Facebook,
>>>>>>>>>>>>>> Yahoo, etc. upload bulk files, say to the tune of 100 petabytes,
>>>>>>>>>>>>>> to a Hadoop HDFS cluster for processing, and how, after
>>>>>>>>>>>>>> processing, they download those files from HDFS to the local
>>>>>>>>>>>>>> file system?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think they would be using the command line hadoop fs put
>>>>>>>>>>>>>> to upload files, as it would take too long. Or do they divide
>>>>>>>>>>>>>> the data into, say, 10 parts of 10 petabytes each, compress
>>>>>>>>>>>>>> them, and use the command line hadoop fs put?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or do they use some other tool to upload huge files?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please help me.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Nitin Pawar
>
