hadoop-hdfs-user mailing list archives

From Rahul Bhattacharjee <rahul.rec....@gmail.com>
Subject Re: Hadoop noob question
Date Sun, 12 May 2013 11:53:19 GMT
@Tariq, can you point me to some resource which shows how distcp is used to
upload files from local to HDFS?

Isn't distcp an MR job? Wouldn't it need the data to be already present in
Hadoop's fs?

Rahul


On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> You're welcome :)
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Thanks Tariq!
>>
>>
>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com>wrote:
>>
>>> @Rahul : Yes. distcp can do that.
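>>>
>>> For reference, a rough sketch of what that local-to-HDFS copy might look
>>> like (the paths and namenode URI here are just placeholders, and the
>>> file:// source has to be readable from wherever the copy tasks run):
>>>
>>>   hadoop distcp file:///data/staging hdfs://namenode:8020/user/rahul/staging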
>>>
>>> And the bigger the files, the less metadata there is, hence less memory
>>> consumption.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> IMHO, I think the statement about the NN with regard to block metadata is
>>>> more of a general statement. Even if you put lots of small files with a
>>>> combined size of 10 TB, you need to have a capable NN.
>>>>
>>>> Can distcp be used to copy local-to-HDFS?
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com>wrote:
>>>>
>>>>> absolutely right, Mohammad
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com>wrote:
>>>>>
>>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>>
>>>>>> Every file and block in HDFS is treated as an object, and for each
>>>>>> object around 200 B of metadata gets created. So the NN should be powerful
>>>>>> enough to handle that much metadata, since it is going to be in memory.
>>>>>> Actually, memory is the most important metric when it comes to the NN.
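>>>>>>
>>>>>> As a rough back-of-the-envelope illustration (taking the ~200 B per
>>>>>> object figure above and assuming a 128 MB block size): 10 TB stored as
>>>>>> 128 MB files is about 80,000 files plus 80,000 blocks, i.e. ~160,000
>>>>>> objects or roughly 32 MB of NN heap, whereas the same 10 TB stored as
>>>>>> 1 MB files is about 10.5 million files plus 10.5 million blocks, i.e.
>>>>>> ~21 million objects or roughly 4 GB of NN heap.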
>>>>>>
>>>>>> Am I correct @Nitin?
>>>>>>
>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>>> don't actually just do a "put". You could use something like "distcp"
>>>>>> for parallel copying. A better approach would be to use a data
>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed out.
>>>>>> Facebook uses their own data aggregation tool, called Scribe, for this
>>>>>> purpose.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> The NN would still be in the picture because it will be writing a lot
>>>>>>> of metadata for each individual file, so you will need an NN capable
>>>>>>> enough to store the metadata for your entire dataset. The data will
>>>>>>> never go to the NN, but a lot of metadata about the data will be on the
>>>>>>> NN, so it's always a good idea to have a strong NN.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Nitin, parallel DFS writes to HDFS are great, but I could not
>>>>>>>> understand the meaning of a capable NN. As I know, the NN would not be
>>>>>>>> a part of the actual data write pipeline, meaning that the data would
>>>>>>>> not travel through the NN; the DFS client would contact the NN from
>>>>>>>> time to time to get the locations of the DNs where the data blocks
>>>>>>>> should be stored.
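>>>>>>>>
>>>>>>>> As an aside, one way to see that split in practice (the path here is
>>>>>>>> just a placeholder) is to ask the NN for the block metadata it holds
>>>>>>>> for a file, which lists the blocks and the DNs they actually live on:
>>>>>>>>
>>>>>>>>   hadoop fsck /user/rahul/somefile -files -blocks -locations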
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Is it safe? .. there is no direct answer, yes or no.
>>>>>>>>>
>>>>>>>>> When you say you have files worth 10 TB and you want to upload them
>>>>>>>>> to HDFS, several factors come into the picture:
>>>>>>>>>
>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>
>>>>>>>>> And most importantly, I assume that you have a capable hadoop
>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>
>>>>>>>>> I would definitely not write files to HDFS sequentially. I would
>>>>>>>>> prefer to write files to HDFS in parallel to utilize the DFS write
>>>>>>>>> features and speed up the process.
>>>>>>>>> You can run the hdfs put command in a parallel manner, and in my
>>>>>>>>> experience it has not failed when we write a lot of data.
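>>>>>>>>>
>>>>>>>>> A rough sketch of one way to do that from a shell (the local and HDFS
>>>>>>>>> paths are just placeholders, and this assumes the data is already
>>>>>>>>> split into several local directories):
>>>>>>>>>
>>>>>>>>>   for d in /data/part-*; do
>>>>>>>>>     hadoop fs -put "$d" /user/thoihen/incoming/ &
>>>>>>>>>   done
>>>>>>>>>   wait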
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com>wrote:
>>>>>>>>>
>>>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>>>
>>>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>>>> pipeline.
>>>>>>>>>>
>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these
>>>>>>>>>> files of size 10 TB, and is there any limit to the file size when
>>>>>>>>>> using the hadoop command line? Can the hadoop put command line work
>>>>>>>>>> with huge data?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> First of all .. most companies do not get 100 PB of data in one
>>>>>>>>>>> go. It's an accumulating process, and most companies have a data
>>>>>>>>>>> pipeline in place where the data is written to hdfs on a regular
>>>>>>>>>>> frequency, retained on hdfs for some duration as needed, and from
>>>>>>>>>>> there sent to archivers or deleted.
>>>>>>>>>>>
>>>>>>>>>>> For data management products, you can look at Falcon, which was
>>>>>>>>>>> open sourced by InMobi along with Hortonworks.
>>>>>>>>>>>
>>>>>>>>>>> In any case, if you want to write files to hdfs there are a few
>>>>>>>>>>> options available to you:
>>>>>>>>>>> 1) Write your own dfs client which writes to dfs
>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>> 3) there is webhdfs (a rough curl sketch follows this list)
>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>> 5) data collection tools that come with support for writing to
>>>>>>>>>>> hdfs, like Flume etc.
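>>>>>>>>>>>
>>>>>>>>>>> For the webhdfs option, a rough sketch of the two-step REST upload
>>>>>>>>>>> (hostname, port and path are placeholders; this assumes webhdfs is
>>>>>>>>>>> enabled and the NN web UI is on the default port 50070):
>>>>>>>>>>>
>>>>>>>>>>>   # step 1: ask the NN where to write; it replies with a 307
>>>>>>>>>>>   # redirect whose Location header points at a DN
>>>>>>>>>>>   curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/file.txt?op=CREATE"
>>>>>>>>>>>
>>>>>>>>>>>   # step 2: send the actual file content to that DN location
>>>>>>>>>>>   curl -i -X PUT -T file.txt "<Location-URL-from-step-1>"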
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> Can anyone help me understand how companies like Facebook, Yahoo
>>>>>>>>>>>> etc. upload bulk files, say to the tune of 100 petabytes, to a
>>>>>>>>>>>> Hadoop HDFS cluster for processing, and after processing how they
>>>>>>>>>>>> download those files from HDFS to the local file system?
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think they would be using the command line hadoop fs put
>>>>>>>>>>>> to upload the files, as it would take too long. Or do they divide
>>>>>>>>>>>> them into, say, 10 parts of 10 petabytes each, compress them and
>>>>>>>>>>>> use the command line hadoop fs put?
>>>>>>>>>>>>
>>>>>>>>>>>> Or do they use some tool to upload huge files?
>>>>>>>>>>>>
>>>>>>>>>>>> Please help me.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> thoihen
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>
