hadoop-hdfs-user mailing list archives

From Mohammad Tariq <donta...@gmail.com>
Subject Re: Hadoop noob question
Date Sat, 11 May 2013 17:22:24 GMT
You're welcome :)

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Thanks Tariq!
>
>
> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com>wrote:
>
>> @Rahul : Yes. distcp can do that.
>>
>> And the bigger the files, the less metadata there is, and hence the lower the
>> memory consumption on the NN.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> IMHO, the statement about the NN with regard to block metadata is more of a
>>> general statement. Even if you put lots of small files with a combined size
>>> of 10 TB, you need to have a capable NN.
>>>
>>> Can distcp be used to copy local-to-HDFS?
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com>wrote:
>>>
>>>> Absolutely right, Mohammad.
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com>wrote:
>>>>
>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>
>>>>> Every file and block in HDFS is treated as an object, and around 200 B of
>>>>> metadata is created for each object. So the NN should be powerful enough to
>>>>> handle that much metadata, since it is all held in memory. Memory is
>>>>> actually the most important resource when it comes to the NN.
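>>>>>
>>>>> As a rough back-of-the-envelope illustration (just a sketch built on the
>>>>> ~200 B per object figure above; the block size, file sizes, and per-object
>>>>> cost here are assumptions, not exact numbers):
>>>>>
>>>>>     // Rough estimate of NN heap for 10 TB stored as big vs. small files,
>>>>>     // assuming ~200 bytes of in-memory metadata per file/block object.
>>>>>     public class NnMemoryEstimate {
>>>>>         public static void main(String[] args) {
>>>>>             long data = 10L * 1024 * 1024 * 1024 * 1024;  // 10 TB
>>>>>             long blockSize = 128L * 1024 * 1024;          // 128 MB blocks
>>>>>             long perObject = 200;                         // bytes of metadata
>>>>>
>>>>>             // Case 1: 1 GB files -> ~10k file objects + ~82k block objects
>>>>>             long bigFiles = data / (1024L * 1024 * 1024);
>>>>>             long bigBlocks = data / blockSize;
>>>>>             System.out.println("big files:   ~"
>>>>>                     + (bigFiles + bigBlocks) * perObject / (1024 * 1024) + " MB");
>>>>>
>>>>>             // Case 2: 1 MB files -> ~10.5M files, each with its own block
>>>>>             long smallFiles = data / (1024L * 1024);
>>>>>             System.out.println("small files: ~"
>>>>>                     + smallFiles * 2 * perObject / (1024 * 1024) + " MB");
>>>>>         }
>>>>>     }
>>>>>
>>>>> Under those assumptions the same 10 TB costs the NN under 20 MB of heap as
>>>>> 1 GB files but about 4 GB as 1 MB files, which is the point about bigger
>>>>> files meaning less metadata.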
>>>>>
>>>>> Am I correct @Nitin?
>>>>>
>>>>> @Thoihen : As Nitin has said, when you talk about that much data you don't
>>>>> actually just do a "put". You could use something like "distcp" for
>>>>> parallel copying. A better approach would be to use a data aggregation
>>>>> tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses
>>>>> its own data aggregation tool, called Scribe, for this purpose.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com>wrote:
>>>>>
>>>>>> The NN would still be in the picture because it will be writing a lot of
>>>>>> metadata for each individual file. So you will need an NN capable enough
>>>>>> to store the metadata for your entire dataset. The data itself never goes
>>>>>> to the NN, but a lot of metadata about the data will live on the NN, so it
>>>>>> is always a good idea to have a strong NN.
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>
>>>>>>> @Nitin, writing to HDFS in parallel is great, but I could not understand
>>>>>>> the meaning of a "capable NN". As far as I know, the NN is not part of
>>>>>>> the actual data write pipeline, meaning that the data does not travel
>>>>>>> through the NN; the DFS client contacts the NN from time to time to get
>>>>>>> the locations of the DNs where the data blocks should be stored.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>
>>>>>>>> Is it safe? There is no direct yes-or-no answer.
>>>>>>>>
>>>>>>>> When you say you have files worth 10 TB and you want to upload them to
>>>>>>>> HDFS, several factors come into the picture:
>>>>>>>>
>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>
>>>>>>>> And most importantly, I assume that you have a capable hadoop cluster.
>>>>>>>> By that I mean you have a capable namenode.
>>>>>>>>
>>>>>>>> I would definitely not write the files sequentially into HDFS. I would
>>>>>>>> prefer to write the files in parallel to HDFS to utilize the DFS write
>>>>>>>> features and speed up the process. You can run the hdfs put command in
>>>>>>>> parallel, and in my experience it has not failed when we write a lot of
>>>>>>>> data.
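>>>>>>>>
>>>>>>>> Just to make that concrete, here is a minimal sketch of doing the puts
>>>>>>>> in parallel from a small Java client instead of the shell (the NameNode
>>>>>>>> URI, local directory and target path are hypothetical, and real code
>>>>>>>> would want proper retries and error reporting):
>>>>>>>>
>>>>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>>>>     import org.apache.hadoop.fs.FileSystem;
>>>>>>>>     import org.apache.hadoop.fs.Path;
>>>>>>>>     import java.io.File;
>>>>>>>>     import java.net.URI;
>>>>>>>>     import java.util.concurrent.ExecutorService;
>>>>>>>>     import java.util.concurrent.Executors;
>>>>>>>>     import java.util.concurrent.TimeUnit;
>>>>>>>>
>>>>>>>>     public class ParallelPut {
>>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>>             Configuration conf = new Configuration();
>>>>>>>>             // One client, shared by all upload tasks.
>>>>>>>>             FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
>>>>>>>>             ExecutorService pool = Executors.newFixedThreadPool(8); // 8 parallel uploads
>>>>>>>>
>>>>>>>>             // Hypothetical local directory full of files to upload.
>>>>>>>>             for (File f : new File("/data/to-upload").listFiles()) {
>>>>>>>>                 pool.submit(() -> {
>>>>>>>>                     try {
>>>>>>>>                         fs.copyFromLocalFile(new Path(f.getAbsolutePath()),
>>>>>>>>                                              new Path("/ingest/" + f.getName()));
>>>>>>>>                     } catch (Exception e) {
>>>>>>>>                         e.printStackTrace(); // a real tool would retry / record the failure
>>>>>>>>                     }
>>>>>>>>                 });
>>>>>>>>             }
>>>>>>>>             pool.shutdown();
>>>>>>>>             pool.awaitTermination(1, TimeUnit.DAYS);
>>>>>>>>             fs.close();
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>>
>>>>>>>> Running several plain hadoop fs -put commands against different source
>>>>>>>> files achieves much the same effect from the shell.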
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com>wrote:
>>>>>>>>
>>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>>
>>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>>> pipeline.
>>>>>>>>>
>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these
>>>>>>>>> files of 10 TB, and is there any limit to the file size when using the
>>>>>>>>> hadoop command line? Can the hadoop put command line work with huge
>>>>>>>>> data?
>>>>>>>>>
>>>>>>>>> Thanks in advance
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> First of all, most companies do not get 100 PB of data in one go. It
>>>>>>>>>> is an accumulating process, and most companies have a data pipeline in
>>>>>>>>>> place where the data is written to HDFS on a regular frequency,
>>>>>>>>>> retained on HDFS for some duration as needed, and from there sent to
>>>>>>>>>> archivers or deleted.
>>>>>>>>>>
>>>>>>>>>> For data management products, you can look at Falcon, which is open
>>>>>>>>>> sourced by InMobi along with Hortonworks.
>>>>>>>>>>
>>>>>>>>>> In any case, if you want to write files to HDFS there are a few
>>>>>>>>>> options available to you:
>>>>>>>>>> 1) Write your own dfs client which writes to dfs (a minimal sketch of
>>>>>>>>>> this follows below)
>>>>>>>>>> 2) Use the hdfs proxy
>>>>>>>>>> 3) There is webhdfs
>>>>>>>>>> 4) The command line hdfs tools
>>>>>>>>>> 5) Data collection tools that come with support for writing to hdfs,
>>>>>>>>>> like Flume etc.
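>>>>>>>>>>
>>>>>>>>>> Here is the minimal sketch of option 1 mentioned above, i.e. a tiny
>>>>>>>>>> client built on the Hadoop FileSystem API (the NameNode URI and target
>>>>>>>>>> path are hypothetical; this only illustrates the shape of such a
>>>>>>>>>> client):
>>>>>>>>>>
>>>>>>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>     import org.apache.hadoop.fs.FSDataOutputStream;
>>>>>>>>>>     import org.apache.hadoop.fs.FileSystem;
>>>>>>>>>>     import org.apache.hadoop.fs.Path;
>>>>>>>>>>     import java.net.URI;
>>>>>>>>>>     import java.nio.charset.StandardCharsets;
>>>>>>>>>>
>>>>>>>>>>     public class TinyDfsClient {
>>>>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>>>>             Configuration conf = new Configuration();
>>>>>>>>>>             try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
>>>>>>>>>>                  FSDataOutputStream out = fs.create(new Path("/ingest/hello.txt"), true)) {
>>>>>>>>>>                 // The client only talks to the NN for metadata (where the
>>>>>>>>>>                 // blocks should go); the bytes themselves stream to the DNs.
>>>>>>>>>>                 out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
>>>>>>>>>>             }
>>>>>>>>>>         }
>>>>>>>>>>     }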
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> Can anyone help me understand how companies like Facebook, Yahoo,
>>>>>>>>>>> etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop
>>>>>>>>>>> HDFS cluster for processing, and how, after processing, they download
>>>>>>>>>>> those files from HDFS to the local file system?
>>>>>>>>>>>
>>>>>>>>>>> I don't think they would be using the command line hadoop fs put to
>>>>>>>>>>> upload the files, as it would take too long. Or do they divide the
>>>>>>>>>>> data into, say, 10 parts of 10 petabytes each, compress them, and use
>>>>>>>>>>> the command line hadoop fs put?
>>>>>>>>>>>
>>>>>>>>>>> Or do they use some tool to upload huge files?
>>>>>>>>>>>
>>>>>>>>>>> Please help me.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> thoihen
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>
