hadoop-common-user mailing list archives

From Rahul Bhattacharjee <rahul.rec....@gmail.com>
Subject Re: Hadoop noob question
Date Sat, 11 May 2013 17:16:45 GMT
Thanks Tariq!


On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> @Rahul: Yes, distcp can do that.
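> For example (the NameNode host/port here is just a placeholder for whatever
> your fs.defaultFS points to), something like:
>
>     hadoop distcp file:///data/staging hdfs://namenode:8020/user/rahul/staging
>
> should work, with the caveat that the file:// path must be readable from the
> nodes running the distcp map tasks, since distcp runs as a MapReduce job.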
>
> And the bigger the files, the less metadata there is, and hence the lower
> the memory consumption on the NN.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> IMHO, the statement about the NN with regard to block metadata is more of
>> a general one: even if you put lots of small files with a combined size of
>> 10 TB, you still need a capable NN.
>>
>> Can distcp be used to copy local-to-HDFS?
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com>wrote:
>>
>>> Absolutely right, Mohammad.
>>>
>>>
>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com>wrote:
>>>
>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>
>>>> Every file and block in HDFS is treated as an object, and for each
>>>> object around 200 B of metadata gets created. So the NN should be
>>>> powerful enough to handle that much metadata, since it is all held in
>>>> memory. Actually, memory is the most important metric when it comes to
>>>> the NN.
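>>>> As a rough back-of-the-envelope illustration (using the ~200 B figure
>>>> above): 10,000,000 files/blocks x 200 B = ~2 GB of NameNode heap just
>>>> for the namespace, before any other overhead.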
>>>>
>>>> Am I correct @Nitin?
>>>>
>>>> @Thoihen: As Nitin has said, with that much data you don't actually just
>>>> do a "put". You could use something like "distcp" for parallel copying.
>>>> A better approach would be to use a data aggregation tool like Flume or
>>>> Chukwa, as Nitin has already pointed out. Facebook uses its own data
>>>> aggregation tool, called Scribe, for this purpose.
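>>>> As a rough sketch of the distcp route (the paths and map count are only
>>>> illustrative), a parallel copy could look like:
>>>>
>>>>     hadoop distcp -m 20 hdfs://source-nn:8020/data hdfs://dest-nn:8020/data
>>>>
>>>> where -m sets how many map tasks (parallel copiers) distcp runs with.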
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com>wrote:
>>>>
>>>>> The NN would still be in the picture because it will be writing a lot
>>>>> of metadata for each individual file, so you will need a NN capable of
>>>>> storing the metadata for your entire dataset. The data itself will never
>>>>> go to the NN, but a lot of metadata about it will live on the NN, so it
>>>>> is always a good idea to have a strong NN.
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> @Nitin, parallel DFS writes to HDFS sound great, but I could not
>>>>>> understand the meaning of a "capable NN". As I understand it, the NN is
>>>>>> not part of the actual data write pipeline, meaning the data does not
>>>>>> travel through the NN; the DFS client only contacts the NN from time to
>>>>>> time to get the DN locations where the data blocks should be stored.
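>>>>>> (You can actually see that split with something like
>>>>>> "hadoop fsck /some/file -files -blocks -locations": the NN answers with
>>>>>> block metadata and DN locations, while the block data itself stays on
>>>>>> the DNs.)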
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Is it safe? There is no direct yes-or-no answer.
>>>>>>>
>>>>>>> When you say you have 10 TB worth of files and you want to upload
>>>>>>> them to HDFS, several factors come into the picture:
>>>>>>>
>>>>>>> 1) Is the machine on the same network as your hadoop cluster?
>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>
>>>>>>> And most importantly, I assume that you have a capable hadoop
>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>
>>>>>>> I would definitely not write the files sequentially to HDFS. I would
>>>>>>> prefer to write them in parallel to utilize the DFS write features and
>>>>>>> speed up the process.
>>>>>>> You can run the hdfs put command in parallel, and in my experience it
>>>>>>> has not failed when we write a lot of data.
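>>>>>>> As a sketch (the directory names are just examples), you can fan the
>>>>>>> puts out from the edge node with something like:
>>>>>>>
>>>>>>>     ls /data/staging | xargs -P 8 -I {} hadoop fs -put /data/staging/{} /user/ingest/
>>>>>>>
>>>>>>> which keeps 8 hadoop fs -put processes running at a time.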
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com>wrote:
>>>>>>>
>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>
>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>> pipeline.
>>>>>>>>
>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these
>>>>>>>> files of size 10 TB, and is there any limit on file size when using
>>>>>>>> the hadoop command line? Can hadoop put work with huge data?
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> First of all, most companies do not get 100 PB of data in one go.
>>>>>>>>> It is an accumulating process, and most companies have a data
>>>>>>>>> pipeline in place where the data is written to hdfs on a regular
>>>>>>>>> frequency, retained on hdfs for some duration as needed, and from
>>>>>>>>> there sent to archives or deleted.
>>>>>>>>>
>>>>>>>>> For data management products, you can look at Falcon, which is
>>>>>>>>> open-sourced by InMobi along with Hortonworks.
>>>>>>>>>
>>>>>>>>> In any case, if you want to write files to hdfs there are a few
>>>>>>>>> options available to you:
>>>>>>>>> 1) write your own dfs client which writes to dfs
>>>>>>>>> 2) use hdfs proxy
>>>>>>>>> 3) use webhdfs (see the curl sketch below)
>>>>>>>>> 4) use the command line hdfs
>>>>>>>>> 5) use data collection tools that come with built-in support for
>>>>>>>>> writing to hdfs, like flume etc.
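>>>>>>>>> As an example of option 3 (hostnames and ports are placeholders, and
>>>>>>>>> webhdfs has to be enabled on the cluster), an upload via WebHDFS is
>>>>>>>>> a two-step PUT:
>>>>>>>>>
>>>>>>>>>     # step 1: the NN replies with a 307 redirect to a DN
>>>>>>>>>     curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/foo/file.txt?op=CREATE"
>>>>>>>>>     # step 2: send the data to the Location header returned in step 1
>>>>>>>>>     curl -i -X PUT -T file.txt "<Location from step 1>"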
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> Can anyone help me understand how companies like Facebook, Yahoo,
>>>>>>>>>> etc. upload bulk files, say to the tune of 100 petabytes, to a
>>>>>>>>>> Hadoop HDFS cluster for processing, and how, after processing, they
>>>>>>>>>> download those files from HDFS to the local file system?
>>>>>>>>>>
>>>>>>>>>> I don't think they would be using the command line hadoop fs put to
>>>>>>>>>> upload files, as it would take too long. Or do they divide the data
>>>>>>>>>> into, say, 10 parts of 10 petabytes each, compress them, and use
>>>>>>>>>> the command line hadoop fs put?
>>>>>>>>>>
>>>>>>>>>> Or do they use some tool to upload huge files?
>>>>>>>>>>
>>>>>>>>>> Please help me.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> thoihen
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
