hadoop-mapreduce-user mailing list archives

From Rahul Bhattacharjee <rahul.rec....@gmail.com>
Subject Re: Hadoop noob question
Date Sat, 11 May 2013 16:10:27 GMT
IMHO, the statement about the NN with regard to block metadata is more of a
general statement. Even if you put lots of small files with a combined size
of 10 TB, you still need a capable NN.

Can distcp be used to copy from local to HDFS?

Thanks,
Rahul


On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> Absolutely right, Mohammad.
>
>
> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>
>> Sorry for barging in, guys. I think Nitin is talking about this:
>>
>> Every file and block in HDFS is treated as an object, and around 200 B of
>> metadata gets created for each object. So the NN should be powerful enough
>> to handle that much metadata, since it is all held in memory. Memory is
>> actually the most important metric when it comes to the NN.
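>>
>> As a rough, hypothetical back-of-envelope (taking that ~200 B per object
>> figure and a 128 MB block size as assumptions): 10 TB stored as 128 MB
>> files is roughly 80,000 files with one block each, i.e. ~160,000 objects
>> and only a few tens of MB of NN heap. The same 10 TB stored as 1 MB files
>> is ~10 million files and ~20 million objects, i.e. on the order of 4 GB of
>> NN heap. So it is the file count, not the raw volume, that drives NN memory.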
>>
>> Am I correct @Nitin?
>>
>> @Thoihen : As Nitin has said, with that much data you don't just do a
>> "put". You could use something like distcp for parallel copying. A better
>> approach would be to use a data aggregation tool like Flume or Chukwa, as
>> Nitin has already pointed out. Facebook uses its own data aggregation tool,
>> Scribe, for this purpose.
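>>
>> A minimal sketch of a distcp run (the hostnames and paths below are just
>> placeholders):
>>
>>     # copies /data/logs from cluster A's HDFS to cluster B's HDFS as a MapReduce job
>>     hadoop distcp hdfs://namenodeA:8020/data/logs hdfs://namenodeB:8020/data/logs
>>
>> distcp can also take a file:// source, but that path has to be visible to
>> every node that runs a map task, so for a plain local-to-HDFS upload
>> "hadoop fs -put" (or -copyFromLocal) is usually the simpler choice.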
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>
>>> The NN would still be in the picture because it will be writing a lot of
>>> metadata for each individual file. So you will need a NN capable of storing
>>> the metadata for your entire dataset. The data itself never goes to the NN,
>>> but a lot of metadata about the data will live on the NN, so it is always a
>>> good idea to have a strong NN.
>>>
>>>
>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> @Nitin, writing to HDFS in parallel is great, but I could not understand
>>>> what you mean by a capable NN. As I understand it, the NN is not part of
>>>> the actual data write pipeline, meaning the data does not travel through
>>>> the NN; the DFS client only contacts the NN from time to time to get the
>>>> locations of the DNs where the data blocks should be stored.
>>>>
>>>> Thanks,
>>>> Rahul
>>>>
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>
>>>>> Is it safe? There is no direct yes-or-no answer.
>>>>>
>>>>> When you say you have 10 TB worth of files and you want to upload them
>>>>> to HDFS, several factors come into the picture:
>>>>>
>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>
>>>>> Most importantly, I assume that you have a capable hadoop cluster.
>>>>> By that I mean you have a capable namenode.
>>>>>
>>>>> I would definitely not write files to HDFS sequentially. I would
>>>>> prefer to write files to hdfs in parallel, to utilize the DFS write
>>>>> features and speed up the process.
>>>>> You can run the hdfs put command in parallel, and in my experience it
>>>>> has not failed even when we write a lot of data.
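>>>>>
>>>>> A minimal sketch of what I mean (the directory names are just examples):
>>>>>
>>>>>     # push files from a local staging directory to HDFS, 8 uploads at a time
>>>>>     find /data/staging -type f -print0 | \
>>>>>         xargs -0 -P 8 -I{} hadoop fs -put {} /user/hadoop/incoming/
>>>>>
>>>>> Each put is an independent client, so the block writes go to the
>>>>> datanodes in parallel; the NN only hands out metadata and block locations.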
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com> wrote:
>>>>>
>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>
>>>>>> But I have one more question. Say I have 10 TB of data in the pipeline.
>>>>>>
>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these files
>>>>>> of size 10 TB, and is there any limit on file size when using the hadoop
>>>>>> command line? Can the hadoop put command work with huge data?
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <nitinpawar432@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> First of all, most companies do not get 100 PB of data in one go. It
>>>>>>> is an accumulating process, and most companies have a data pipeline in
>>>>>>> place where the data is written to hdfs on a regular frequency, retained
>>>>>>> on hdfs for some duration as needed, and from there sent to archival
>>>>>>> storage or deleted.
>>>>>>>
>>>>>>> For data management products, you can look at Falcon, which was open
>>>>>>> sourced by InMobi along with Hortonworks.
>>>>>>>
>>>>>>> In any case, if you want to write files to hdfs there are a few
>>>>>>> options available to you:
>>>>>>> 1) Write your own dfs client which writes to dfs
>>>>>>> 2) use hdfs proxy
>>>>>>> 3) there is webhdfs (see the sketch below)
>>>>>>> 4) command line hdfs
>>>>>>> 5) data collection tools that come with support for writing to hdfs,
>>>>>>> like flume etc.
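>>>>>>>
>>>>>>> A rough sketch of the webhdfs option (host, port and paths are
>>>>>>> placeholders; 50070 is the default NN HTTP port):
>>>>>>>
>>>>>>>     # step 1: ask the NN; it answers with a 307 redirect to a datanode
>>>>>>>     curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/hadoop/file1.log?op=CREATE"
>>>>>>>     # step 2: PUT the actual data to the datanode URL from the Location header
>>>>>>>     curl -i -X PUT -T file1.log "<datanode-url-from-Location-header>"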
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Can anyone help me understand how companies like Facebook, Yahoo, etc.
>>>>>>>> upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS
>>>>>>>> cluster for processing, and how, after processing, they download those
>>>>>>>> files from HDFS to the local file system?
>>>>>>>>
>>>>>>>> I don't think they would be using the command line hadoop fs put to
>>>>>>>> upload files, as it would take too long. Or do they divide the data
>>>>>>>> into, say, 10 parts of 10 petabytes each, compress them, and use the
>>>>>>>> command line hadoop fs put?
>>>>>>>>
>>>>>>>> Or do they use any tool to upload huge files?
>>>>>>>>
>>>>>>>> Please help me.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> thoihen
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> Nitin Pawar
>
