hadoop-common-user mailing list archives

From Nitin Pawar <nitinpawar...@gmail.com>
Subject Re: Hadoop noob question
Date Sun, 12 May 2013 12:06:44 GMT
you can do that using file:///

example:

hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
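
For the local-to-HDFS direction Rahul asked about below, the source and
destination are simply swapped; a rough sketch (the paths and the namenode
address are only placeholders):

hadoop distcp file:///Users/myhome/Desktop/somefile hdfs://localhost:8020/user/myhome/

Note that because distcp runs as a MapReduce job, the file:/// source has to
be readable from the nodes that run the map tasks; for data sitting on a
single client machine, a plain "hadoop fs -put" is usually the simpler choice.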



On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> @Tariq, can you point me to some resource which shows how distcp is used to
> upload files from local to HDFS?
>
> Isn't distcp an MR job? Wouldn't it need the data to already be present in
> Hadoop's fs?
>
> Rahul
>
>
>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>
>> You're welcome :)
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Thanks Tariq!
>>>
>>>
>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>
>>>> @Rahul : Yes. distcp can do that.
>>>>
>>>> And the bigger the files, the less metadata there is, and hence less
>>>> memory consumption.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> IMHO, I think the statement about the NN with regard to block metadata
>>>>> is more of a general statement. Even if you put lots of small files with
>>>>> a combined size of 10 TB, you need to have a capable NN.
>>>>>
>>>>> Can distcp be used to copy local-to-HDFS?
>>>>>
>>>>> Thanks,
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>
>>>>>> Absolutely right, Mohammad.
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>
>>>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>>>
>>>>>>> Every file and block in HDFS is treated as an object, and for each
>>>>>>> object around 200 B of metadata gets created. So the NN should be
>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>> in-memory. Actually, memory is the most important metric when it comes
>>>>>>> to the NN.
>>>>>>>
>>>>>>> Am I correct @Nitin?
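>>>>>>>
>>>>>>> A rough back-of-the-envelope illustration of why file size matters,
>>>>>>> using that ~200 B/object figure and 128 MB blocks (both only ballpark
>>>>>>> assumptions):
>>>>>>>
>>>>>>> 10 TB as 128 MB files -> ~82 K files + ~82 K blocks
>>>>>>>                          = ~164 K objects x 200 B = ~33 MB of NN heap
>>>>>>> 10 TB as 1 MB files   -> ~10.5 M files + ~10.5 M blocks
>>>>>>>                          = ~21 M objects x 200 B  = ~4 GB of NN heap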
>>>>>>>
>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data you
>>>>>>> don't actually just do a "put". You could use something like "distcp"
>>>>>>> for parallel copying. A better approach would be to use a data
>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed
>>>>>>> out. Facebook uses its own data aggregation tool, called Scribe, for
>>>>>>> this purpose.
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>
>>>>>>>> The NN would still be in the picture because it will be writing a lot
>>>>>>>> of metadata for each individual file. So you will need an NN capable
>>>>>>>> enough to store the metadata for your entire dataset. Data will never
>>>>>>>> go through the NN, but a lot of metadata about the data will be on the
>>>>>>>> NN, so it's always a good idea to have a strong NN.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> @Nitin, parallel dfs writes to HDFS are great, but I could not
>>>>>>>>> understand the meaning of a capable NN. As I know, the NN would not
>>>>>>>>> be a part of the actual data write pipeline, meaning that the data
>>>>>>>>> would not travel through the NN; the dfs client would contact the NN
>>>>>>>>> from time to time to get the locations of the DNs where the data
>>>>>>>>> blocks should be stored.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Rahul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Is it safe? There is no direct yes or no answer.
>>>>>>>>>>
>>>>>>>>>> When you say you have files worth 10 TB and you want to upload them
>>>>>>>>>> to HDFS, several factors come into the picture:
>>>>>>>>>>
>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>
>>>>>>>>>> And most importantly, I assume that you have a capable hadoop
>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>
>>>>>>>>>> I would definitely not write files sequentially to HDFS. I would
>>>>>>>>>> prefer to write files in parallel to HDFS to utilize the DFS write
>>>>>>>>>> features and speed up the process. You can run the hdfs put command
>>>>>>>>>> in a parallel manner, and in my experience it has not failed when we
>>>>>>>>>> write a lot of data.
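>>>>>>>>>>
>>>>>>>>>> For example, something along these lines (just a sketch; the local
>>>>>>>>>> and HDFS paths are placeholders) pushes a directory of local files
>>>>>>>>>> to HDFS in parallel from a single client:
>>>>>>>>>>
>>>>>>>>>> # start one background put per file, then wait for all of them
>>>>>>>>>> for f in /data/incoming/*; do
>>>>>>>>>>   hadoop fs -put "$f" /user/myhome/incoming/ &
>>>>>>>>>> done
>>>>>>>>>> wait
>>>>>>>>>>
>>>>>>>>>> In practice you would cap the number of concurrent puts (e.g. with
>>>>>>>>>> xargs -P) so you don't overwhelm the client machine's network link.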
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>>>>
>>>>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>>>>> pipeline.
>>>>>>>>>>>
>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these
>>>>>>>>>>> files of size 10 TB, and is there any limit to the file size when
>>>>>>>>>>> using the hadoop command line? Can the hadoop put command line work
>>>>>>>>>>> with huge data?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> First of all, most companies do not get 100 PB of data in one go.
>>>>>>>>>>>> It's an accumulating process, and most companies have a data
>>>>>>>>>>>> pipeline in place where the data is written to hdfs on a regular
>>>>>>>>>>>> frequency, retained on hdfs for some duration as needed, and from
>>>>>>>>>>>> there sent to archives or deleted.
>>>>>>>>>>>>
>>>>>>>>>>>> For data management products, you can look at Falcon, which is
>>>>>>>>>>>> open sourced by InMobi along with Hortonworks.
>>>>>>>>>>>>
>>>>>>>>>>>> In any case, if you want to write files to hdfs there are a few
>>>>>>>>>>>> options available to you:
>>>>>>>>>>>> 1) write your own dfs client which writes to dfs
>>>>>>>>>>>> 2) use an hdfs proxy
>>>>>>>>>>>> 3) use webhdfs (see the curl sketch below)
>>>>>>>>>>>> 4) use the hdfs command line
>>>>>>>>>>>> 5) use a data collection tool that comes with support for writing
>>>>>>>>>>>> to hdfs, like Flume etc.
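>>>>>>>>>>>>
>>>>>>>>>>>> As a rough illustration of option 3, creating a file over webhdfs
>>>>>>>>>>>> is a two-step operation (assuming the default webhdfs HTTP port
>>>>>>>>>>>> 50070; the host and paths here are just placeholders):
>>>>>>>>>>>>
>>>>>>>>>>>> # step 1: ask the namenode where to write; it replies with a
>>>>>>>>>>>> # 307 redirect whose Location header points at a datanode
>>>>>>>>>>>> curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/myhome/file.txt?op=CREATE"
>>>>>>>>>>>>
>>>>>>>>>>>> # step 2: send the actual bytes to that datanode Location
>>>>>>>>>>>> curl -i -X PUT -T file.txt "<Location header returned by step 1>"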
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can anyone help me understand how companies like Facebook, Yahoo,
>>>>>>>>>>>>> etc. upload bulk files, say to the tune of 100 petabytes, to a
>>>>>>>>>>>>> Hadoop HDFS cluster for processing, and how, after processing,
>>>>>>>>>>>>> they download those files from HDFS to the local file system?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think they are using the command line hadoop fs put to
>>>>>>>>>>>>> upload files, as it would take too long. Or do they divide it
>>>>>>>>>>>>> into, say, 10 parts of 10 petabytes each, compress them, and use
>>>>>>>>>>>>> the command line hadoop fs put?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Or do they use some tool to upload huge files?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please help me.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nitin Pawar
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Nitin Pawar
