From: Mohammad Tariq <dontariq@gmail.com>
Date: Sun, 12 May 2013 17:40:40 +0530
Subject: Re: Hadoop noob question
To: user@hadoop.apache.org

@Rahul : I'm sorry, I answered this on the wrong thread by mistake. You could
do that as Nitin has shown.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> you can do that using file:///
>
> example:
>
> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
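[Editor's note: Nitin's example copies from HDFS out to the local file system.
A minimal sketch of the opposite direction (local to HDFS) follows; the
NameNode address, port, and paths are placeholders, not taken from the thread.
Because distcp runs as a MapReduce job, a file:/// source must be readable
from the nodes that execute the map tasks; for data sitting on a single client
machine, a plain "hadoop fs -put" is usually the simpler choice.]

    # copy a local directory into HDFS with distcp (placeholder host and paths)
    hadoop distcp file:///data/staging/ hdfs://namenode:8020/user/rahul/staging/

    # the non-MapReduce alternative from a single client machine
    hadoop fs -put /data/staging/bigfile.dat /user/rahul/staging/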
On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:

> @Tariq, can you point me to some resource which shows how distcp is used to
> upload files from local to HDFS?
>
> Isn't distcp an MR job? Wouldn't it need the data to already be present in
> Hadoop's fs?
>
> Rahul

On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> You're welcome :)

On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:

> Thanks Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> @Rahul : Yes, distcp can do that.
>
> And the bigger the files, the less metadata and hence the less memory
> consumed.

On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:

> IMHO, the statement about the NN with regard to block metadata is more of a
> general statement: even if you put lots of small files with a combined size
> of 10 TB, you need to have a capable NN.
>
> Can distcp be used to copy local-to-HDFS?
>
> Thanks,
> Rahul

On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> Sorry for barging in, guys. I think Nitin is talking about this:
>
> Every file and block in HDFS is treated as an object, and for each object
> around 200 B of metadata gets created. So the NN should be powerful enough
> to handle that much metadata, since it is all held in memory. Memory is
> actually the most important metric when it comes to the NN.
>
> Am I correct, @Nitin?
>
> @Thoihen : As Nitin has said, when you talk about that much data you don't
> actually just do a "put". You could use something like distcp for parallel
> copying. A better approach would be to use a data aggregation tool like
> Flume or Chukwa, as Nitin has already pointed out. Facebook uses its own
> data aggregation tool, called Scribe, for this purpose.
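[Editor's note: a rough back-of-envelope illustration of the point above,
taking the ~200 B-per-object figure at face value and assuming a 128 MB block
size; both are approximations, and real sizing also depends on the Hadoop
version and replication settings.

    10 TB stored as 1 GB files:
        10,000 files + 80,000 blocks      = ~90,000 objects     * 200 B  ~ 18 MB of NN heap
    10 TB stored as 1 MB files:
        10,000,000 files + 10,000,000 blocks = ~20,000,000 objects * 200 B  ~ 4 GB of NN heap

The same 10 TB costs the NameNode orders of magnitude more memory when it
arrives as many small files, which is why "bigger files, less metadata" holds.]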
On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> The NN would still be in the picture because it will be writing a lot of
> metadata for each individual file, so you will need an NN capable enough to
> store the metadata for your entire dataset. Data will never go to the NN,
> but a lot of metadata about the data will be on the NN, so it is always a
> good idea to have a strong NN.

On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:

> @Nitin, parallel DFS writes to HDFS are great, but I could not understand
> the meaning of "capable NN". As I know it, the NN is not part of the actual
> data write pipeline, meaning the data does not travel through the NN; the
> DFS client contacts the NN from time to time to get the locations of the
> DNs where the data blocks should be stored.
>
> Thanks,
> Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> Is it safe? There is no direct yes-or-no answer.
>
> When you say you have files worth 10 TB and you want to upload them to
> HDFS, several factors come into the picture:
>
> 1) Is the machine in the same network as your Hadoop cluster?
> 2) Is there a guarantee that the network will not go down?
>
> And most importantly, I assume that you have a capable Hadoop cluster. By
> that I mean you have a capable namenode.
>
> I would definitely not write files sequentially to HDFS. I would prefer to
> write files in parallel to HDFS, to use the DFS write path to speed up the
> process. You can run the hdfs put command in a parallel manner, and in my
> experience it has not failed when we write a lot of data.
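[Editor's note: one hedged way to run "put" in parallel from a shell, along
the lines Nitin describes above. The source directory, target path, and the
parallelism of 8 are placeholders, not values from the thread.]

    # upload every file under /data/in with up to 8 concurrent "hadoop fs -put" processes
    find /data/in -type f -print0 | \
        xargs -0 -P 8 -I {} hadoop fs -put {} /user/thoihen/in/

Each xargs worker opens its own HDFS write pipeline, so several blocks stream
into the cluster at once; the practical limit is usually the client machine's
network link rather than the cluster itself.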
On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com> wrote:

> @Nitin Pawar, thanks for clearing my doubts.
>
> But I have one more question: say I have 10 TB of data in the pipeline.
>
> Is it perfectly OK to use the hadoop fs put command to upload these files of
> 10 TB in size, and is there any limit to the file size using the Hadoop
> command line? Can the hadoop put command line work with huge data?
>
> Thanks in advance
On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> First of all, most companies do not get 100 PB of data in one go. It is an
> accumulating process, and most companies have a data pipeline in place
> where the data is written to HDFS on a regular schedule, retained on HDFS
> for as long as needed, and from there sent to an archive or deleted.
>
> For data management products, you can look at Falcon, which is open sourced
> by InMobi along with Hortonworks.
>
> In any case, if you want to write files to HDFS there are a few options
> available to you:
>
> 1) write your own DFS client which writes to DFS
> 2) use HDFS proxy
> 3) use WebHDFS [sketched in the editor's note below]
> 4) use the command-line hdfs tools
> 5) use data collection tools that ship with HDFS support, like Flume etc.
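[Editor's note: a minimal sketch of option 3, WebHDFS, assuming it is enabled
on the cluster (dfs.webhdfs.enabled=true) and that the NameNode HTTP port is
the old default 50070; host names, the user, and paths are placeholders.
Creating a file is a two-step REST call.]

    # Step 1: ask the NameNode for a write location; the response is an
    # HTTP 307 redirect whose Location header points at a DataNode.
    curl -i -X PUT \
        "http://namenode:50070/webhdfs/v1/user/thoihen/in/bigfile.dat?op=CREATE&user.name=thoihen"

    # Step 2: send the file body to the DataNode URL returned in step 1.
    curl -i -X PUT -T bigfile.dat "<Location header value from step 1>"

The same op=CREATE call can be scripted per file, which makes WebHDFS an easy
target for upload tools that cannot link against the Java HDFS client.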


On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <thoihen123@gmail.com> wrote:

> Hi All,
>
> Can anyone help me understand how companies like Facebook, Yahoo, etc.
> upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS
> cluster for processing, and how they download those files from HDFS back to
> the local file system after processing?
>
> I don't think they would use the command-line "hadoop fs put" to upload the
> files, as it would take too long. Or do they divide the data into, say, 10
> parts of 10 petabytes each, compress them, and then use the command-line
> "hadoop fs put"? Or do they use some other tool to upload huge files?
>
> Please help me.
>
> Thanks,
> thoihen