Subject: Re: Hadoop noob question
From: Mohammad Tariq <dontariq@gmail.com>
To: user@hadoop.apache.org
Date: Sat, 11 May 2013 22:52:24 +0530

You're welcome :)

Warm Regards,
Tariq
cloudfront.blogspot.com


On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:

> Thanks Tariq!
>
>
> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq wrote:
>
>> @Rahul : Yes, distcp can do that.
>>
>> And the bigger the files, the less metadata there is, and hence the lower
>> the NameNode's memory consumption.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>>
>>> IMHO, the statement about the NN with regard to block metadata is more of
>>> a general statement. Even if you put lots of small files with a combined
>>> size of 10 TB, you still need a capable NN.
>>>
>>> Can distcp be used to copy local-to-HDFS?
>>>
>>> Thanks,
>>> Rahul
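A minimal sketch of the two copy routes discussed above, i.e. a plain put versus distcp with a file:// source; the paths, hostname, and port below are illustrative assumptions, not values from this thread:

    # Single-stream upload from a local directory with the ordinary CLI,
    # the simplest route for a one-off copy from an edge node:
    hadoop fs -put /data/staging/*.gz /user/thoihen/ingest/

    # distcp accepts a file:// source URI and runs the copy as a MapReduce
    # job; the local path must then be readable on the nodes that run the
    # map tasks (for example an NFS mount), otherwise stick with -put:
    hadoop distcp file:///data/staging hdfs://namenode:8020/user/thoihen/ingest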
>>>
>>>
>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar wrote:
>>>
>>>> Absolutely right, Mohammad.
>>>>
>>>>
>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq wrote:
>>>>
>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>
>>>>> Every file and block in HDFS is treated as an object, and for each object
>>>>> around 200 B of metadata gets created. So the NN has to be powerful enough
>>>>> to handle that much metadata, since it is all held in memory. Memory is
>>>>> actually the most important metric when it comes to the NN.
>>>>>
>>>>> Am I correct, @Nitin?
>>>>>
>>>>> @Thoihen : As Nitin has said, with that much data you don't usually just
>>>>> do a "put". You could use something like distcp for parallel copying. A
>>>>> better approach would be to use a data aggregation tool like Flume or
>>>>> Chukwa, as Nitin has already pointed out. Facebook uses its own data
>>>>> aggregation tool, called Scribe, for this purpose.
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar wrote:
>>>>>
>>>>>> The NN is still in the picture because it has to write a lot of metadata
>>>>>> for each individual file, so you need an NN capable of storing the
>>>>>> metadata for your entire dataset. The data itself never goes to the NN,
>>>>>> but a lot of metadata about the data lives on the NN, so it is always a
>>>>>> good idea to have a strong NN.
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>>>>>>
>>>>>>> @Nitin : writing to HDFS with parallel DFS clients is great, but I could
>>>>>>> not understand the meaning of "capable NN". As far as I know, the NN is
>>>>>>> not part of the actual data write pipeline, meaning the data does not
>>>>>>> travel through the NN; the DFS client contacts the NN from time to time
>>>>>>> to get the locations of the DNs where the data blocks should be stored.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>
>>>>>>>> Is it safe? There is no direct yes-or-no answer.
>>>>>>>>
>>>>>>>> When you say you have files worth 10 TB and you want to upload them to
>>>>>>>> HDFS, several factors come into the picture:
>>>>>>>>
>>>>>>>> 1) Is the machine in the same network as your Hadoop cluster?
>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>
>>>>>>>> And most importantly, I assume that you have a capable Hadoop cluster.
>>>>>>>> By that I mean you have a capable NameNode.
>>>>>>>>
>>>>>>>> I would definitely not write the files sequentially into HDFS. I would
>>>>>>>> prefer to write them in parallel to make use of the DFS write path and
>>>>>>>> speed up the process. You can run the hdfs put command in parallel, and
>>>>>>>> in my experience it has not failed when writing a lot of data.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns wrote:
>>>>>>>>
>>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>>
>>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>>> pipeline.
>>>>>>>>>
>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these
>>>>>>>>> files of size 10 TB, and is there any limit on the file size when
>>>>>>>>> using the Hadoop command line? Can the hadoop put command work with
>>>>>>>>> huge data?
>>>>>>>>>
>>>>>>>>> Thanks in advance
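A rough sketch of the parallel put Nitin describes, run from a single edge node; the directory layout, target path, and parallelism of 8 are assumptions made for illustration:

    # Push many files concurrently with independent `hadoop fs -put`
    # processes; each put opens its own write pipeline to the DataNodes,
    # while the NameNode only hands out block locations.
    find /data/staging -name '*.gz' | xargs -P 8 -I{} hadoop fs -put {} /user/thoihen/ingest/

Using the ~200 B-per-object figure quoted above, 10 TB stored as 128 MB blocks is roughly 80,000 block objects, i.e. only about 16 MB of NameNode heap by that estimate; the same 10 TB stored as millions of small files is what really strains the NN.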
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> First of all, most companies do not get 100 PB of data in one go. It
>>>>>>>>>> is an accumulating process, and most companies have a data pipeline in
>>>>>>>>>> place where data is written to HDFS on a regular schedule, retained on
>>>>>>>>>> HDFS for whatever duration is needed, and from there sent to archival
>>>>>>>>>> storage or deleted.
>>>>>>>>>>
>>>>>>>>>> For data management products, you can look at Falcon, which was open
>>>>>>>>>> sourced by InMobi along with Hortonworks.
>>>>>>>>>>
>>>>>>>>>> In any case, if you want to write files to HDFS, there are a few
>>>>>>>>>> options available to you:
>>>>>>>>>> 1) write your own DFS client which writes to DFS
>>>>>>>>>> 2) use the HDFS proxy
>>>>>>>>>> 3) use WebHDFS
>>>>>>>>>> 4) use the HDFS command line
>>>>>>>>>> 5) use data collection tools that ship with HDFS support, such as
>>>>>>>>>> Flume
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <thoihen123@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> Can anyone help me understand how companies like Facebook and Yahoo
>>>>>>>>>>> upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS
>>>>>>>>>>> cluster for processing, and how, after processing, they download
>>>>>>>>>>> those files from HDFS to the local file system?
>>>>>>>>>>>
>>>>>>>>>>> I don't think they use the command-line "hadoop fs put" to upload the
>>>>>>>>>>> files, as it would take too long. Or do they divide the data into,
>>>>>>>>>>> say, 10 parts of 10 petabytes each, compress them, and use the
>>>>>>>>>>> command-line "hadoop fs put"?
>>>>>>>>>>>
>>>>>>>>>>> Or do they use some tool to upload huge files?
>>>>>>>>>>>
>>>>>>>>>>> Please help me.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> thoihen
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nitin Pawar
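For completeness, a minimal sketch of the WebHDFS option (3) listed above; the hostnames, ports, user name, and file name are illustrative assumptions, and the exact DataNode URL comes back from the NameNode's redirect in step 1:

    # Step 1: ask the NameNode for a write location; it answers with an
    # HTTP 307 redirect whose Location header points at a DataNode:
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/ingest/file1.gz?op=CREATE&user.name=thoihen"

    # Step 2: stream the file body to the Location URL returned above
    # (shown schematically here; copy the URL verbatim from the redirect):
    curl -i -X PUT -T file1.gz "http://datanode:50075/webhdfs/v1/user/thoihen/ingest/file1.gz?op=CREATE&..."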