From: Rahul Bhattacharjee <rahul.rec.dgp@gmail.com>
Date: Sat, 11 May 2013 22:46:45 +0530
Subject: Re: Hadoop noob question
To: user@hadoop.apache.org

Thanks, Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
> @Rahul : Yes, distcp can do that.
>
> And the bigger the files, the less metadata there is, and hence less memory consumption.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>> IMHO, the statement about the NN with regard to block metadata is more of a general statement. Even if you put lots of small files with a combined size of 10 TB, you need to have a capable NN.
>>
>> Can distcp be used to copy local-to-HDFS?
>>
>> Thanks,
>> Rahul
>>
>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>> Absolutely right, Mohammad.
>>>
>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>> Sorry for barging in, guys.
>>>> I think Nitin is talking about this:
>>>>
>>>> Every file and block in HDFS is treated as an object, and for each object around 200 B of metadata gets created. So the NN should be powerful enough to handle that much metadata, since it is all going to be held in memory. Memory is actually the most important metric when it comes to the NN.
>>>>
>>>> Am I correct, @Nitin?
>>>>
>>>> @Thoihen : As Nitin has said, when you talk about that much data you don't actually just do a "put". You could use something like "distcp" for parallel copying. A better approach would be to use a data aggregation tool like Flume or Chukwa, as Nitin has already pointed out. Facebook uses its own data aggregation tool, called Scribe, for this purpose.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>> The NN would still be in the picture because it will be writing a lot of metadata for each individual file, so you will need an NN capable of storing the metadata for your entire dataset. Data will never go to the NN, but a lot of metadata about the data will live on the NN, so it is always a good idea to have a strong NN.
>>>>>
>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>>>>>> @Nitin, parallel dfs writes to HDFS are great, but I could not understand the meaning of a "capable NN". As far as I know, the NN is not part of the actual data write pipeline, meaning the data does not travel through the NN; the dfs client just contacts the NN from time to time to get the locations of the DNs on which to store the data blocks.
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>> Is it safe? There is no direct yes or no answer.
>>>>>>>
>>>>>>> When you say you have files worth 10 TB and you want to upload them to HDFS, several factors come into the picture:
>>>>>>>
>>>>>>> 1) Is the machine in the same network as your Hadoop cluster?
>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>
>>>>>>> And most importantly, I assume that you have a capable Hadoop cluster. By that I mean you have a capable namenode.
>>>>>>>
>>>>>>> I would definitely not write files sequentially to HDFS. I would prefer to write files in parallel to HDFS to utilize the DFS write features and speed up the process. You can run the hdfs put command in parallel, and in my experience it has not failed when we write a lot of data.
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com> wrote:
>>>>>>>> @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>
>>>>>>>> But I have one more question: say I have 10 TB of data in the pipeline.
>>>>>>>>
>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload these files of size 10 TB, and is there any limit to the file size when using the hadoop command line? Can the hadoop put command line work with huge data?
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>>>>> First of all, most companies do not get 100 PB of data in one go.
>>>>>>>>> It is an accumulating process, and most companies have a data pipeline in place where data is written to HDFS on a regular frequency, retained on HDFS for some duration as needed, and from there sent to archival storage or deleted.
>>>>>>>>>
>>>>>>>>> For data management products, you can look at Falcon, which was open sourced by InMobi along with Hortonworks.
>>>>>>>>>
>>>>>>>>> In any case, if you want to write files to HDFS, there are a few options available to you:
>>>>>>>>> 1) Write your own dfs client which writes to dfs
>>>>>>>>> 2) Use the HDFS proxy
>>>>>>>>> 3) There is WebHDFS
>>>>>>>>> 4) The command line hdfs tools
>>>>>>>>> 5) Data collection tools that come with support for writing to HDFS, like Flume etc.
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <thoihen123@gmail.com> wrote:
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> Can anyone help me understand how companies like Facebook, Yahoo etc. upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS cluster for processing, and how, after processing, they download those files from HDFS to the local file system?
>>>>>>>>>>
>>>>>>>>>> I don't think they would be using the command line "hadoop fs put" to upload files, as it would take too long. Or do they divide the data into, say, 10 parts of 10 petabytes each, compress them, and use the command line "hadoop fs put"?
>>>>>>>>>>
>>>>>>>>>> Or do they use some other tool to upload huge files?
>>>>>>>>>>
>>>>>>>>>> Please help me.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> thoihen
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nitin Pawar
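
A quick back-of-the-envelope on the "bigger files mean less metadata" point above, using the thread's own rough figure of about 200 B of NN heap per file/block object (the exact per-object cost depends on the Hadoop version, so treat these as order-of-magnitude numbers only):

    # 10 TB stored as 128 MB files (one block each): ~82k files + ~82k blocks
    echo $(( (10 * 1024 * 1024 / 128) * 2 * 200 ))   # ~33 MB of NN heap

    # the same 10 TB stored as 1 MB files: ~10.5M files + ~10.5M blocks
    echo $(( (10 * 1024 * 1024 / 1) * 2 * 200 ))     # ~4.2 GB of NN heap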
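
To make the "run hdfs put in parallel" and "distcp for local-to-HDFS" suggestions concrete, here is a minimal sketch. The paths, hostname, and port are made-up placeholders, and the file:// variant of distcp only helps if the source directory is visible from the cluster nodes that run the copy tasks (for example over an NFS mount), since distcp runs as a MapReduce job on the cluster:

    # Run several puts at once instead of one sequential put
    # (here at most 8 concurrent "hadoop fs -put" processes).
    ls /data/incoming/*.gz | xargs -n 1 -P 8 -I {} \
        hadoop fs -put {} /user/thoihen/incoming/

    # distcp with a file:// source, assuming /data/incoming is reachable
    # from the nodes running the copy tasks; otherwise stick to plain
    # puts from an edge node.
    hadoop distcp file:///data/incoming hdfs://namenode:8020/user/thoihen/incoming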
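
And for option 3) above (WebHDFS), a file can be pushed over plain HTTP in two steps. The host names, the 50070/50075 ports (the usual defaults on Hadoop 1.x/2.x), and the user name below are placeholders:

    # Step 1: ask the NN to create the file; it replies with a 307 redirect
    # whose Location header points at a DN.
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/big.gz?op=CREATE&user.name=thoihen"

    # Step 2: send the actual bytes to the DN URL returned in the Location
    # header of step 1 (a placeholder DN address is shown here).
    curl -i -X PUT -T big.gz "http://datanode:50075/webhdfs/v1/user/thoihen/big.gz?op=CREATE&user.name=thoihen"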