From: Mohammad Tariq <dontariq@gmail.com>
Date: Sun, 12 May 2013 17:40:40 +0530
Subject: Re: Hadoop noob question
To: user@hadoop.apache.org

@Rahul : I'm sorry, I answered this on the wrong thread by mistake. You could
do that as Nitin has shown.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> you can do that using file:///
>
> example:
>
> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
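[Editor's note: Nitin's example copies from HDFS out to the local file system.
A minimal sketch of the opposite direction (local to HDFS) follows; the
NameNode address, port, and paths are placeholders, not taken from the thread.
Because distcp runs as a MapReduce job, a file:/// source must be readable
from the nodes that execute the map tasks; for data sitting on a single client
machine, a plain "hadoop fs -put" is usually the simpler choice.]

    # copy a local directory into HDFS with distcp (placeholder host and paths)
    hadoop distcp file:///data/staging/ hdfs://namenode:8020/user/rahul/staging/

    # the non-MapReduce alternative from a single client machine
    hadoop fs -put /data/staging/bigfile.dat /user/rahul/staging/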
On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:

> @Tariq, can you point me to some resource which shows how distcp is used to
> upload files from local to HDFS?
>
> Isn't distcp an MR job? Wouldn't it need the data to already be present in
> Hadoop's fs?
>
> Rahul

On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> You're welcome :)

On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:

> Thanks Tariq!

On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> @Rahul : Yes, distcp can do that.
>
> And the bigger the files, the less metadata and hence the less memory
> consumed.

On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:

> IMHO, the statement about the NN with regard to block metadata is more of a
> general statement: even if you put lots of small files with a combined size
> of 10 TB, you need to have a capable NN.
>
> Can distcp be used to copy local-to-HDFS?
>
> Thanks,
> Rahul

On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> Absolutely right, Mohammad.

On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> Sorry for barging in, guys. I think Nitin is talking about this:
>
> Every file and block in HDFS is treated as an object, and for each object
> around 200 B of metadata gets created. So the NN should be powerful enough
> to handle that much metadata, since it is all held in memory. Memory is
> actually the most important metric when it comes to the NN.
>
> Am I correct, @Nitin?
>
> @Thoihen : As Nitin has said, when you talk about that much data you don't
> actually just do a "put". You could use something like distcp for parallel
> copying. A better approach would be to use a data aggregation tool like
> Flume or Chukwa, as Nitin has already pointed out. Facebook uses its own
> data aggregation tool, called Scribe, for this purpose.
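[Editor's note: a rough back-of-envelope illustration of the point above,
taking the ~200 B-per-object figure at face value and assuming a 128 MB block
size; both are approximations, and real sizing also depends on the Hadoop
version and replication settings.

    10 TB stored as 1 GB files:
        10,000 files + 80,000 blocks      = ~90,000 objects     * 200 B  ~ 18 MB of NN heap
    10 TB stored as 1 MB files:
        10,000,000 files + 10,000,000 blocks = ~20,000,000 objects * 200 B  ~ 4 GB of NN heap

The same 10 TB costs the NameNode orders of magnitude more memory when it
arrives as many small files, which is why "bigger files, less metadata" holds.]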
On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> The NN would still be in the picture because it will be writing a lot of
> metadata for each individual file, so you will need an NN capable enough to
> store the metadata for your entire dataset. Data will never go to the NN,
> but a lot of metadata about the data will be on the NN, so it is always a
> good idea to have a strong NN.

On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:

> @Nitin, parallel DFS writes to HDFS are great, but I could not understand
> the meaning of "capable NN". As I know it, the NN is not part of the actual
> data write pipeline, meaning the data does not travel through the NN; the
> DFS client contacts the NN from time to time to get the locations of the
> DNs where the data blocks should be stored.
>
> Thanks,
> Rahul

On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> Is it safe? There is no direct yes-or-no answer.
>
> When you say you have files worth 10 TB and you want to upload them to
> HDFS, several factors come into the picture:
>
> 1) Is the machine in the same network as your Hadoop cluster?
> 2) Is there a guarantee that the network will not go down?
>
> And most importantly, I assume that you have a capable Hadoop cluster. By
> that I mean you have a capable namenode.
>
> I would definitely not write files sequentially to HDFS. I would prefer to
> write files in parallel to HDFS, to use the DFS write path to speed up the
> process. You can run the hdfs put command in a parallel manner, and in my
> experience it has not failed when we write a lot of data.
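[Editor's note: one hedged way to run "put" in parallel from a shell, along
the lines Nitin describes above. The source directory, target path, and the
parallelism of 8 are placeholders, not values from the thread.]

    # upload every file under /data/in with up to 8 concurrent "hadoop fs -put" processes
    find /data/in -type f -print0 | \
        xargs -0 -P 8 -I {} hadoop fs -put {} /user/thoihen/in/

Each xargs worker opens its own HDFS write pipeline, so several blocks stream
into the cluster at once; the practical limit is usually the client machine's
network link rather than the cluster itself.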
On Sat, May 11, 2013 at 4:38 PM, maisnam ns <maisnam.ns@gmail.com> wrote:

> @Nitin Pawar, thanks for clearing my doubts.
>
> But I have one more question: say I have 10 TB of data in the pipeline.
>
> Is it perfectly OK to use the hadoop fs put command to upload these files of
> 10 TB in size, and is there any limit to the file size using the Hadoop
> command line? Can the hadoop put command line work with huge data?
>
> Thanks in advance
On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> First of all, most companies do not get 100 PB of data in one go. It is an
> accumulating process, and most companies have a data pipeline in place
> where the data is written to HDFS on a regular schedule, retained on HDFS
> for as long as needed, and from there sent to an archive or deleted.
>
> For data management products, you can look at Falcon, which is open sourced
> by InMobi along with Hortonworks.
>
> In any case, if you want to write files to HDFS there are a few options
> available to you:
>
> 1) write your own DFS client which writes to DFS
> 2) use HDFS proxy
> 3) use WebHDFS [sketched in the editor's note below]
> 4) use the command-line hdfs tools
> 5) use data collection tools that ship with HDFS support, like Flume etc.
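[Editor's note: a minimal sketch of option 3, WebHDFS, assuming it is enabled
on the cluster (dfs.webhdfs.enabled=true) and that the NameNode HTTP port is
the old default 50070; host names, the user, and paths are placeholders.
Creating a file is a two-step REST call.]

    # Step 1: ask the NameNode for a write location; the response is an
    # HTTP 307 redirect whose Location header points at a DataNode.
    curl -i -X PUT \
        "http://namenode:50070/webhdfs/v1/user/thoihen/in/bigfile.dat?op=CREATE&user.name=thoihen"

    # Step 2: send the file body to the DataNode URL returned in step 1.
    curl -i -X PUT -T bigfile.dat "<Location header value from step 1>"

The same op=CREATE call can be scripted per file, which makes WebHDFS an easy
target for upload tools that cannot link against the Java HDFS client.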


On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <thoihen123@gmail.com> wrote:

> Hi All,
>
> Can anyone help me understand how companies like Facebook, Yahoo, etc.
> upload bulk files, say to the tune of 100 petabytes, to a Hadoop HDFS
> cluster for processing, and how they download those files from HDFS back to
> the local file system after processing?
>
> I don't think they would use the command-line "hadoop fs put" to upload the
> files, as it would take too long. Or do they divide the data into, say, 10
> parts of 10 petabytes each, compress them, and then use the command-line
> "hadoop fs put"? Or do they use some other tool to upload huge files?
>
> Please help me.
>
> Thanks,
> thoihen