hadoop-mapreduce-user mailing list archives

From Chris Mawata <chris.maw...@gmail.com>
Subject Re: Hadoop noob question
Date Sun, 12 May 2013 14:21:19 GMT
It is being read sequentially, but is it not potentially being written to
multiple drives? And since reading is typically faster than writing, don't
you still get a little benefit from parallelism?

On 5/12/2013 8:55 AM, Mohammad Tariq wrote:
> I had said that if you use distcp to copy data *from localFS to HDFS* 
> then you won't be able to exploit parallelism, as the entire file is 
> present on a single machine. So no multiple TTs.
>
> Please comment if you think I am wrong somewhere.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee
> <rahul.rec.dgp@gmail.com> wrote:
>
>     Yes, it's an MR job under the hood. My question was about your
>     statement that using distcp you lose the benefits of Hadoop's
>     parallel processing. I think the MR job of distcp divides files
>     into individual map tasks based on the total size of the transfer,
>     so multiple mappers would still be spawned if the size of the
>     transfer is huge, and they would work in parallel.
>
>     Correct me if there is anything wrong!
>
>     Thanks,
>     Rahul
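>
>     (For illustration, a sketch with made-up paths; the -m option
>     really does set an upper bound on the number of map tasks that
>     distcp spawns:
>
>         hadoop distcp -m 20 hdfs://nn1:8020/src hdfs://nn2:8020/dst
>
>     so a big enough transfer gets split across up to 20 parallel maps.)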
>
>
>     On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq
>     <dontariq@gmail.com> wrote:
>
>         No. distcp is actually a mapreduce job under the hood.
>
>         Warm Regards,
>         Tariq
>         cloudfront.blogspot.com
>
>
>         On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee
>         <rahul.rec.dgp@gmail.com> wrote:
>
>             Thanks to both of you!
>
>             Rahul
>
>
>             On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar
>             <nitinpawar432@gmail.com> wrote:
>
>                 you can do that using file:///
>
>                 example:
>
>                     hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
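>
>                 (A sketch of the reverse direction too, with made-up
>                 paths, copying from the local FS up to HDFS:
>
>                     hadoop distcp file:///Users/myhome/data hdfs://localhost:8020/user/myhome/data
>
>                 with the caveat discussed above: the file:// path has
>                 to be visible to whichever nodes run the distcp maps.)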
>
>
>
>                 On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee
>                 <rahul.rec.dgp@gmail.com> wrote:
>
>                     @Tariq can you point me to some resource which
>                     shows how distcp is used to upload files from
>                     local to hdfs?
>
>                     isn't distcp an MR job? wouldn't it need the data
>                     to be already present in hadoop's fs?
>
>                     Rahul
>
>
>                     On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq
>                     <dontariq@gmail.com> wrote:
>
>                         You're welcome :)
>
>                         Warm Regards,
>                         Tariq
>                         cloudfront.blogspot.com
>
>
>                         On Sat, May 11, 2013 at 10:46 PM, Rahul
>                         Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>
>                             Thanks Tariq!
>
>
>                             On Sat, May 11, 2013 at 10:34 PM, Mohammad
>                             Tariq <dontariq@gmail.com> wrote:
>
>                                 @Rahul : Yes. distcp can do that.
>
>                                 And the bigger the files, the less the
>                                 metadata, hence less memory consumption.
>
>                                 Warm Regards,
>                                 Tariq
>                                 cloudfront.blogspot.com
>
>
>                                 On Sat, May 11, 2013 at 9:40 PM, Rahul
>                                 Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
>
>                                     IMHO, I think the statement about
>                                     the NN with regard to block metadata
>                                     is more of a general statement. Even
>                                     if you put lots of small files with
>                                     a combined size of 10 TB, you need
>                                     to have a capable NN.
>
>                                     can distcp be used to copy local to
>                                     hdfs?
>
>                                     Thanks,
>                                     Rahul
>
>
>                                     On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar
>                                     <nitinpawar432@gmail.com> wrote:
>
>                                         absolutely right Mohammad
>
>
>                                         On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq
>                                         <dontariq@gmail.com> wrote:
>
>                                             Sorry for barging in, guys. I think Nitin is talking about this:
>
>                                             Every file and block in HDFS is treated as an object, and for
>                                             each object around 200 B of metadata gets created. So the NN
>                                             should be powerful enough to handle that much metadata, since it
>                                             is going to be in memory. Actually, memory is the most important
>                                             metric when it comes to the NN.
>
>                                             Am I correct @Nitin?
>
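>                                             (To make that concrete: a back-of-the-envelope sketch with
>                                             made-up numbers, assuming ~200 B per object and one object for
>                                             the file plus one per block:
>
>                                                 # 10 TB as 128 MB single-block files: ~81,920 files -> ~163,840 objects
>                                                 echo $(( 10 * 1024 * 1024 / 128 * 2 * 200 / 1024 / 1024 ))  # ~31 MB of NN heap
>                                                 # the same 10 TB as 1 MB files: ~10.5M files -> ~21M objects
>                                                 echo $(( 10 * 1024 * 1024 * 2 * 200 / 1024 / 1024 / 1024 )) # ~3 GB of NN heap
>
>                                             which is why lots of small files hurt the NN far more than a few
>                                             big ones.)
>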
>                                             @Thoihen : As Nitin has said, when you talk about that much data
>                                             you don't actually just do a "put". You could use something like
>                                             "distcp" for parallel copying. A better approach would be to use
>                                             a data aggregation tool like Flume or Chukwa, as Nitin has
>                                             already pointed out. Facebook uses their own data aggregation
>                                             tool, called Scribe, for this purpose.
>
>                                             Warm Regards,
>                                             Tariq
>                                             cloudfront.blogspot.com
>
>
>                                             On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar
>                                             <nitinpawar432@gmail.com> wrote:
>
>                                                 NN would still be in the picture because it will be writing
>                                                 a lot of metadata for each individual file. So you will need
>                                                 an NN capable enough to store the metadata for your entire
>                                                 dataset. Data will never go to the NN, but a lot of metadata
>                                                 about the data will be on the NN, so it's always a good idea
>                                                 to have a strong NN.
>
>
>                                                 On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee
>                                                 <rahul.rec.dgp@gmail.com> wrote:
>
>                                                     @Nitin, parallel dfs writes to hdfs are great, but I
>                                                     could not understand the meaning of a capable NN. As I
>                                                     know, the NN is not part of the actual data write
>                                                     pipeline, meaning the data does not travel through the
>                                                     NN; the dfs client just contacts the NN from time to
>                                                     time to get the locations of the DNs where the data
>                                                     blocks should be stored.
>
>                                                     Thanks,
>                                                     Rahul
>
>
>
>                                                     On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar
>                                                     <nitinpawar432@gmail.com> wrote:
>
>                                                         is it safe? .. there is no direct yes-or-no answer
>
>                                                         when you say you have files worth 10 TB that you
>                                                         want to upload to HDFS, several factors come into
>                                                         the picture:
>
>                                                         1) Is the machine in the same network as your
>                                                         hadoop cluster?
>                                                         2) Is there a guarantee that the network will not
>                                                         go down?
>
>                                                         and most importantly, I assume that you have a
>                                                         capable hadoop cluster. By that I mean you have a
>                                                         capable namenode.
>
>                                                         I would definitely not write files sequentially to
>                                                         HDFS. I would prefer to write files in parallel to
>                                                         hdfs, to utilize the DFS write features and speed
>                                                         up the process. You can run the hdfs put command in
>                                                         a parallel manner, and in my experience it has not
>                                                         failed when we write a lot of data.
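>
>                                                         (for example, something along these lines; the
>                                                         paths are made up, just a sketch:
>
>                                                             ls /data/incoming | xargs -P 8 -I {} \
>                                                               hadoop fs -put /data/incoming/{} /user/me/incoming/
>
>                                                         which runs up to 8 puts at a time.)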
>
>
>                                                         On Sat, May 11, 2013 at 4:38 PM, maisnam ns
>                                                         <maisnam.ns@gmail.com> wrote:
>
>                                                             @Nitin Pawar, thanks for clearing my doubts.
>
>                                                             But I have one more question. Say I have 10 TB
>                                                             of data in the pipeline. Is it perfectly OK to
>                                                             use the hadoop fs put command to upload these
>                                                             files of size 10 TB, and is there any limit to
>                                                             the file size using the hadoop command line?
>                                                             Can the hadoop put command work with huge data?
>
>                                                             Thanks in advance
>
>
>                                                             On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar
>                                                             <nitinpawar432@gmail.com> wrote:
>
>                                                                 first of all .. most of the companies do not
>                                                                 get 100 PB of data in one go. It's an
>                                                                 accumulating process, and most of the
>                                                                 companies have a data pipeline in place
>                                                                 where the data is written to hdfs on a
>                                                                 frequency basis, then retained on hdfs for
>                                                                 some duration as needed, and from there sent
>                                                                 to archivers or deleted.
>
>                                                                 For data management products, you can look
>                                                                 at Falcon, which is open sourced by InMobi
>                                                                 along with Hortonworks.
>
>                                                                 In any case, if you want to write files to
>                                                                 hdfs there are a few options available to you:
>                                                                 1) write your own dfs client which writes to dfs
>                                                                 2) use hdfs proxy
>                                                                 3) there is webhdfs (see the sketch below)
>                                                                 4) command line hdfs
>                                                                 5) data collection tools come with support
>                                                                 to write to hdfs, like flume etc
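>
>                                                                 (for example, option 3 over plain HTTP is a
>                                                                 two-step dance; a sketch with a made-up host
>                                                                 and path, assuming the NN HTTP interface on
>                                                                 the old default port 50070:
>
>                                                                     # step 1: the NN answers with a 307 redirect to a datanode
>                                                                     curl -i -X PUT \
>                                                                       "http://namenode:50070/webhdfs/v1/user/me/file.txt?op=CREATE&user.name=me"
>                                                                     # step 2: send the bytes to the Location: URL from step 1
>                                                                     curl -i -X PUT -T file.txt "<Location URL from step 1>"
>
>                                                                 no client libraries needed, just HTTP.)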
>
>
>                                                                 On Sat, May 11, 2013 at 4:19 PM, Thoihen
>                                                                 Maibam <thoihen123@gmail.com> wrote:
>
>                                                                     Hi All,
>
>                                                                     Can anyone help me know how companies
>                                                                     like Facebook, Yahoo etc. upload bulk
>                                                                     files, say to the tune of 100 petabytes,
>                                                                     to a Hadoop HDFS cluster for processing,
>                                                                     and after processing how they download
>                                                                     those files from HDFS to the local file
>                                                                     system?
>
>                                                                     I don't think they would be using the
>                                                                     command line hadoop fs put to upload
>                                                                     files, as it would take too long. Or do
>                                                                     they divide it into, say, 10 parts of 10
>                                                                     petabytes each, compress them, and use
>                                                                     the command line hadoop fs put?
>
>                                                                     Or do they use some tool to upload huge
>                                                                     files?
>
>                                                                     Please help me.
>
>                                                                     Thanks
>                                                                     thoihen
>
>
>
>
>                                                                 -- 
>                                                                 Nitin
>                                                                 Pawar
>
>
>
>
>
>                                                         -- 
>                                                         Nitin Pawar
>
>
>
>
>
>                                                 -- 
>                                                 Nitin Pawar
>
>
>
>
>
>                                         -- 
>                                         Nitin Pawar
>
>                 -- 
>                 Nitin Pawar
>
>
>
>
>

