hadoop-common-user mailing list archives

From Rahul Bhattacharjee <rahul.rec....@gmail.com>
Subject Re: Hadoop noob question
Date Thu, 16 May 2013 14:18:48 GMT
Just wanted to bring one thing up.

Using distcp to upload a local file to HDFS might not work if it is launched from a
gateway host. Gateway hosts are typically configured only to submit jobs and are
only aware of the NN and JT, so the mappers running on the various data nodes would
not have access to the local fs where the file resides.

distcp is possible when the data is loaded onto the local fs of one of the
datanodes and distcp is then run from there.
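
For example (hostname and paths below are just placeholders), something along
these lines, run from one of the datanodes, should do it:

hadoop distcp file:///data/staging/bigfile.dat hdfs://namenode:8020/user/rahul/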

Thanks,
Rahul


On Sun, May 12, 2013 at 7:51 PM, Chris Mawata <chris.mawata@gmail.com> wrote:

>  It is being read sequentially, but is it not potentially being written on
> multiple drives? And since reading is typically faster than writing, don't
> you still get a little benefit of parallelism?
>
>
> On 5/12/2013 8:55 AM, Mohammad Tariq wrote:
>
> I had said that if you use distcp to copy data *from localFS to HDFS* then
> you won't be able to exploit parallelism, as the entire file is present on
> a single machine. So no multiple TTs.
>
>  Please comment if you think I am wrong somewhere.
>
>  Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>>  Yes, it's an MR job under the hood. My question was that you wrote
>> that using distcp you lose the benefits of parallel processing of Hadoop.
>> I think the MR job of distcp divides files into individual map tasks based
>> on the total size of the transfer, so multiple mappers would still be
>> spawned if the size of the transfer is huge, and they would work in parallel.
>>
>>  Correct me if there is anything wrong!
>>
>> Thanks,
>> Rahul
>>
>>
>>  On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>
>>> No. distcp is actually a mapreduce job under the hood.
>>>
>>>  Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>>  Thanks to both of you!
>>>>
>>>>   Rahul
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>
>>>>> you can do that using file:///
>>>>>
>>>>>  example:
>>>>>
>>>>>  hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>>
>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>>  @Tariq can you point me to some resource which shows how distcp is
>>>>>> used to upload files from local to hdfs.
>>>>>>
>>>>>>  Isn't distcp an MR job? Wouldn't it need the data to be already
>>>>>> present in Hadoop's fs?
>>>>>>
>>>>>>   Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>
>>>>>>> You're welcome :)
>>>>>>>
>>>>>>>  Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>>  Thanks Tariq!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <
>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>
>>>>>>>>>  And, the bigger the files, the lesser the metadata, hence lesser memory
>>>>>>>>> consumption.
>>>>>>>>>
>>>>>>>>>  Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>>  IMHO, I think the statement about the NN with regard to block
>>>>>>>>>> metadata is more like a general statement. Even if you put lots of small
>>>>>>>>>> files of combined size 10 TB, you need to have a capable NN.
>>>>>>>>>>
>>>>>>>>>> Can distcp be used to copy local-to-hdfs?
>>>>>>>>>>
>>>>>>>>>>  Thanks,
>>>>>>>>>>  Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> absolutely right, Mohammad
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this:
>>>>>>>>>>>>
>>>>>>>>>>>>  Every file and block in HDFS is treated as an object, and for
>>>>>>>>>>>> each object around 200B of metadata gets created. So the NN should be
>>>>>>>>>>>> powerful enough to handle that much metadata, since it is going to be
>>>>>>>>>>>> in-memory. Actually, memory is the most important metric when it comes to
>>>>>>>>>>>> the NN.
>>>>>>>>>>>>
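>>>>>>>>>>>>  As a rough back-of-the-envelope illustration of that figure: 10 million
>>>>>>>>>>>> single-block files would be ~20 million objects (a file and a block each),
>>>>>>>>>>>> i.e. roughly 20M x 200B = ~4 GB of NN heap just for the metadata.
>>>>>>>>>>>>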
>>>>>>>>>>>>  Am I correct @Nitin?
>>>>>>>>>>>>
>>>>>>>>>>>>  @Thoihen: As Nitin has said, when you talk about that much
>>>>>>>>>>>> data you don't actually just do a "put". You could use something like
>>>>>>>>>>>> "distcp" for parallel copying. A better approach would be to use a data
>>>>>>>>>>>> aggregation tool like Flume or Chukwa, as Nitin has already pointed out.
>>>>>>>>>>>> Facebook uses their own data aggregation tool, called Scribe, for this
>>>>>>>>>>>> purpose.
>>>>>>>>>>>>
>>>>>>>>>>>>  Warm Regards,
>>>>>>>>>>>> Tariq
>>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The NN would still be in the picture because it will be writing a lot
>>>>>>>>>>>>> of metadata for each individual file, so you will need an NN capable enough
>>>>>>>>>>>>> to store the metadata for your entire dataset. Data will never go to the
>>>>>>>>>>>>> NN, but a lot of metadata about the data will be on the NN, so it's always
>>>>>>>>>>>>> a good idea to have a strong NN.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>  @Nitin, parallel dfs writes to hdfs are great, but I could
>>>>>>>>>>>>>> not understand the meaning of a capable NN. As I know, the NN would not be
>>>>>>>>>>>>>> a part of the actual data write pipeline, meaning that the data would not
>>>>>>>>>>>>>> travel through the NN; the dfs would contact the NN from time to time to
>>>>>>>>>>>>>> get the locations of the DNs where the data blocks should be stored.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is it safe? .. there is no direct answer, yes or no.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  When you say you have files worth 10TB and you want to upload
>>>>>>>>>>>>>>> them to HDFS, several factors come into the picture:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  And most importantly, I assume that you have a capable
>>>>>>>>>>>>>>> hadoop cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  I would definitely not write files sequentially to HDFS. I
>>>>>>>>>>>>>>> would prefer to write files in parallel to hdfs to utilize the DFS write
>>>>>>>>>>>>>>> features to speed up the process.
>>>>>>>>>>>>>>> You can run the hdfs put command in a parallel manner, and in my
>>>>>>>>>>>>>>> experience it has not failed when we write a lot of data.
>>>>>>>>>>>>>>>
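>>>>>>>>>>>>>>> Something roughly like this is what I mean (the local paths and the
>>>>>>>>>>>>>>> target dir are just placeholders):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> for f in /data/staging/part-*; do
>>>>>>>>>>>>>>>   hadoop fs -put "$f" /user/thoihen/input/ &
>>>>>>>>>>>>>>> done
>>>>>>>>>>>>>>> wait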
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   @Nitin Pawar, thanks for clearing my doubts.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  But I have one more question, say I have 10 TB of data in
>>>>>>>>>>>>>>>> the pipeline.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  Is it perfectly OK to use the hadoop fs put command to upload
>>>>>>>>>>>>>>>> these files of size 10 TB, and is there any limit to the file size using
>>>>>>>>>>>>>>>> the hadoop command line? Can the hadoop put command line work with huge data?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  Thanks in advance
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> First of all .. most of the companies do not get 100 PB of
>>>>>>>>>>>>>>>>> data in one go. It's an accumulating process, and most of the companies
>>>>>>>>>>>>>>>>> have a data pipeline in place where the data is written to hdfs on a
>>>>>>>>>>>>>>>>> regular basis, then retained on hdfs for some duration as
>>>>>>>>>>>>>>>>> needed, and from there it's sent to archivers or deleted.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  For data management products, you can look at Falcon,
>>>>>>>>>>>>>>>>> which is open sourced by InMobi along with Hortonworks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  In any case, if you want to write files to hdfs there are a
>>>>>>>>>>>>>>>>> few options available to you:
>>>>>>>>>>>>>>>>> 1) write your own dfs client which writes to dfs
>>>>>>>>>>>>>>>>> 2) use hdfs proxy
>>>>>>>>>>>>>>>>> 3) there is webhdfs (a quick sketch is below)
>>>>>>>>>>>>>>>>> 4) command line hdfs
>>>>>>>>>>>>>>>>> 5) data collection tools come with support to write to
>>>>>>>>>>>>>>>>> hdfs, like flume etc
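>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For the webhdfs route, the two-step REST call looks roughly like
>>>>>>>>>>>>>>>>> this (host, port, path and user are placeholders; the first PUT
>>>>>>>>>>>>>>>>> returns a redirect to a datanode, and the second sends the data):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/thoihen/data.txt?op=CREATE&user.name=thoihen"
>>>>>>>>>>>>>>>>> curl -i -X PUT -T data.txt "<datanode-location-from-the-redirect>"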
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    Hi All,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Can anyone help me know how companies like Facebook,
>>>>>>>>>>>>>>>>>> Yahoo etc. upload bulk files, say to the tune of 100 petabytes, to a
>>>>>>>>>>>>>>>>>> Hadoop HDFS cluster for processing,
>>>>>>>>>>>>>>>>>>  and after processing how they download those files from
>>>>>>>>>>>>>>>>>> HDFS to the local file system.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  I don't think they would be using the command line
>>>>>>>>>>>>>>>>>> hadoop fs put to upload files, as it would take too long. Or do they
>>>>>>>>>>>>>>>>>> divide it into, say, 10 parts of 10 petabytes each, compress them, and
>>>>>>>>>>>>>>>>>> use the command line hadoop fs put?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Or do they use any tool to upload huge files?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Please help me.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>  Thanks
>>>>>>>>>>>>>>>>>>  thoihen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   --
>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>   --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>   --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>   --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>
>
