hadoop-hdfs-user mailing list archives

From Rahul Bhattacharjee <rahul.rec....@gmail.com>
Subject Re: Hadoop noob question
Date Sun, 12 May 2013 13:05:44 GMT
Soon after replying I realized something else related to this.

Say we have a single 1 GB file in HDFS (with the default block size of 64 MB).
Now if we use distcp to copy it from the current HDFS cluster to another one,
would there be any parallelism, or would just a single map task be fired?

As per what I have read, a mapper is launched for a complete file or a set
of files; it doesn't operate at the block level. So there would be no
parallelism even though the file resides in HDFS.
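
For reference, copying that single file between two clusters would be invoked
roughly like this (the namenode hostnames and paths are only placeholders);
since distcp splits work per file rather than per block, this particular
transfer would get just one map task:

hadoop distcp hdfs://nn1:8020/data/bigfile hdfs://nn2:8020/data/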

Thanks,
Rahul


On Sun, May 12, 2013 at 6:28 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Yeah, you are right, I misread your earlier post.
>
> Thanks,
> Rahul
>
>
> On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>
>> I had said that if you use distcp to copy data *from localFS to HDFS* then
>> you won't be able to exploit parallelism, as the entire file is present on
>> a single machine. So no multiple TTs.
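>>
>> As a rough illustration (the paths are made up), such a copy would be
>> started like this; the point above is that the source data sits on a single
>> machine, so there is little for multiple TTs to parallelise:
>>
>> hadoop distcp file:///home/user/input hdfs://namenode:8020/user/input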
>>
>> Please comment if you think I am wrong somewhere.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Yes, it's an MR job under the hood. My question was about your point that
>>> using distcp you lose the benefits of Hadoop's parallel processing. I think
>>> the MR job of distcp divides files into individual map tasks based on the
>>> total size of the transfer, so multiple mappers would still be spawned if
>>> the size of the transfer is huge, and they would work in parallel.
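>>>
>>> As a side note, the map count for a big multi-file transfer can also be
>>> capped explicitly with the -m option (the number and paths below are only
>>> an example):
>>>
>>> hadoop distcp -m 20 hdfs://nn1:8020/src hdfs://nn2:8020/dst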
>>>
>>> Correct me if there is anything wrong!
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>
>>>> No. distcp is actually a mapreduce job under the hood.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>
>>>>> Thanks to both of you!
>>>>>
>>>>> Rahul
>>>>>
>>>>>
>>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>>
>>>>>> you can do that using file:///
>>>>>>
>>>>>> example:
>>>>>>
>>>>>>
>>>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>>>
>>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>
>>>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>>>> used to upload files from local to HDFS?
>>>>>>>
>>>>>>> Isn't distcp an MR job? Wouldn't it need the data to be already
>>>>>>> present in Hadoop's fs?
>>>>>>>
>>>>>>>  Rahul
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> You're welcome :)
>>>>>>>>
>>>>>>>> Warm Regards,
>>>>>>>> Tariq
>>>>>>>> cloudfront.blogspot.com
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Tariq!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <
>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>>
>>>>>>>>>> And the bigger the files, the less metadata, hence less memory
>>>>>>>>>> consumption.
>>>>>>>>>>
>>>>>>>>>> Warm Regards,
>>>>>>>>>> Tariq
>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> IMHO, I think the statement about the NN with regard to block
>>>>>>>>>>> metadata is more like a general statement. Even if you put lots of
>>>>>>>>>>> small files of combined size 10 TB, you need to have a capable NN.
>>>>>>>>>>>
>>>>>>>>>>> Can distcp be used to copy local-to-HDFS?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Rahul
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Absolutely right, Mohammad.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for barging in, guys. I think Nitin is talking about this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Every file and block in HDFS is treated as an object, and for
>>>>>>>>>>>>> each object around 200 B of metadata gets created. So the NN
>>>>>>>>>>>>> should be powerful enough to handle that much metadata, since it
>>>>>>>>>>>>> is all held in memory. Actually, memory is the most important
>>>>>>>>>>>>> metric when it comes to the NN.
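>>>>>>>>>>>>>
>>>>>>>>>>>>> As a rough back-of-the-envelope example of that (using the
>>>>>>>>>>>>> ~200 B figure above, which is only approximate): 10 TB stored as
>>>>>>>>>>>>> one million 10 MB files is about one million file objects plus
>>>>>>>>>>>>> one million block objects, i.e. roughly 400 MB of NN heap just
>>>>>>>>>>>>> for metadata, whereas the same 10 TB as 10,000 files of 1 GB
>>>>>>>>>>>>> (16 blocks each at 64 MB) is only around 170,000 objects, well
>>>>>>>>>>>>> under 50 MB.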
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>>>
>>>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much data
>>>>>>>>>>>>> you don't actually just do a "put". You could use something like
>>>>>>>>>>>>> "distcp" for parallel copying. A better approach would be to use
>>>>>>>>>>>>> a data aggregation tool like Flume or Chukwa, as Nitin has
>>>>>>>>>>>>> already pointed out. Facebook uses their own data aggregation
>>>>>>>>>>>>> tool, called Scribe, for this purpose.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Warm Regards,
>>>>>>>>>>>>> Tariq
>>>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The NN would still be in the picture because it will be writing
>>>>>>>>>>>>>> a lot of metadata for each individual file. So you will need an
>>>>>>>>>>>>>> NN capable of storing the metadata for your entire dataset. Data
>>>>>>>>>>>>>> will never go to the NN, but a lot of metadata about the data
>>>>>>>>>>>>>> will be on the NN, so it's always a good idea to have a strong
>>>>>>>>>>>>>> NN.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Nitin , parallel DFS writes to HDFS are great, but I could not
>>>>>>>>>>>>>>> understand the meaning of a "capable NN". As I know, the NN is
>>>>>>>>>>>>>>> not part of the actual data write pipeline, meaning the data
>>>>>>>>>>>>>>> does not travel through the NN; the DFS client contacts the NN
>>>>>>>>>>>>>>> from time to time to get the locations of the DNs where the
>>>>>>>>>>>>>>> data blocks should be stored.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is it safe? There is no direct yes or no answer.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When you say you have files worth 10 TB and you want to upload
>>>>>>>>>>>>>>>> them to HDFS, several factors come into the picture:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) Is the machine in the same network as your hadoop cluster?
>>>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And most importantly, I assume that you have a capable hadoop
>>>>>>>>>>>>>>>> cluster. By that I mean you have a capable namenode.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would definitely not write files sequentially to HDFS. I
>>>>>>>>>>>>>>>> would prefer to write files in parallel to HDFS to utilize the
>>>>>>>>>>>>>>>> DFS write features and speed up the process. You can run the
>>>>>>>>>>>>>>>> hdfs put command in parallel, and in my experience it has not
>>>>>>>>>>>>>>>> failed when we write a lot of data.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>>>>>>>>>>> pipeline.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload
>>>>>>>>>>>>>>>>> these files of size 10 TB, and is there any limit to the file
>>>>>>>>>>>>>>>>> size when using the hadoop command line? Can the hadoop put
>>>>>>>>>>>>>>>>> command line work with huge data?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> First of all, most of the companies do not get 100 PB of
>>>>>>>>>>>>>>>>>> data in one go. It's an accumulating process, and most of
>>>>>>>>>>>>>>>>>> the companies have a data pipeline in place where the data
>>>>>>>>>>>>>>>>>> is written to HDFS on a frequent basis, retained on HDFS for
>>>>>>>>>>>>>>>>>> some duration as needed, and from there sent to archivers or
>>>>>>>>>>>>>>>>>> deleted.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For data management products, you can look at Falcon, which
>>>>>>>>>>>>>>>>>> is open sourced by InMobi along with Hortonworks.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In any case, if you want to write files to HDFS there are a
>>>>>>>>>>>>>>>>>> few options available to you:
>>>>>>>>>>>>>>>>>> 1) Write your own DFS client which writes to DFS
>>>>>>>>>>>>>>>>>> 2) Use the HDFS proxy
>>>>>>>>>>>>>>>>>> 3) There is webhdfs (see the sketch below)
>>>>>>>>>>>>>>>>>> 4) The command-line hdfs tools
>>>>>>>>>>>>>>>>>> 5) Data collection tools that come with support for writing
>>>>>>>>>>>>>>>>>>    to HDFS, like Flume etc.
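>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For option 3, a minimal webhdfs sketch (host, port and path
>>>>>>>>>>>>>>>>>> are placeholders) does the two-step create: the first PUT
>>>>>>>>>>>>>>>>>> returns a redirect to a datanode, and the second PUT to that
>>>>>>>>>>>>>>>>>> Location sends the actual bytes:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/me/file.txt?op=CREATE"
>>>>>>>>>>>>>>>>>> curl -i -X PUT -T file.txt "<Location URL from the previous response>"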
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Can anyone help me understand how companies like Facebook,
>>>>>>>>>>>>>>>>>>> Yahoo etc. upload bulk files, say to the tune of 100
>>>>>>>>>>>>>>>>>>> petabytes, to a Hadoop HDFS cluster for processing, and
>>>>>>>>>>>>>>>>>>> after processing how they download those files from HDFS to
>>>>>>>>>>>>>>>>>>> the local file system?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I don't think they would be using the command line
>>>>>>>>>>>>>>>>>>> hadoop fs put to upload files, as it would take too long.
>>>>>>>>>>>>>>>>>>> Or do they divide it into, say, 10 parts of 10 petabytes
>>>>>>>>>>>>>>>>>>> each, compress them, and use the command line hadoop fs
>>>>>>>>>>>>>>>>>>> put?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Or do they use some tool to upload huge files?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Please help me.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
