hadoop-mapreduce-user mailing list archives

From Rahul Bhattacharjee <rahul.rec....@gmail.com>
Subject Re: Hadoop noob question
Date Sun, 12 May 2013 12:58:52 GMT
Yeah, you are right. I misread your earlier post.

Thanks,
Rahul


On Sun, May 12, 2013 at 6:25 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> I had said that if you use distcp to copy data *from localFS to HDFS*, then
> you won't be able to exploit parallelism, as the entire file is present on
> a single machine. So no multiple TTs.
>
> Please comment if you think I am wrong somewhere.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Sun, May 12, 2013 at 6:15 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Yes, it's an MR job under the hood. My question was about your point that
>> using distcp you lose the benefits of Hadoop's parallel processing. I think
>> the MR job of distcp divides files into individual map tasks based on the
>> total size of the transfer, so multiple mappers would still be spawned if
>> the size of the transfer is huge, and they would work in parallel.
>>
>> Correct me if there is anything wrong!
>>
>> Thanks,
>> Rahul
>>
>>
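As an illustration of the point above about distcp splitting a transfer into map tasks: the -m flag caps the number of simultaneous maps, and the paths below are only placeholders, not taken from this thread:

    hadoop distcp -m 20 hdfs://nn1:8020/data/logs hdfs://nn2:8020/backup/logs

Between two HDFS clusters, each map copies its own share of the file list, so a big transfer is spread across many nodes; when the source is a single machine's local filesystem, every map still has to read from that one machine, which is the limitation Tariq describes above.
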
>> On Sun, May 12, 2013 at 6:07 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>
>>> No. distcp is actually a mapreduce job under the hood.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Sun, May 12, 2013 at 6:00 PM, Rahul Bhattacharjee <
>>> rahul.rec.dgp@gmail.com> wrote:
>>>
>>>> Thanks to both of you!
>>>>
>>>> Rahul
>>>>
>>>>
>>>> On Sun, May 12, 2013 at 5:36 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>>
>>>>> you can do that using file:///
>>>>>
>>>>> example:
>>>>>
>>>>>
>>>>> hadoop distcp hdfs://localhost:8020/somefile file:///Users/myhome/Desktop/
>>>>>
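The same command also works in the local-to-HDFS direction asked about below; a minimal sketch, where /data/incoming and the target directory are hypothetical:

    hadoop distcp file:///data/incoming hdfs://localhost:8020/user/rahul/incoming

Note the caveat Tariq raises at the top of the thread: with a file:/// source, the whole dataset sits on one machine, so distcp's map-level parallelism adds little over a plain put.
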
>>>>> On Sun, May 12, 2013 at 5:23 PM, Rahul Bhattacharjee <
>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>
>>>>>> @Tariq can you point me to some resource which shows how distcp is
>>>>>> used to upload files from local to HDFS?
>>>>>>
>>>>>> Isn't distcp an MR job? Wouldn't it need the data to be already
>>>>>> present in Hadoop's FS?
>>>>>>
>>>>>>  Rahul
>>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 10:52 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>
>>>>>>> You're welcome :)
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 10:46 PM, Rahul Bhattacharjee <
>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Tariq!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, May 11, 2013 at 10:34 PM, Mohammad Tariq <
>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> @Rahul : Yes. distcp can do that.
>>>>>>>>>
>>>>>>>>> And, the bigger the files, the less metadata there is, and hence the
>>>>>>>>> less memory consumption.
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, May 11, 2013 at 9:40 PM, Rahul Bhattacharjee <
>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> IMHO, I think the statement about the NN with regard to block
>>>>>>>>>> metadata is more of a general statement. Even if you put lots of
>>>>>>>>>> small files of combined size 10 TB, you need to have a capable NN.
>>>>>>>>>>
>>>>>>>>>> Can distcp be used to copy local-to-HDFS?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Rahul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, May 11, 2013 at 9:35 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Absolutely right, Mohammad.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 11, 2013 at 9:33 PM, Mohammad Tariq <
>>>>>>>>>>> dontariq@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry for barging in guys. I think Nitin is talking about this:
>>>>>>>>>>>>
>>>>>>>>>>>> Every file and block in HDFS is treated as an object, and for
>>>>>>>>>>>> each object around 200 B of metadata gets created. So the NN
>>>>>>>>>>>> should be powerful enough to handle that much metadata, since it
>>>>>>>>>>>> is going to be in-memory. Actually, memory is the most important
>>>>>>>>>>>> metric when it comes to the NN.
>>>>>>>>>>>>
>>>>>>>>>>>> Am I correct @Nitin?
>>>>>>>>>>>>
>>>>>>>>>>>> @Thoihen : As Nitin has said, when you talk about that much
>>>>>>>>>>>> data you don't actually just do a "put". You could use something
>>>>>>>>>>>> like "distcp" for parallel copying. A better approach would be to
>>>>>>>>>>>> use a data aggregation tool like Flume or Chukwa, as Nitin has
>>>>>>>>>>>> already pointed out. Facebook uses their own data aggregation
>>>>>>>>>>>> tool, called Scribe, for this purpose.
>>>>>>>>>>>>
>>>>>>>>>>>> Warm Regards,
>>>>>>>>>>>> Tariq
>>>>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>>>>
>>>>>>>>>>>>
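To make the ~200 B per object figure above concrete (a rough, illustrative estimate only; the exact per-object overhead depends on the Hadoop version and on how many blocks each file has): 10 million single-block files are roughly 20 million namespace objects (one file entry plus one block entry each), and 20,000,000 x 200 B is about 4 GB of NameNode heap for metadata alone. The same total volume stored as fewer, larger files needs proportionally less NN memory, which is the point of the remark above that bigger files mean less metadata.
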
>>>>>>>>>>>> On Sat, May 11, 2013 at 9:20 PM, Nitin Pawar <
>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The NN would still be in the picture because it will be writing
>>>>>>>>>>>>> a lot of metadata for each individual file, so you will need a
>>>>>>>>>>>>> NN capable of storing the metadata for your entire dataset. Data
>>>>>>>>>>>>> will never go to the NN, but a lot of metadata about the data
>>>>>>>>>>>>> will be on the NN, so it's always a good idea to have a strong NN.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 11, 2013 at 9:11 PM, Rahul Bhattacharjee <
>>>>>>>>>>>>> rahul.rec.dgp@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Nitin , parallel DFS writes to HDFS are great, but I could not
>>>>>>>>>>>>>> understand the meaning of a capable NN. As I know, the NN would
>>>>>>>>>>>>>> not be a part of the actual data write pipeline, meaning that
>>>>>>>>>>>>>> the data would not travel through the NN; the DFS client would
>>>>>>>>>>>>>> contact the NN from time to time to get the locations of the
>>>>>>>>>>>>>> DNs where the data blocks should be stored.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Rahul
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:54 PM, Nitin Pawar <
>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is it safe? There is no direct yes or no answer.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When you say you have files worth 10 TB and you want to
>>>>>>>>>>>>>>> upload them to HDFS, several factors come into the picture:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Is the machine in the same network as your Hadoop cluster?
>>>>>>>>>>>>>>> 2) Is there a guarantee that the network will not go down?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And most importantly, I assume that you have a capable Hadoop
>>>>>>>>>>>>>>> cluster. By that I mean you have a capable NameNode.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would definitely not write files sequentially to HDFS. I
>>>>>>>>>>>>>>> would prefer to write files in parallel to HDFS to utilize the
>>>>>>>>>>>>>>> DFS write features and speed up the process. You can run the
>>>>>>>>>>>>>>> hdfs put command in a parallel manner, and in my experience it
>>>>>>>>>>>>>>> has not failed when we write a lot of data.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
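A minimal sketch of the parallel-put suggestion above, assuming the data is already laid out locally under directories matching /data/part-* and that /ingest exists in HDFS (both names are hypothetical):

    for part in /data/part-*; do
      hadoop fs -put "$part" /ingest/ &
    done
    wait

Each put is an independent DFS client with its own write pipelines, so several streams run at once; the practical limit is the uploading machine's disk and network bandwidth rather than HDFS itself.
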
>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:38 PM, maisnam ns <
>>>>>>>>>>>>>>> maisnam.ns@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Nitin Pawar , thanks for clearing my doubts.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But I have one more question: say I have 10 TB of data in the
>>>>>>>>>>>>>>>> pipeline.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is it perfectly OK to use the hadoop fs put command to upload
>>>>>>>>>>>>>>>> these files of size 10 TB, and is there any limit to the file
>>>>>>>>>>>>>>>> size using the hadoop command line? Can the hadoop put command
>>>>>>>>>>>>>>>> line work with huge data?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks in advance
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:24 PM, Nitin Pawar <
>>>>>>>>>>>>>>>> nitinpawar432@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> First of all, most of the companies do not get 100 PB of
>>>>>>>>>>>>>>>>> data in one go. It's an accumulating process, and most of the
>>>>>>>>>>>>>>>>> companies have a data pipeline in place where the data is
>>>>>>>>>>>>>>>>> written to HDFS on a frequent basis, then retained on HDFS
>>>>>>>>>>>>>>>>> for some duration as needed, and from there sent to archivers
>>>>>>>>>>>>>>>>> or deleted.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For data management products, you can look at Falcon, which
>>>>>>>>>>>>>>>>> is open sourced by InMobi along with Hortonworks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In any case, if you want to write files to HDFS there are a
>>>>>>>>>>>>>>>>> few options available to you:
>>>>>>>>>>>>>>>>> 1) Write your own DFS client which writes to DFS
>>>>>>>>>>>>>>>>> 2) Use the HDFS proxy
>>>>>>>>>>>>>>>>> 3) There is WebHDFS (see the sketch after this message)
>>>>>>>>>>>>>>>>> 4) The command line hdfs
>>>>>>>>>>>>>>>>> 5) Data collection tools like Flume etc. come with support
>>>>>>>>>>>>>>>>> for writing to HDFS
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
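For option 3 in the list above, a minimal WebHDFS sketch; it assumes WebHDFS is enabled on the cluster (dfs.webhdfs.enabled), that the NameNode's HTTP port is at its old 50070 default, and that the host names and target path are placeholders:

    # Step 1: ask the NameNode for a write location; it answers with a 307 redirect to a DataNode
    curl -i -X PUT "http://namenode-host:50070/webhdfs/v1/user/thoihen/file.txt?op=CREATE"

    # Step 2: stream the file contents to the Location URL returned in step 1
    curl -i -X PUT -T file.txt "<location-from-step-1>"

WebHDFS goes over plain HTTP, so it is handy when the uploading machine cannot run a full Hadoop client; for bulk volumes, the aggregation tools mentioned in the thread (Flume, Chukwa, Scribe) are usually the better fit.
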
>>>>>>>>>>>>>>>>> On Sat, May 11, 2013 at 4:19 PM, Thoihen Maibam <
>>>>>>>>>>>>>>>>> thoihen123@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can anyone help me understand how companies like Facebook,
>>>>>>>>>>>>>>>>>> Yahoo, etc. upload bulk files, say to the tune of 100
>>>>>>>>>>>>>>>>>> petabytes, to a Hadoop HDFS cluster for processing, and
>>>>>>>>>>>>>>>>>> after processing how they download those files from HDFS to
>>>>>>>>>>>>>>>>>> the local file system?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't think they would be using the command line hadoop
>>>>>>>>>>>>>>>>>> fs put to upload files, as it would take too long. Or do
>>>>>>>>>>>>>>>>>> they divide it into, say, 10 parts of 10 petabytes each,
>>>>>>>>>>>>>>>>>> compress them, and use the command line hadoop fs put?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Or do they use some other tool to upload huge files?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please help me.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> thoihen
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>
>>
>
