From: Mohammad Tariq
Date: Mon, 28 Jan 2013 16:37:55 +0530
Subject: Re: Difference between HDFS and local filesystem
To: user@hadoop.apache.org

Hello Sundeep,

As Harsh has said, it doesn't make much sense to use MR with the native FS. If you really want to leverage the power of Hadoop, you should use the MR+HDFS combo, since "divide and rule" is Hadoop's strength. It is a distributed system in which each component gets its own piece of work to do in parallel with the other components, unlike the grid computing paradigm, where several machines work on the same piece together by sharing resources such as memory.
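For what it's worth, here is a minimal, untested sketch (plain org.apache.hadoop.fs API; the paths are just placeholders taken from your example) showing that the same client code resolves to the local filesystem or to HDFS purely from the path scheme and fs.defaultFS (fs.default.name on 1.x releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsSchemeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // An explicit file:// scheme resolves to LocalFileSystem:
        // no blocks, no replication, no locality information.
        Path local = new Path("file:///usr/local/ncdcinput/sample.txt");
        System.out.println(FileSystem.get(local.toUri(), conf).getClass().getSimpleName());

        // A scheme-less path falls back to whatever fs.defaultFS points at,
        // which on a real cluster is hdfs://namenode:port, giving you
        // DistributedFileSystem and all the block machinery.
        System.out.println(FileSystem.get(conf).getClass().getSimpleName());
    }
}

So MR itself doesn't care; it is the FileSystem implementation behind the path that decides whether you get replication and locality or not.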
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Sat, Jan 26, 2013 at 10:16 PM, Preethi Vinayak Ponangi
<vinayakponangi@gmail.com> wrote:

> Yes. It's possible to use your local file system instead of HDFS. As you
> said, it doesn't really matter when you are running a pseudo-distributed
> cluster. This is generally fine if your dataset is fairly small. The place
> where HDFS really shines is when your file is huge, generally several TB
> or PB. That is when individual mappers can access different partitions of
> the data on different nodes, improving performance.
>
> In fully distributed mode, your data gets partitioned and stored on
> several different nodes in HDFS.
> But when you use local data, the data is neither replicated nor
> partitioned; it's just like accessing a single file.
>
> On Sat, Jan 26, 2013 at 9:49 AM, Sundeep Kambhampati
> <kambhamp@cse.ohio-state.edu> wrote:
>
>> Hi Users,
>> I am kind of new to MapReduce programming, and I am trying to understand
>> the integration between MapReduce and HDFS.
>> I understand that MapReduce can use HDFS for data access, but is it
>> possible not to use HDFS at all and still run MapReduce programs?
>> HDFS does file replication and partitioning. But if I use the following
>> command to run the example MaxTemperature
>>
>> bin/hadoop jar /usr/local/hadoop/maxtemp.jar MaxTemperature \
>>     file:///usr/local/ncdcinput/sample.txt file:///usr/local/out4
>>
>> instead of
>>
>> bin/hadoop jar /usr/local/hadoop/maxtemp.jar MaxTemperature \
>>     usr/local/ncdcinput/sample.txt usr/local/out4   ->> this will use HDFS,
>>
>> it uses local filesystem files and writes to the local filesystem when I
>> run in pseudo-distributed mode. Since it is a single node, there is no
>> problem of non-local data.
>> What happens in fully distributed mode? Will the files be copied to other
>> machines, or will it throw errors? Will the files be replicated, and will
>> they be partitioned for running MapReduce, if I use the local filesystem?
>>
>> Can someone please explain?
>>
>> Regards,
>> Sundeep
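P.S. To put Preethi's locality point in code form, here is a rough sketch (untested, assuming the Hadoop 2 mapreduce API) that prints the input splits a job would get. On HDFS a large file yields roughly one split per block, each annotated with the datanodes holding a replica of that block, which is what lets the scheduler run mappers node-local; with a file:// input there is no block layout to exploit.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // One split per HDFS block (roughly); each split knows which hosts
        // store a replica, so the scheduler can place the map task there.
        for (InputSplit split : new TextInputFormat().getSplits(job)) {
            System.out.println(split + " on " + Arrays.toString(split.getLocations()));
        }
    }
}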