From: Kai Voigt <k@123.org>
Subject: Re: Importing Data into HDFS
To: user@hadoop.apache.org
Date: Wed, 29 Aug 2012 23:03:30 +0200

Hello,

On 29.08.2012, at 22:58, Steve Sonnenberg wrote:

> Is there any way to import data into HDFS without copying it in? (kind of like by reference)
> I'm pretty sure the answer to this is no.
>
> What I'm looking for is something that will take existing NFS data and access it as an HDFS filesystem.
> Use case: I have existing data in a warehouse that I would like to run MapReduce etc. on without copying it into HDFS.
>
> If the data were in S3, could I run MapReduce on it?

Hadoop has a filesystem abstraction layer that supports many physical filesystem implementations: HDFS of course, but also the local filesystem, S3, FTP, and others.

You simply lose data locality if you're running MapReduce on data that is, well, not local to where it's being processed.

With data stored in S3, a common solution is to fire up an EMR (Elastic MapReduce) cluster inside Amazon's datacenter to work on your S3 data. It's not real data locality, but at least the processing happens in the same data center as your data. And once you're done processing the data, you can take down the EMR cluster.

Kai

-- 
Kai Voigt
k@123.org
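To make the filesystem abstraction concrete: a path in Hadoop is a URI, and the scheme (hdfs://, file://, s3n://) selects which FileSystem implementation is used, so the same shell commands and jobs can read data without first copying it into HDFS. This is only a sketch under assumptions: the host, bucket, and paths below are made-up placeholders, and it uses the s3n:// scheme and credential properties of the Hadoop 1.x era this thread dates from (these commands need a configured Hadoop installation to actually run).

```
# Same command, different backing filesystems -- the URI scheme picks the implementation:
hadoop fs -ls hdfs://namenode:8020/warehouse/
hadoop fs -ls file:///mnt/nfs/warehouse/       # an NFS mount read in place, no copy into HDFS
hadoop fs -ls s3n://my-bucket/warehouse/

# A MapReduce job can likewise take an S3 input and write its output to HDFS (or vice versa):
hadoop jar myjob.jar MyJob s3n://my-bucket/input/ hdfs://namenode:8020/output/
```

For s3n:// to work, the AWS credentials go into core-site.xml via the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties. Note the caveat above still applies: reading file:// or s3n:// inputs this way gives up data locality, so the tasks pull every byte over the network.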