hadoop-common-user mailing list archives

From Elliott Clark <elliott.neil.cl...@gmail.com>
Subject Re: S3-->HDFS -->MR or S3-->MR ?
Date Wed, 20 Jun 2012 18:08:09 GMT
It's faster to run directly on S3; we saw a 20% improvement, but that's very
dependent on the data set. That assumes your job only runs over the data
once. If you ever run two jobs over the data, then pulling it into HDFS first
is the perf winner.
Elliott Clark
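
The one-pass versus multi-pass trade-off above can be sketched as a small cost model. This is only an illustration: the function names and all three timings are hypothetical, chosen so that a single direct-on-S3 job comes out 20% faster than copy-then-run, and they are not measurements from this thread. It also assumes every job reads the full input once and that the copy is paid only once.

```python
def total_time(jobs, read_time, one_time_copy=0.0):
    """Total wall-clock time for `jobs` passes over the same input."""
    return one_time_copy + jobs * read_time


def break_even_jobs(s3_job, hdfs_job, copy):
    """Smallest job count at which copy-then-HDFS is no slower than direct S3.

    Returns None when HDFS jobs are not faster than S3 jobs, because the
    one-time copy can then never pay for itself.
    """
    if hdfs_job >= s3_job:
        return None
    n = 1
    while total_time(n, hdfs_job, copy) > total_time(n, s3_job):
        n += 1
    return n


# Hypothetical timings in minutes (assumptions, not data from the thread).
S3_JOB = 60.0    # one job reading straight from S3
HDFS_JOB = 40.0  # the same job reading from HDFS
COPY = 35.0      # one-time S3 -> HDFS copy

for n in (1, 2, 3):
    print(n, total_time(n, S3_JOB), total_time(n, HDFS_JOB, COPY))
print("break-even at", break_even_jobs(S3_JOB, HDFS_JOB, COPY), "jobs")
```

With these made-up numbers a single job favors reading from S3 directly, and the copy into HDFS is ahead from the second job onward. Real break-even points depend on the data set and cluster, so the timings should be replaced with your own measurements.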

On Tue, Jun 19, 2012 at 9:56 PM, Yang <teddyyyy123@gmail.com> wrote:

> On the other hand, I found that Hadoop commands work with the S3 file system
> naturally, so we could let our MR jobs consume S3 directly and write their
> output directly to S3. Are there any speed/performance implications? A rough
> guess is that accessing S3 directly will probably save a little, but not
> much, since either a separate copy or direct consumption has to go through
> the same pipe first.
