Subject: Re: Sparkpipeline hit credentials issue when trying to write to S3
From: Yan Yang
To: user@crunch.apache.org
Date: Mon, 4 Jan 2016 12:22:17 -0800

Hi Jeff,

We are using s3n://bucket/path

Thanks
Yan

On Mon, Jan 4, 2016 at 12:19 PM, Jeff Quinn wrote:
> Hey Yan,
>
> Just a hunch, but from that stack trace it looks like you might be using the
> outdated s3 Hadoop filesystem. Is the URL you are trying to write to of the
> form s3://bucket/path or s3n://bucket/path?
>
> Thanks!
>
> Jeff
>
> On Mon, Jan 4, 2016 at 12:15 PM, Yan Yang wrote:
>> Hi,
>>
>> I have tried to set up a SparkPipeline to run within AWS EMR.
>>
>> The code is as below:
>>
>> SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi");
>> JavaSparkContext jsc = new JavaSparkContext(sparkConf);
>> SparkPipeline pipeline = new SparkPipeline(jsc, "spark-app");
>>
>> PCollection<Input> input = pipeline.read(From.avroFile(inputPaths, Input.class));
>> PCollection<Output> output = process(input);
>> pipeline.write(output, To.avroFile(outputPath));
>>
>> The read works, and a simple Spark write such as calling saveAsTextFile()
>> on an RDD object also works.
>>
>> However, a write using pipeline.write() hits the exception below. I have tried
>> to set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey in sparkConf
>> with the same result:
>>
>> java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
>>     at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)
>>     at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:80)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>     at org.apache.hadoop.fs.s3native.$Proxy9.initialize(Unknown Source)
>>     at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:326)
>>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2644)
>>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>>     at org.apache.avro.mapred.FsInput.<init>(FsInput.java:37)
>>     at org.apache.crunch.types.avro.AvroRecordReader.initialize(AvroRecordReader.java:54)
>>     at org.apache.crunch.impl.mr.run.CrunchRecordReader.initialize(CrunchRecordReader.java:150)
>>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:153)
>>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:124)
>>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> Thanks
>> Yan
>>
>
>
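[Editor's note] The exception above comes from the s3n filesystem, which looks for the two fs.s3n.* keys on the Hadoop Configuration that is handed to the input/output formats, not on the SparkConf. The snippet below is an illustrative sketch only, not a fix confirmed on this thread; it assumes the credentials are available in the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, and it reuses the hypothetical names from Yan's snippet (Input, Output, inputPaths, outputPath, process).

import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: set the s3n keys on the Hadoop Configuration objects actually used at
// runtime, rather than only as entries in the SparkConf.
SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);

// Assumption: credentials are supplied via the standard AWS environment variables.
String accessKey = System.getenv("AWS_ACCESS_KEY_ID");
String secretKey = System.getenv("AWS_SECRET_ACCESS_KEY");

// Configuration held by the Spark context (used when Spark itself touches S3).
jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", accessKey);
jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", secretKey);

SparkPipeline pipeline = new SparkPipeline(jsc, "spark-app");

// Configuration held by the Crunch pipeline and passed to its input/output
// formats; this is not necessarily the same object as jsc.hadoopConfiguration().
Configuration conf = pipeline.getConfiguration();
conf.set("fs.s3n.awsAccessKeyId", accessKey);
conf.set("fs.s3n.awsSecretAccessKey", secretKey);

// Reads and writes then proceed as in the original snippet, e.g.
// pipeline.read(From.avroFile(inputPaths, Input.class)) and
// pipeline.write(output, To.avroFile(outputPath)), followed by pipeline.done().

The exception message also mentions embedding the key pair directly in the URL (s3n://ACCESS_KEY:SECRET_KEY@bucket/path), though that tends to leak credentials into logs. On EMR, the s3:// scheme is handled by EMRFS, which can pick up the cluster's instance-profile credentials; whether that applies to this setup is not confirmed in the thread.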