Date: Thu, 24 Aug 2017 12:41:00 +0000 (UTC)
From: "Sean Owen (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.13096426.1503322550000.111150.1503578460478@Atlassian.JIRA>
In-Reply-To: <JIRA.13096426.1503322550000@Atlassian.JIRA>
References: <JIRA.13096426.1503322550000@Atlassian.JIRA> <JIRA.13096426.1503322550515@jira-lw-us.apache.org>
Subject: [jira] [Commented] (SPARK-21797) spark cannot read partitioned data
 in S3 that are partly in glacier
archived-at: Thu, 24 Aug 2017 12:41:06 -0000


    [ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139986#comment-16139986 ]

Sean Owen commented on SPARK-21797:
-----------------------------------

Sure, but in any event, this operation is fine as far as Spark is
concerned; it fails somewhere between the AWS SDK and AWS. It's not
something Spark can fix.

If the source data is in S3, there's no way to avoid copying it from S3.
Intermediate data produced by Spark can't live on S3, because S3 is only
eventually consistent. Some final result could. And yes, you pay to read
from and write to S3, so in some use cases it might be more economical to
keep heavily read/written data close to the compute workers for a time,
rather than writing to and reading from S3 between several closely related
jobs.

> spark cannot read partitioned data in S3 that are partly in glacier
> -------------------------------------------------------------------
>
>                 Key: SPARK-21797
>                 URL: https://issues.apache.org/jira/browse/SPARK-21797
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Boris Clémençon
>              Labels: glacier, partitions, read, s3
>
> I have a Parquet dataset in S3 partitioned by date (dt), with the oldest dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/    [in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/    [in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/    [not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/    [not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet in Glacier, e.g.:
> {code:java}
> import org.apache.spark.sql.functions.col
>
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I get the following exception:
> {noformat}
> java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
> {noformat}
> It seems that Spark does not like partitioned datasets when some partitions are in Glacier. I could always read each date specifically, add the dt column with the corresponding date, and reduce(_ union _) at the end, but that is not pretty and it should not be necessary.
> Is there any tip for reading the available data in the datastore even when old data is in Glacier?
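
The per-date workaround mentioned in the description could start from an explicit list of partition directories instead of the dataset root. The following is only a minimal sketch under stated assumptions: `PartitionPaths` and `partitionPaths` are hypothetical names that do not appear in the issue, and the directory layout is assumed to be exactly `dt=YYYY-MM-DD/` with one directory per day.

```scala
import java.time.LocalDate

// Hypothetical helper, not part of Spark or of the original report.
object PartitionPaths {
  // Builds one "dt=YYYY-MM-DD/" directory path per day in [from, to],
  // both bounds inclusive. Dates are expected in ISO format.
  def partitionPaths(base: String, from: String, to: String): Seq[String] = {
    val start = LocalDate.parse(from)
    val end = LocalDate.parse(to)
    // Make sure the base path ends with exactly one separator.
    val root = if (base.endsWith("/")) base else base + "/"
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(day => !day.isAfter(end))
      .map(day => s"${root}dt=$day/") // LocalDate.toString is ISO yyyy-MM-dd
      .toList
  }
}
```

Under the same assumptions, the resulting list could be passed to `spark.read.option("basePath", path).parquet(paths: _*)`, where the `basePath` option lets partition discovery keep `dt` as a column. Directories that do not exist would have to be filtered out first, and whether this avoids the InvalidObjectState error on a given setup has not been verified here.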

