spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
Date Thu, 24 Aug 2017 12:41:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139986#comment-16139986 ]

Sean Owen commented on SPARK-21797:
-----------------------------------

Sure, but in any event, this is an operation that is fine with Spark but not fine with something
between the AWS SDK and AWS. It's not something Spark can fix.

If the source data is in S3, there's no way to avoid copying it from S3. Intermediate data produced
by Spark can't live on S3, as S3 is only eventually consistent; a final result could. And
yes, you pay to read and write S3, so in some use cases it might be more economical to keep intensely
read/written data close to the compute workers for a time, rather than writing to and reading from
S3 between several closely related jobs.
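
As a rough sketch of that last point (the paths and the transformation are illustrative, and
spark is the SparkSession provided by spark-shell):

{code:java}
// Stage hot intermediate data on the cluster's HDFS instead of writing it
// back to S3 between closely related jobs. Paths are illustrative.
val intermediate = spark.read.parquet("s3://my-bucket/some-input/")
  .groupBy("dt").count()                      // stand-in for the real work

// Write once to cluster-local storage...
intermediate.write.parquet("hdfs:///tmp/pipeline-stage1")

// ...and let the next job in the chain read it from there, not from S3.
val stage1 = spark.read.parquet("hdfs:///tmp/pipeline-stage1")
{code}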

> spark cannot read partitioned data in S3 that are partly in glacier
> -------------------------------------------------------------------
>
>                 Key: SPARK-21797
>                 URL: https://issues.apache.org/jira/browse/SPARK-21797
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Boris Clémençon 
>              Labels: glacier, partitions, read, s3
>
> I have a Parquet dataset in S3, partitioned by date (dt), with the oldest dates stored in
> AWS Glacier to save some money. For instance, we have:
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/    [in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/    [in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/    [not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/    [not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet in Glacier, e.g.:
> {code:java}
> import org.apache.spark.sql.functions.col
>
> val from = "2017-07-15"
> val to   = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I get the following exception:
> {noformat}
> java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403;
> Error Code: InvalidObjectState; Request ID: C444D508B6042138)
> {noformat}
> It seems that Spark does not handle a partitioned dataset when some of its partitions are in
> Glacier. I could always read each date separately, add the dt column back by hand, and
> reduce(_ union _) at the end (sketched just below), but that is not pretty and it should not
> be necessary.
> Is there any tip for reading the data that is still available even though the old data is in Glacier?
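
For reference, a minimal sketch of that per-date workaround (the date range is illustrative, and
reading a leaf directory such as .../dt=2017-07-15 drops the dt partition column, which is why it
is re-added by hand):

{code:java}
import java.time.LocalDate
import org.apache.spark.sql.functions.lit

// Enumerate only the days assumed not to be in Glacier (illustrative range).
val days = Iterator.iterate(LocalDate.parse("2017-07-15"))(_.plusDays(1))
  .takeWhile(!_.isAfter(LocalDate.parse("2017-07-24")))
  .toSeq

// Read each partition directory directly, restore the dt column that is
// lost when reading below the partition root, then union everything.
val X = days
  .map(d => spark.read.parquet(s"s3://my-bucket/my-dataset/dt=$d")
              .withColumn("dt", lit(d.toString)))
  .reduce(_ union _)
{code}

Alternatively, building the per-day directory paths and passing them as
spark.read.option("basePath", "s3://my-bucket/my-dataset/").parquet(paths: _*) should avoid
rebuilding dt by hand, since basePath lets partition discovery infer it from the directory names.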



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

