spark-issues mailing list archives

From Boris Clémençon (JIRA) <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
Date Thu, 24 Aug 2017 12:38:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139978#comment-16139978 ]

Boris Clémençon  edited comment on SPARK-21797 at 8/24/17 12:37 PM:
--------------------------------------------------------------------

Hi Steve,

To be sure we understand each other: *I don't want to read data from Glacier*. Concretely, I have a dataset in Parquet, partitioned by date in S3, with an automatic lifecycle rule that freezes the oldest dates into Glacier (and, a few months later, deletes them altogether). I want to read only the most recent dates that are still in S3 (in a lazy way), not the ones in Glacier (see the example above), but even that I cannot do. Do we understand each other?
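
For the record, the workaround I am using for now is to enumerate the partition directories for the wanted date range myself and pass them explicitly to the reader, so Spark never lists the Glacier-archived prefixes. A minimal sketch (bucket name and dates are illustrative, and it assumes every date in the range actually exists in S3; the {{basePath}} option keeps {{dt}} as a partition column):

{code:java}
import java.time.LocalDate
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Illustrative values: adjust bucket and range to the partitions still in S3
val base = "s3://my-bucket/my-dataset"
val from = LocalDate.parse("2017-07-10")
val to   = LocalDate.parse("2017-07-24")

// Build one path per day in [from, to]; LocalDate.toString is ISO (yyyy-MM-dd)
val paths = Iterator.iterate(from)(_.plusDays(1))
  .takeWhile(!_.isAfter(to))
  .map(d => s"$base/dt=$d")
  .toSeq

// basePath makes Spark treat dt as a partition column while only the
// listed (non-Glacier) directories are ever touched
val df = spark.read.option("basePath", base).parquet(paths: _*)
{code}

This avoids the 403 InvalidObjectState, but it is exactly the kind of manual enumeration that partition discovery is supposed to make unnecessary.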

Besides, why do you say that it is a niche use case? Reading partitioned data on S3 seems quite normal to me.

And why do you say that reading data from S3 is a "very, very expensive way to work with data"? According to our tests, reading from S3 is at most 20% slower than reading from HDFS, and we operate from within AWS with an EMR cluster, so we should not pay for data IO out of S3. On the other hand, copying the dataset to HDFS has a time overhead, and you need a cluster with enough disk to store the whole dataset, or at least the relevant dates (whereas you may want to process only a few columns, i.e. a fraction of the initial dataset). I would like your expertise on that.

In any case, I do understand your and Sean's argument, though, which says that it is up to AWS to solve the problem.



> spark cannot read partitioned data in S3 that are partly in glacier
> -------------------------------------------------------------------
>
>                 Key: SPARK-21797
>                 URL: https://issues.apache.org/jira/browse/SPARK-21797
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Boris Clémençon 
>              Labels: glacier, partitions, read, s3
>
> I have a dataset in Parquet in S3, partitioned by date (dt), with the oldest dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/    [in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/    [in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/    [not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/    [not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that is not yet in Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I get the exception
> {noformat}
> java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
> {noformat}
> It seems that Spark does not like a partitioned dataset when some partitions are in Glacier. I could always read each date specifically, add the dt column, and reduce(_ union _) at the end, but that is not pretty and it should not be necessary.
> Is there any way to read the available data in the datastore even with old data in Glacier?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

