hadoop-common-issues mailing list archives

From "Matt Foley (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-16109) Parquet reading S3AFileSystem causes EOF
Date Fri, 01 Mar 2019 00:49:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-16109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781137#comment-16781137 ]

Matt Foley edited comment on HADOOP-16109 at 3/1/19 12:48 AM:
--------------------------------------------------------------

Yes, I'm thinking that at [https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L264] we need

{{ && diff < forwardSeekLimit; }} instead of {{ <= }}

What do you think?
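
For reference, here's a tiny self-contained sketch of the off-by-one as I understand it; the variable names mirror {{seekInStream()}} in S3AInputStream, but the class and the values are made up for illustration, not the actual source:

{code:java}
// Hypothetical illustration of the proposed guard change; names mirror
// S3AInputStream.seekInStream(), but this is a sketch, not the real code.
public class SeekGuardSketch {
  public static void main(String[] args) {
    long remainingInCurrentRequest = 1024;  // bytes left in the open S3 request
    long forwardSeekLimit = remainingInCurrentRequest;  // min(remaining, seek range)
    long diff = 1024;  // forward seek landing exactly at the end of the request

    // Current guard: the seek is absorbed into the already-open stream,
    // leaving it positioned exactly at its end, so the next read() hits EOF.
    boolean skipsCurrent = remainingInCurrentRequest > 0 && diff <= forwardSeekLimit;

    // Proposed guard: a seek to exactly the end no longer qualifies, so the
    // stream would be closed and reopened at the new position instead.
    boolean skipsProposed = remainingInCurrentRequest > 0 && diff < forwardSeekLimit;

    System.out.println("current  (<=) skips in-stream: " + skipsCurrent);   // true
    System.out.println("proposed (<)  skips in-stream: " + skipsProposed);  // false
  }
}
{code}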

The big question I have: some of the text description talks about "reading past the already active readahead range", i.e., past {{remainingInCurrentRequest}}, as being a problem, but it seems to me that should be okay. The problem documented so far is *seeking* past {{remainingInCurrentRequest}} (specifically, to exactly the end of the current request, which is incorrectly guarded by the above L264 inequality) and then not closing the stream, which is what causes the failure. Do you know if, say, seeking to a few bytes before the end of the current request and then reading past it (when the S3 file does indeed have more to read) also causes an EOF, or does the stream machinery handle that case correctly?
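
Concretely, the probe I have in mind is something like the following (modeled on the reporter's example further down; {{path}} is assumed to name an S3 object of at least 1030 bytes):

{code:java}
// Hypothetical probe: seek to just before the end of the current request,
// then read across it. Setup copied from the reporter's example below.
final Configuration conf = new Configuration();
conf.set("fs.s3a.readahead.range", "1K");
conf.set("fs.s3a.experimental.input.fadvise", "random");
final FileSystem fs = FileSystem.get(path.toUri(), conf);

try (FSDataInputStream in = fs.open(path)) {
  final byte[] temp = new byte[10];
  in.readByte();            // opens a request covering roughly bytes 0..1023
  in.readFully(1020, temp); // the seek stays inside the request, but the read
                            // crosses its end: EOF too, or handled correctly?
}
{code}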

I'm putting together a test platform so I can answer such questions myself, but it will take me a few hours; I haven't worked in s3a before.

> Parquet reading S3AFileSystem causes EOF
> ----------------------------------------
>
>                 Key: HADOOP-16109
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16109
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.9.2, 2.8.5, 3.3.0, 3.1.2
>            Reporter: Dave Christianson
>            Assignee: Steve Loughran
>            Priority: Blocker
>
> When using S3AFileSystem to read Parquet files, a specific set of circumstances causes an EOFException that is not thrown when reading the same file from local disk.
> Note this has only been observed under specific circumstances:
>  - when the reader is doing a projection (which causes it to do a backwards seek and puts the filesystem into random mode)
>  - when the file is larger than the readahead buffer size
>  - when the seek behavior of the Parquet reader causes it to seek towards the end of the current input stream without reopening, such that the next read on the currently open stream will read past the end of that stream
> The exception from the Parquet reader is as follows:
> {code}
> Caused by: java.io.EOFException: Reached the end of stream with 51 bytes left to read
>  at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>  at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>  at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>  at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>  at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
>  at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>  at org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormatBase.fetchNext(HadoopInputFormatBase.java:206)
>  at org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormatBase.reachedEnd(HadoopInputFormatBase.java:199)
>  at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:190)
>  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> The following example program generates the same root behavior (sans finding a Parquet file that happens to trigger this condition) by purposely reading past the already active readahead range on any file >= 1029 bytes in size (the second read below needs bytes 1024-1028 to exist).
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // `path` is a Path naming an S3 object of at least 1029 bytes
> final Configuration conf = new Configuration();
> conf.set("fs.s3a.readahead.range", "1K");
> conf.set("fs.s3a.experimental.input.fadvise", "random");
> final FileSystem fs = FileSystem.get(path.toUri(), conf);
>
> // forward seek reading across the readahead boundary
> try (FSDataInputStream in = fs.open(path)) {
>     final byte[] temp = new byte[5];
>     in.readByte();
>     in.readFully(1023, temp); // <-- works
> }
>
> // forward seek reading from the end of the readahead boundary
> try (FSDataInputStream in = fs.open(path)) {
>     final byte[] temp = new byte[5];
>     in.readByte();
>     in.readFully(1024, temp); // <-- throws EOFException
> }
> {code}
>  


