hadoop-common-issues mailing list archives

From "Venkata Puneet Ravuri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11270) Seek behavior difference between NativeS3FsInputStream and DFSInputStream
Date Wed, 05 Nov 2014 19:00:35 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198868#comment-14198868 ]

Venkata Puneet Ravuri commented on HADOOP-11270:
------------------------------------------------

Thanks for your input!

[~stevel@apache.org], my responses:
1. I am currently using Hadoop 2.5.1.
2. I am calling seek(len(file)) (a minimal snippet is below).
3. No, the file size is more than 1MB.
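To reproduce, this is roughly what I am running (the bucket and key names are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SeekToEof {
      public static void main(String[] args) throws Exception {
        // Any s3n file works; ours are larger than 1MB.
        Path path = new Path("s3n://my-bucket/data/sample.rc");
        FileSystem fs = path.getFileSystem(new Configuration());
        long len = fs.getFileStatus(path).getLen();
        FSDataInputStream in = fs.open(path);
        try {
          in.seek(len);  // fails with EOFException on s3n; succeeds on HDFS
        } finally {
          in.close();
        }
      }
    }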

I understand that behavior can differ across file systems, but I believe seek(<length of file>) should be supported by s3n as well.
I have noticed that the seek() method in NativeS3FsInputStream creates a new input stream by performing a getObject() starting from the seek position. This fails when the seek position equals the length of the file. Instead we could do this:-
a. If the new seek position is greater than the current position of the stream, skip the difference in the underlying input stream.
b. If the new seek position is less than the current position of the stream, open a new input stream starting from that position.
I tested this change and it works (sketch below). Please let me know your thoughts.
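Roughly, what I tested looks like this (written against the 2.5.x NativeS3FsInputStream; the field names in, store, key and pos follow the existing class, so treat this as an outline rather than the final patch):

    @Override
    public synchronized void seek(long newPos) throws IOException {
      if (newPos == pos) {
        return;                        // already at the requested position
      }
      if (newPos > pos) {
        // a. Forward seek: skip the difference in the underlying stream
        //    instead of issuing a new ranged getObject(); seek(len(file))
        //    then simply drains the stream instead of failing.
        long remaining = newPos - pos;
        while (remaining > 0) {
          long skipped = in.skip(remaining);
          if (skipped <= 0) {
            break;                     // reached end of stream
          }
          remaining -= skipped;
        }
      } else {
        // b. Backward seek: reopen the object starting at the new position.
        in.close();
        in = store.retrieve(key, newPos);
      }
      pos = newPos;
    }

With this, BufferedFSInputStream.skip() (which Hive's skipBytes() ends up calling, per the stack trace below) no longer triggers a ranged GET at offset == file length.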

One impact of the current behavior is that Hive reads of RCFiles stored in S3 fail when Hive tries to skip columns by issuing skipBytes() on this input stream.


> Seek behavior difference between NativeS3FsInputStream and DFSInputStream
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-11270
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11270
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>            Reporter: Venkata Puneet Ravuri
>            Assignee: Venkata Puneet Ravuri
>
> There is a difference in behavior when seeking within a file present in S3 using
> NativeS3FileSystem$NativeS3FsInputStream versus a file present in HDFS using DFSInputStream.
> If we seek to the end of the file in the case of NativeS3FsInputStream, it fails with the exception
> "java.io.EOFException: Attempted to seek or read past the end of the file". That is because
> a getObject request is issued on the S3 object with the range start set to the length of the file.
> This is the complete exception stack:-
> Caused by: java.io.EOFException: Attempted to seek or read past the end of the file
> at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:462)
> at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
> at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieve(Jets3tNativeFileSystemStore.java:234)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at org.apache.hadoop.fs.s3native.$Proxy17.retrieve(Unknown Source)
> at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:205)
> at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:96)
> at org.apache.hadoop.fs.BufferedFSInputStream.skip(BufferedFSInputStream.java:67)
> at java.io.DataInputStream.skipBytes(DataInputStream.java:220)
> at org.apache.hadoop.hive.ql.io.RCFile$ValueBuffer.readFields(RCFile.java:739)
> at org.apache.hadoop.hive.ql.io.RCFile$Reader.currentValueBuffer(RCFile.java:1720)
> at org.apache.hadoop.hive.ql.io.RCFile$Reader.getCurrentRow(RCFile.java:1898)
> at org.apache.hadoop.hive.ql.io.RCFileRecordReader.next(RCFileRecordReader.java:149)
> at org.apache.hadoop.hive.ql.io.RCFileRecordReader.next(RCFileRecordReader.java:44)
> at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:339)
> ... 15 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
