Date: Mon, 13 Oct 2014 04:30:33 +0000 (UTC)
From: "Steve Loughran (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-6127) SequenceFile crashes with encrypted files that are shorter than FileSystem.getStatus(path)

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168940#comment-14168940 ]

Steve Loughran commented on MAPREDUCE-6127:
-------------------------------------------

While there's certainly scope for improving resilience, "a file is as long as it says it is" is called out as [an invariant of any Hadoop-compatible filesystem|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md]:

bq. the size of the data stream equals the size of the file as returned by FileSystem.getFileStatus(Path p)

Changing that breaks a lot. There's an assumption that {{seek(length - 1); read()}} is valid for all lengths > 0, and things like splitting files are built on the assumption that there is a 1:1 mapping between offsets and data. That is why HDFS at-rest encryption reports a different file length to the caller than the actual stored length: clients see how many bytes they can read, not how many are on disk.

What is doing the encryption here? Is this Amazon's own S3 encryption?
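To make that invariant concrete, here is a minimal sketch of it as a runnable check ({{LengthInvariantCheck}} is an illustrative name, not a Hadoop API). On the padded encrypted stream described in this issue, the seek-to-last-byte read would already return -1 and the check would fail; the HDFS at-rest encryption approach keeps the invariant by reporting the plaintext length from {{getFileStatus()}}, so this check still passes there.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative check of the fsdatainputstream.md invariant: the stream
// must expose exactly getFileStatus(path).getLen() bytes, so
// seek(len - 1); read() must return a byte and the following read()
// must return EOF. (Class name is illustrative, not part of Hadoop.)
public class LengthInvariantCheck {
  public static void check(FileSystem fs, Path path) throws IOException {
    long len = fs.getFileStatus(path).getLen();
    try (FSDataInputStream in = fs.open(path)) {
      if (len > 0) {
        in.seek(len - 1);                 // position at the last declared byte
        if (in.read() == -1) {
          // Exactly the failure mode in this report: the decrypting
          // stream is shorter than the declared length.
          throw new IOException("stream shorter than declared length " + len);
        }
      }
      if (in.read() != -1) {
        throw new IOException("stream longer than declared length " + len);
      }
    }
  }
}
{code}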
> SequenceFile crashes with encrypted files that are shorter than FileSystem.getStatus(path)
> ------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6127
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6127
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>         Environment: Amazon EMR 3.0.4
>            Reporter: Corby Wilson
>
> Encrypted files are often padded so that encryption lands on a cipher-block (2^n-bit) boundary. As a result, the encrypted file can be a few bytes bigger than the unencrypted file. We have a case where an encrypted file is 2 bytes bigger due to padding.
> When we run a Hive job on the file to get a record count (select count(*) from <table>), it runs org.apache.hadoop.mapred.SequenceFileRecordReader and loads the file through a custom FS InputStream. The InputStream decrypts the file as it is read in; splits are handled properly because it implements both Seekable and PositionedReadable.
> When the org.apache.hadoop.io.SequenceFile class initializes, it reads the file size from the FileMetadata, which returns the size of the encrypted file on disk (or, in this case, in S3). However, the actual file size is 2 bytes less, so the InputStream returns EOF (-1) before the SequenceFile thinks it's done.
> As a result, SequenceFile$Reader tries to run next->readRecordLength after the file has been closed, and we get a crash.
> The SequenceFile class SHOULD instead pay attention to the EOF marker from the stream, rather than the file size reported in the metadata, and set the 'more' flag accordingly.
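A minimal sketch of the behaviour the reporter suggests (not an actual SequenceFile patch; {{EofTolerantReader}} and its fields are stand-ins for the internals of SequenceFile.Reader): treat a clean EOF from the stream as end of file rather than trusting the metadata length.

{code:java}
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Sketch of the suggested behaviour, not an actual SequenceFile patch:
// when the underlying (decrypting) stream hits EOF at a record
// boundary, clear the 'more' flag instead of propagating EOFException,
// even if the metadata length says bytes remain.
class EofTolerantReader {
  private final DataInputStream in; // stand-in for the reader's input stream
  private boolean more = true;      // mirrors SequenceFile.Reader's 'more' flag

  EofTolerantReader(DataInputStream in) { this.in = in; }

  /** Next record length, or -1 once the stream is exhausted. */
  int readRecordLength() throws IOException {
    try {
      return in.readInt(); // the readInt() that throws in the stack trace below
    } catch (EOFException e) {
      more = false;        // padded/encrypted stream ended early: finish cleanly
      return -1;
    }
  }

  boolean hasMore() { return more; }
}
{code}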
> Sample stack dump from crash:
>
> 2014-10-10 21:25:27,160 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.io.IOException: java.io.EOFException
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:304)
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:220)
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:433)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
> Caused by: java.io.IOException: java.io.EOFException
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
> at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
> at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
> at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
> at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
> at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:302)
> ... 11 more
> Caused by: java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at org.apache.hadoop.io.SequenceFile$Reader.readRecordLength(SequenceFile.java:2332)
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2363)
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2500)
> at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
> at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
> ... 15 more

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)