hadoop-common-issues mailing list archives

From "David Rosenstrauch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10037) s3n read truncated, but doesn't throw exception
Date Mon, 14 Apr 2014 15:14:17 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968425#comment-13968425 ]

David Rosenstrauch commented on HADOOP-10037:
---------------------------------------------

FYI, I recently upgraded our clusters (from CDH 4.3.0 / Hadoop to a newer release) and it looks
like this issue might now be solved.  I'm seeing some of the tasks of our Hadoop jobs failing, as
they should, with the following wrong-number-of-bytes-read exception, which then forces a retry
of the task.

{code}
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message
body (expected: 346403598; received: 15815108)
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:184)
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:204)
	at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:108)
	at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
	at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:237)
	at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:186)
	at org.apache.http.util.EntityUtils.consume(EntityUtils.java:87)
	at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
	at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
	at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
	at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
	at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
	at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:102)
	at com.macrosense.mapreduce.io.PingRecordReader.initialize(PingRecordReader.java:80)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.Child.main(Child.java:262)
{code}

Looks like this fix (in ContentLengthInputStream and/or EofSensorInputStream) was added to
Apache HTTP Components and/or jets3t some time in the past few months.
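
The idea behind such a check is straightforward: a stream wrapper that knows the declared
Content-Length can throw when the connection closes early, instead of silently returning EOF.
A rough sketch of that idea follows (this is not the actual HttpComponents or jets3t source;
the class name and details are hypothetical):

{code}
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Illustration only: a stream wrapper that tracks bytes delivered against the
 * declared Content-Length and fails loudly if the body is truncated.
 */
public class LengthCheckedInputStream extends FilterInputStream {
    private final long expected;   // bytes promised by the Content-Length header
    private long received = 0;     // bytes actually delivered so far

    public LengthCheckedInputStream(InputStream in, long expected) {
        super(in);
        this.expected = expected;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b == -1) {
            checkComplete();
        } else {
            received++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n == -1) {
            checkComplete();
        } else {
            received += n;
        }
        return n;
    }

    /** Throw instead of silently reporting EOF when the body was truncated. */
    private void checkComplete() throws IOException {
        if (received < expected) {
            throw new EOFException("Premature end of Content-Length delimited message body"
                    + " (expected: " + expected + "; received: " + received + ")");
        }
    }
}
{code}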

> s3n read truncated, but doesn't throw exception 
> ------------------------------------------------
>
>                 Key: HADOOP-10037
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10037
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.0.0-alpha
>         Environment: Ubuntu Linux 13.04 running on Amazon EC2 (cc2.8xlarge)
>            Reporter: David Rosenstrauch
>         Attachments: S3ReadFailedOnTruncation.html, S3ReadSucceeded.html
>
>
> For months now we've been experiencing frequent data truncation issues when reading from S3
using the s3n:// protocol.  I finally was able to gather some debugging output on the issue in
a job I ran last night, and so can finally file a bug report.
> The job I ran last night was on a 16-node cluster (all of them AWS EC2 cc2.8xlarge machines,
running Ubuntu 13.04 and Cloudera CDH4.3.0).  The job was a Hadoop streaming job, which reads
through a large number (i.e., ~55,000) of files on S3, each of them approximately 300K bytes
in size.
> All of the files contain 46 columns of data in each record.  I added an extra check in my
mapper code to count and verify the number of columns in every record, throwing an error and
crashing the map task if the column count is wrong.  (A sketch of such a check appears after
this quoted description.)
> If you look in the attached task logs, you'll see 2 attempts on the same task.  The first
one fails due to truncated data (i.e., my job intentionally fails the map task because the
current record fails the column count check).  The task then gets retried on a different
machine and runs to successful completion.
> You can see further evidence of the truncation lower down in the task logs, where the count
of records read is displayed:  the failed task reports 32953 records read, while the successful
task reports 63133.
> Any idea what the problem might be here and/or how to work around it?  This issue is a very
common occurrence on our clusters.  E.g., in the job I ran last night I had already encountered
8 such failures before going to bed, and the job was only 10% complete (~25,000 out of ~250,000
tasks).
> I realize that it's common for I/O errors to occur - possibly even frequently - in a large
Hadoop job.  But I would think that if an I/O failure (like a truncated read) did occur,
something in the underlying infrastructure code (i.e., either in NativeS3FileSystem or in
jets3t) should detect the error and throw an IOException accordingly.  It shouldn't be up to
the calling code to detect such failures, IMO.
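
For reference, the column-count guard mentioned in the description above might look roughly
like the following.  This is a hypothetical illustration only: the actual job was a streaming
job, so its real mapper wasn't Java, and the tab delimiter, expected column count handling, and
class name are assumptions.

{code}
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical sketch of a column-count guard: refuse to process a record
 * whose field count is wrong, so a truncated read fails the task attempt
 * (and gets retried) instead of silently producing bad output.
 */
public class ColumnCountMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private static final int EXPECTED_COLUMNS = 46;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumes tab-delimited records; -1 keeps trailing empty fields.
        String[] fields = value.toString().split("\t", -1);
        if (fields.length != EXPECTED_COLUMNS) {
            // Fail the task attempt; the framework will retry it elsewhere.
            throw new IOException("Expected " + EXPECTED_COLUMNS + " columns but got "
                    + fields.length + " at offset " + key.get());
        }
        context.write(value, NullWritable.get());
    }
}
{code}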



--
This message was sent by Atlassian JIRA
(v6.2#6252)
