hadoop-common-issues mailing list archives

From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6208) Block loss in S3FS due to S3 inconsistency on file rename
Date Thu, 01 Oct 2009 13:17:23 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761207#action_12761207 ]

Tom White commented on HADOOP-6208:

This looks great.

> I was able to find one instance of a good read followed by a bad read over ~10,000 file
writes (about 420,000 total reads).

One possibility is to strengthen the requirement to n consecutive good reads rather than just
one, but that imposes extra S3 calls. Does the current patch bring the chance of a bad read
down to an acceptable level for you?

> note that there's no sleep between polls; this is to avoid slowing down the unit tests,
and assuming that the round-trip network latency is enough of a delay

Do you know how much of a delay this imposes in practice? I wonder whether we should have
a delay in order to be nice to S3. You could do this by adding a private configuration parameter
(e.g. fs.s3.verifyPollInterval) for the delay, which the tests set to zero.
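
As a rough illustration of the two ideas above (requiring n consecutive good reads, and sleeping between polls), here is a minimal Java sketch. It is not taken from the attached patch: the class ConsistencyPoller, the method awaitConsistent and all of its parameters are hypothetical names. In Hadoop the interval would presumably be read from the configuration under a key such as the fs.s3.verifyPollInterval suggested above, with the tests setting it to zero.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper; not part of the HADOOP-6208 patch. */
final class ConsistencyPoller {

  /**
   * Polls check until it succeeds requiredGoodReads times in a row,
   * sleeping pollIntervalMs between attempts. Returns false if
   * maxAttempts is exhausted first or the thread is interrupted.
   */
  static boolean awaitConsistent(BooleanSupplier check, int requiredGoodReads,
      long pollIntervalMs, int maxAttempts) {
    int consecutive = 0;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      if (check.getAsBoolean()) {
        consecutive++;
        if (consecutive >= requiredGoodReads) {
          return true;
        }
      } else {
        consecutive = 0; // a bad read resets the streak
      }
      if (pollIntervalMs > 0) {
        try {
          Thread.sleep(pollIntervalMs); // be nice to the remote store
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Simulated store: a good read, then a bad one, then good reads.
    boolean[] reads = {true, false, true, true};
    int[] next = {0};
    BooleanSupplier check =
        () -> reads[Math.min(next[0]++, reads.length - 1)];
    // Interval of zero, as a unit test would configure it.
    boolean ok = awaitConsistent(check, 2, 0L, 10);
    System.out.println("consistent=" + ok + " after " + next[0] + " reads");
  }
}
```

With requiredGoodReads set to 1 this reduces to single-read polling; raising it trades extra S3 calls for a lower chance of accepting a good read that is followed by a bad one.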

Also, do you know about Jets3tS3FileSystemContractTest? It's a unit test that you run manually
to test against S3, using your own credentials (in src/test/core-site.xml). It's worth running
this as a regression test.

A couple of minor nits:

* I know some other classes in Hadoop use primes in their hash code calculations, but it isn't
really necessary. Or-ing the id and length is probably sufficient. See HDFS-288.
* We generally put single-line blocks in curly braces (e.g. line 76 of EventuallyConsistentInMemoryFileSystemStore).
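
To make the two nits concrete, here is a small sketch of what a prime-free hash code and braced single-line blocks might look like. BlockKey is a hypothetical stand-in, not the class in the patch; only the id and length fields are taken from the comment above.

```java
/** Hypothetical stand-in for a block identified by id and length. */
final class BlockKey {
  private final long id;
  private final long length;

  BlockKey(long id, long length) {
    this.id = id;
    this.length = length;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true; // single-line block, still in braces
    }
    if (!(o instanceof BlockKey)) {
      return false;
    }
    BlockKey other = (BlockKey) o;
    return id == other.id && length == other.length;
  }

  @Override
  public int hashCode() {
    // OR the two fields and keep the low 32 bits; no prime multipliers.
    return (int) (id | length);
  }
}
```

Equal keys still produce equal hash codes, which is all the hashCode contract requires; how well the values spread across buckets depends on the ids in use.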

> Block loss in S3FS due to S3 inconsistency on file rename
> ---------------------------------------------------------
>                 Key: HADOOP-6208
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6208
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 0.20.0, 0.20.1
>         Environment: Ubuntu Linux 8.04 on EC2, Mac OS X 10.5, likely to affect any Hadoop
>            Reporter: Bradley Buda
>         Attachments: HADOOP-6208.patch, S3FSConsistencyPollingTest.java, S3FSConsistencyTest.java
> Under certain S3 consistency scenarios, Hadoop's S3FileSystem can 'truncate' files, especially
when writing reduce outputs.  We've noticed this at tracksimple where we use the S3FS as the
direct input and output of our MapReduce jobs.  The symptom of this problem is a file in the
filesystem that is an exact multiple of the FS block size - exactly 32MB, 64MB, 96MB, etc.
in length.
> The issue appears to be caused by renaming a file that has recently been written, and
getting a stale INode read from S3.  When a reducer is writing job output to the S3FS, the
normal series of S3 key writes for a 3-block file looks something like this:
> Task Output:
> 1) Write the first block (block_99)
> 2) Write an INode (/myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz)
containing [block_99]
> 3) Write the second block (block_81)
> 4) Rewrite the INode with new contents [block_99, block_81]
> 5) Write the last block (block_-101)
> 6) Rewrite the INode with the final contents [block_99, block_81, block_-101]
> Copy Output to Final Location (ReduceTask#copyOutput):
> 1) Read the INode contents from /myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz,
which gives [block_99, block_81, block_-101]
> 2) Write the data from #1 to the final location, /myjob/part-00133.gz
> 3) Delete the old INode 
> The output file is truncated if S3 serves a stale copy of the temporary INode.  In copyOutput,
step 1 above, it is possible for S3 to return a version of the temporary INode that contains
just [block_99, block_81].  In this case, we write this new data to the final output location,
and 'lose' block_-101 in the process.  Since we then delete the temporary INode, we've lost
all references to the final block of this file and it's orphaned in the S3 bucket.
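
The race described above can be mimicked with a toy in-memory model. This is only a sketch under stated assumptions: StaleRenameDemo, its methods, the staleness parameter and the short paths are all hypothetical, and it does not reproduce how S3FileSystem or Jets3tFileSystemStore are actually implemented. It keeps every version of an INode and lets a read return an older one, which is enough to show the last block being dropped by the rename.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the rename race; all names here are hypothetical. */
final class StaleRenameDemo {
  // Every version written under each key, oldest first.
  private final Map<String, List<List<String>>> versions = new HashMap<>();

  void storeINode(String path, String... blocks) {
    versions.computeIfAbsent(path, k -> new ArrayList<>())
        .add(Arrays.asList(blocks));
  }

  /** Reads a version; staleness 0 is the latest, 1 is one write behind. */
  List<String> retrieveINode(String path, int staleness) {
    List<List<String>> history = versions.get(path);
    int index = Math.max(0, history.size() - 1 - staleness);
    return history.get(index);
  }

  /** Copies the INode to dst, then deletes src, as in the copy steps. */
  List<String> rename(String src, String dst, int staleness) {
    List<String> blocks = retrieveINode(src, staleness);
    storeINode(dst, blocks.toArray(new String[0]));
    versions.remove(src); // the temporary INode is gone for good
    return blocks;
  }

  public static void main(String[] args) {
    StaleRenameDemo store = new StaleRenameDemo();
    String tmp = "/tmp/part"; // placeholder for the task attempt path
    store.storeINode(tmp, "block_99");
    store.storeINode(tmp, "block_99", "block_81");
    store.storeINode(tmp, "block_99", "block_81", "block_-101");
    // A read that is one write behind loses the final block.
    System.out.println(store.rename(tmp, "/final/part", 1));
  }
}
```

Running the rename with a staleness of 0 copies all three blocks; with a staleness of 1 the final INode lists only block_99 and block_81, and nothing references block_-101 any more.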
> This type of consistency error is infrequent but not impossible. We've observed these
failures about once a week for one of our large jobs which runs daily and has 200 reduce outputs;
so we're seeing an error rate of something like 0.07% per reduce.
> These kinds of errors are generally difficult to handle in a system like S3.  We have
a few ideas about how to fix this:
> 1) HACK! Sleep during S3OutputStream#close or #flush to wait for S3 to catch up and make
these less likely.
> 2) Poll for updated MD5 or INode data in Jets3tFileSystemStore#storeINode until S3 says
the INode contents are the same as our local copy.  This could be a config option - "fs.s3.verifyInodeWrites"
or something like that.
> 3) Cache INode contents in-process, so we don't have to go back to S3 to ask for the
current version of an INode.
> 4) Only write INodes once, when the output stream is closed.  This would basically make
S3OutputStream#flush() a no-op.
> 5) Modify the S3FS to somehow version INodes (unclear how we would do this, need some
design work).
> 6) Avoid using the S3FS for temporary task attempt files.
> 7) Avoid using the S3FS completely.
> We wanted to get some guidance from the community before we went down any of these paths.
 Has anyone seen this issue?  Any other suggested workarounds?  We at tracksimple are willing
to invest some time in fixing this and (of course) contributing our fix back, but we wanted
to get an 'ack' from others before we try anything crazy :-).
> I've attached a test app if anyone wants to try and reproduce this themselves.  It takes
a while to run (depending on the 'weather' in S3 right now), but should eventually detect
a consistency 'error' that manifests itself as a truncated file.

