hadoop-common-issues mailing list archives

From "Sean Mackrory (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-14036) S3Guard: intermittent duplicate item keys failure
Date Wed, 15 Mar 2017 01:01:53 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925351#comment-15925351 ]

Sean Mackrory edited comment on HADOOP-14036 at 3/15/17 12:59 AM:
------------------------------------------------------------------

Not proposing this for inclusion just yet (though it may turn out to be exactly the right
solution); it's just a proof of concept of the problem. I see paths getting added to the containers
of objects to move both in the loop I'm modifying and again below the comment, "We
moved all the children, now move the top-level dir."

I should dig a bit into the listObjects call, as I'm curious why we don't see this problem
in many more tests / workloads that involve renames. I'm also not entirely sure we actually
have to move the top-level dir last (although my current fix ensures that it is added last).
If the move isn't atomic, the invariant that parent paths always exist will be violated
at some point for either the new path or the old path, and this particular operation is just adding
entries to the collection that gets broken into batches. Doing it last, as we do, seems cleaner IMO,
but I want to think it through a bit more. Speak up if you have any insight or opinions there...

After applying this fix (checking whether the directory we're adding matches the parent
directory we add separately at the end, and skipping it if so), I was able
to run that test repeatedly without problems; after reverting the fix, the
issue reproduced at least 50% of the time. On one run I got the batch of failures listed below, and I'm
positive no other workload was using that bucket at the time, but I've since been able to run each
of those tests successfully and complete several more full runs without a problem:

{code}
Failed tests: 
  ITestS3GuardToolDynamoDB>S3GuardToolTestBase.testPruneCommandConf:157->S3GuardToolTestBase.testPruneCommand:135->Assert.assertEquals:542->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 expected:<2> but was:<1>
  ITestS3AContractGetFileStatus>AbstractContractGetFileStatusTest.testListLocatedStatusFiltering:499->AbstractContractGetFileStatusTest.verifyListStatus:534->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 length of listStatus(s3a://mackrory/test, org.apache.hadoop.fs.contract.AbstractContractGetFileStatusTest$AllPathsFilter@69b9805a) expected:<2> but was:<1>
  ITestS3AContractGetFileStatus>AbstractContractGetFileStatusTest.testListStatusFiltering:466->AbstractContractGetFileStatusTest.verifyListStatus:534->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 length of listStatus(s3a://mackrory/test, org.apache.hadoop.fs.contract.AbstractContractGetFileStatusTest$MatchesNameFilter@4ce8f437) expected:<1> but was:<0>
  ITestS3AContractGetFileStatus>AbstractContractGetFileStatusTest.testComplexDirActions:143->AbstractContractGetFileStatusTest.checkListStatusStatusComplexDir:162->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 listStatus(): file count in 1 directory and 0 files expected:<4> but was:<0>

Tests in error: 
  ITestS3AEncryptionSSEKMSDefaultKey>AbstractTestS3AEncryption.testEncryptionOverRename:71 » FileNotFound
  ITestS3AContractSeek>AbstractContractSeekTest.testReadSmallFile:531 » FileNotFound
  ITestS3AContractSeek>AbstractContractSeekTest.testNegativeSeek:181 » FileNotFound
  ITestS3AContractSeek>AbstractContractSeekTest.testSeekFile:207 » FileNotFound ...
  ITestS3AContractSeek>AbstractContractSeekTest.testReadFullyPastEOF:467 » FileNotFound
  ITestS3AContractDistCp>AbstractContractDistCpTest.deepDirectoryStructureToRemote:90->AbstractContractDistCpTest.deepDirectoryStructure:139 » FileNotFound
  ITestS3AContractDistCp>AbstractContractDistCpTest.largeFilesToRemote:96->AbstractContractDistCpTest.largeFiles:174 » FileNotFound
{code}
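The skip-the-parent check described above can be sketched roughly as follows. This is a minimal, self-contained illustration with hypothetical names (MoveBatchSketch, buildMoveList), not the actual S3AFileSystem/DynamoDBMetadataStore code: the idea is simply that the child loop must not also emit the top-level source dir, since that path is appended separately at the end and a duplicate key in one DynamoDB batch triggers the ValidationException seen in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MoveBatchSketch {

    /**
     * Collect the paths to move, skipping the top-level source dir if the
     * object listing happens to include it, so that it appears exactly once:
     * appended last, after all of its children.
     */
    static List<String> buildMoveList(List<String> listing, String srcDir) {
        List<String> toMove = new ArrayList<>();
        for (String path : listing) {
            // The fix: the listing may contain the marker for srcDir itself;
            // adding it here AND at the end would put a duplicate key into
            // the batch handed to DynamoDB.
            if (path.equals(srcDir)) {
                continue;
            }
            toMove.add(path);
        }
        // "We moved all the children, now move the top-level dir."
        toMove.add(srcDir);
        return toMove;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList("/a/", "/a/f1", "/a/f2");
        // Without the skip, "/a/" would appear twice in the same batch.
        System.out.println(buildMoveList(listing, "/a/"));
    }
}
```

Whether deduplicating here or keeping entries unique further down (where the collection is split into batches) is the better home for this check is exactly the open question above.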



> S3Guard: intermittent duplicate item keys failure
> -------------------------------------------------
>
>                 Key: HADOOP-14036
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14036
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: HADOOP-13345
>            Reporter: Aaron Fabbri
>            Assignee: Mingliang Liu
>         Attachments: HADOOP-14036-HADOOP-13345.000.patch
>
>
> I see this occasionally when running integration tests with -Ds3guard -Ddynamo:
> {noformat}
> testRenameToDirWithSamePrefixAllowed(org.apache.hadoop.fs.s3a.ITestS3AFileSystemContract)  Time elapsed: 2.756 sec  <<< ERROR!
> org.apache.hadoop.fs.s3a.AWSServiceIOException: move: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: Provided list of item keys contains duplicates (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: QSBVQV69279UGOB4AJ4NO9Q86VVV4KQNSO5AEMVJF66Q9ASUAAJG): Provided list of item keys contains duplicates (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: QSBVQV69279UGOB4AJ4NO9Q86VVV4KQNSO5AEMVJF66Q9ASUAAJG)
>         at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:178)
>         at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.move(DynamoDBMetadataStore.java:408)
>         at org.apache.hadoop.fs.s3a.S3AFileSystem.innerRename(S3AFileSystem.java:869)
>         at org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:662)
>         at org.apache.hadoop.fs.FileSystemContractBaseTest.rename(FileSystemContractBaseTest.java:525)
>         at org.apache.hadoop.fs.FileSystemContractBaseTest.testRenameToDirWithSamePrefixAllowed(FileSystemContractBaseTest.java:669)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcces
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

