hive-issues mailing list archives

From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
Date Thu, 03 Dec 2020 09:26:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519505&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519505 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 03/Dec/20 09:25
            Start Date: 03/Dec/20 09:25
    Worklog Time Spent: 10m 
      Work Description: pvargacl commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r534995678



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##########
@@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
       }
       fs.delete(dead, true);
     }
-    return true;
+    // Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+    // number reaches 0.
+    return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+  }
+
+  /**
+   * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+   * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+   * finish, e.g. because of an open transaction at the time of compaction.
+   * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+   * obsolete directories, then the Cleaner has more work to do.
+   * @param location location of table
+   * @return number of dirs left for the cleaner to clean – eventually
+   * @throws IOException
+   */
+  private int getNumEventuallyObsoleteDirs(String location, Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots)
+      throws IOException {
+    ValidTxnList validTxnList = new ValidReadTxnList();
+    //save it so that getAcidState() sees it
+    conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+    ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+    Path locPath = new Path(location);
+    AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,
+        Ref.from(false), false, dirSnapshots);
+    return dir.getObsolete().size();
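
A side note on the trick used in getNumEventuallyObsoleteDirs above: a default-constructed ValidReadTxnList behaves as if every transaction were already committed, which is why getAcidState() computed against it reports every directory that will eventually become obsolete. A tiny standalone sketch of that behaviour (the expectations in the comments are my reading of the no-arg constructor, not something stated in this thread):

import org.apache.hadoop.hive.common.ValidReadTxnList;
import org.apache.hadoop.hive.common.ValidTxnList;

public class EverythingCommittedSketch {
  public static void main(String[] args) {
    // No-arg constructor: no open or aborted txns and a maximal high watermark (assumption),
    // so every txn id looks committed when getAcidState() consults it.
    ValidTxnList everythingCommitted = new ValidReadTxnList();
    System.out.println(everythingCommitted.isTxnValid(42L));   // expected: true, nothing looks open
    // The same serialized form the patch stores under ValidTxnList.VALID_TXNS_KEY.
    System.out.println(everythingCommitted.writeToString());
  }
}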

Review comment:
       In upstream this would cause problems: for the delayed cleaner it is necessary that every
compaction entry cleans only its own obsolete files. The default is that every compaction's cleaning
is delayed by 15 minutes, so by the time a cleaning job finally runs it is much more likely that
other jobs are already waiting in the queue. Each of them has its own highestWriteId saved, and each
will clean only up to that id. You should not merge those.
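
A minimal sketch of that constraint, assuming the delayed-cleaner setup where each queued entry carries its own highestWriteId (the class and helper names below are illustrative, not the actual Cleaner code): each entry builds a write-id list capped at its own high watermark, so a run for an older entry cannot see, and therefore cannot delete, directories that only become obsolete because of a later compaction still waiting in the queue.

import java.util.BitSet;
import org.apache.hadoop.hive.common.ValidReaderWriteIdList;

public class PerEntryWriteIdSketch {
  // Illustrative helper: cap the write-id list at the entry's own highestWriteId.
  static ValidReaderWriteIdList cappedAt(String fullTableName, long highestWriteId) {
    // Assumes no open/aborted write ids below the cap; real code would carry those over.
    return new ValidReaderWriteIdList(fullTableName, new long[0], new BitSet(), highestWriteId);
  }

  public static void main(String[] args) {
    // Two queued cleaning jobs for the same table, each with its own saved highestWriteId.
    ValidReaderWriteIdList forCompaction1 = cappedAt("default.table_1", 10L);
    ValidReaderWriteIdList forCompaction2 = cappedAt("default.table_1", 17L);
    // A delta written at write id 15 is above compaction1's cap, so only the later
    // entry's run would treat it (and what it makes obsolete) as in scope.
    System.out.println(forCompaction1.isWriteIdValid(15L));   // expected: false
    System.out.println(forCompaction2.isWriteIdValid(15L));   // expected: true
  }
}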




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 519505)
    Time Spent: 6h  (was: 5h 50m)

> compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
> -------------------------------------------------------------------------------------------
>
>                 Key: HIVE-24444
>                 URL: https://issues.apache.org/jira/browse/HIVE-24444
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 6h
>  Remaining Estimate: 0h
>
> This is an improvement on HIVE-24314, in which markCleaned() is called only if +any+ files
> are deleted by the cleaner. This could cause a problem in the following case:
> Say that for table_1, cleaning after compaction1 was blocked by an open txn, and compaction
> is run again on the same table (compaction2). Both compaction1 and compaction2 could be in
> "ready for cleaning" at the same time. By this time the blocking open txn could be committed.
> When the cleaner runs, one of compaction1 and compaction2 will remain in the "ready for
> cleaning" state:
> Say compaction2 is picked up by the cleaner first. The Cleaner deletes all obsolete files.
> Then compaction1 is picked up by the cleaner; the cleaner doesn't remove any files, and
> compaction1 will stay in the queue in a "ready for cleaning" state.
> HIVE-24291 already solves this issue, but if it isn't usable (for example, if HMS schema
> changes are out of the question) then HIVE-24314 + this change will fix the issue of the
> Cleaner not removing all obsolete files.
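
To make the intended gating concrete, here is a self-contained toy model (all names are hypothetical; this is not the actual Cleaner code): an entry is marked cleaned only when its location holds no eventually-obsolete directories after the delete pass; otherwise it stays in "ready for cleaning" and a later cleaner cycle retries it.

import java.util.ArrayDeque;
import java.util.Queue;

public class CleanerGatingSketch {
  // Hypothetical stand-in for a queued "ready for cleaning" entry.
  record CleaningEntry(String table, long compactionId) {}

  // Stand-in for Cleaner.removeFiles() after this change: report success only when
  // no eventually-obsolete directories remain in the table/partition location.
  static boolean removeFiles(CleaningEntry entry, int eventuallyObsoleteDirsLeft) {
    return eventuallyObsoleteDirsLeft == 0;
  }

  static void handle(Queue<CleaningEntry> queue, CleaningEntry entry, int obsoleteLeft) {
    if (removeFiles(entry, obsoleteLeft)) {
      System.out.println("markCleaned: " + entry);   // entry leaves the queue
    } else {
      queue.add(entry);                              // stays "ready for cleaning" for a later cycle
      System.out.println("requeued: " + entry);
    }
  }

  public static void main(String[] args) {
    Queue<CleaningEntry> queue = new ArrayDeque<>();
    // An entry whose location still holds a directory protected by an open txn stays queued...
    handle(queue, new CleaningEntry("table_1", 1L), 1);
    // ...while an entry whose location has nothing (eventually) obsolete left is marked cleaned,
    // even if this particular run deleted no files itself (the compaction1 case above).
    handle(queue, new CleaningEntry("table_1", 2L), 0);
  }
}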



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
