hive-issues mailing list archives

From "Prasanth Jayachandran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-21040) msck does unnecessary file listing at last level of directory tree
Date Thu, 20 Dec 2018 01:00:57 GMT

    [ https://issues.apache.org/jira/browse/HIVE-21040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725484#comment-16725484 ]

Prasanth Jayachandran commented on HIVE-21040:
----------------------------------------------

+1, pending tests.

> msck does unnecessary file listing at last level of directory tree
> ------------------------------------------------------------------
>
>                 Key: HIVE-21040
>                 URL: https://issues.apache.org/jira/browse/HIVE-21040
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Vihang Karajgaonkar
>            Assignee: Vihang Karajgaonkar
>            Priority: Major
>         Attachments: HIVE-21040.01.patch, HIVE-21040.02.patch
>
>
> Here is the code snippet which is run by {{msck}} to list directories
> {noformat}
>       final Path currentPath = pd.p;
>       final int currentDepth = pd.depth;
>       FileStatus[] fileStatuses = fs.listStatus(currentPath, FileUtils.HIDDEN_FILES_PATH_FILTER);
>       // found no files under a sub-directory under table base path; it is possible that the table
>       // is empty and hence there are no partition sub-directories created under base path
>       if (fileStatuses.length == 0 && currentDepth > 0 && currentDepth < partColNames.size()) {
>         // since maxDepth is not yet reached, we are missing partition
>         // columns in currentPath
>         logOrThrowExceptionWithMsg(
>             "MSCK is missing partition columns under " + currentPath.toString());
>       } else {
>         // found files under currentPath add them to the queue if it is a directory
>         for (FileStatus fileStatus : fileStatuses) {
>           if (!fileStatus.isDirectory() && currentDepth < partColNames.size()) {
>             // found a file at depth which is less than number of partition keys
>             logOrThrowExceptionWithMsg(
>                 "MSCK finds a file rather than a directory when it searches for "
>                     + fileStatus.getPath().toString());
>           } else if (fileStatus.isDirectory() && currentDepth < partColNames.size()) {
>             // found a sub-directory at a depth less than number of partition keys
>             // validate if the partition directory name matches with the corresponding
>             // partition colName at currentDepth
>             Path nextPath = fileStatus.getPath();
>             String[] parts = nextPath.getName().split("=");
>             if (parts.length != 2) {
>               logOrThrowExceptionWithMsg("Invalid partition name " + nextPath);
>             } else if (!parts[0].equalsIgnoreCase(partColNames.get(currentDepth))) {
>               logOrThrowExceptionWithMsg(
>                   "Unexpected partition key " + parts[0] + " found at " + nextPath);
>             } else {
>               // add sub-directory to the work queue if maxDepth is not yet reached
>               pendingPaths.add(new PathDepthInfo(nextPath, currentDepth + 1));
>             }
>           }
>         }
>         if (currentDepth == partColNames.size()) {
>           return currentPath;
>         }
>       }
> {noformat}
> You can see that when {{currentDepth}} is at {{maxDepth}}, the code still does an unnecessary listing of the files. We can improve this call by checking {{currentDepth}} and bailing out early.
> This can significantly improve the performance of the msck command, especially when there are a lot of files in each partition on remote filesystems like S3 or ADLS.
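
The early bail-out described above might look roughly like the sketch below. It is not the attached patch. It assumes that {{maxDepth}} equals partColNames.size(), uses java.nio.file in place of Hadoop's FileSystem so that it runs standalone, and introduces a hypothetical PathDepth class (standing in for PathDepthInfo) and a listCalls counter purely for illustration. Error reporting is reduced to skipping entries that do not match.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class MsckEarlyExitSketch {
  // Counts directory listings so the saving is observable; illustration only.
  static int listCalls = 0;

  // Hypothetical stand-in for the PathDepthInfo used in the snippet above.
  static final class PathDepth {
    final Path p;
    final int depth;

    PathDepth(Path p, int depth) {
      this.p = p;
      this.depth = depth;
    }
  }

  static List<Path> findPartitionDirs(Path base, List<String> partColNames)
      throws IOException {
    List<Path> partitions = new ArrayList<>();
    Deque<PathDepth> pending = new ArrayDeque<>();
    pending.add(new PathDepth(base, 0));
    while (!pending.isEmpty()) {
      PathDepth pd = pending.poll();
      // Early exit: at full partition depth the directory itself is the result,
      // so its (possibly very large) file listing is never requested.
      if (pd.depth == partColNames.size()) {
        partitions.add(pd.p);
        continue;
      }
      listCalls++;
      try (DirectoryStream<Path> children = Files.newDirectoryStream(pd.p)) {
        for (Path child : children) {
          String[] parts = child.getFileName().toString().split("=");
          // Simplification: skip anything that is not a matching key=value
          // directory instead of logging or throwing.
          if (Files.isDirectory(child) && parts.length == 2
              && parts[0].equalsIgnoreCase(partColNames.get(pd.depth))) {
            pending.add(new PathDepth(child, pd.depth + 1));
          }
        }
      }
    }
    return partitions;
  }
}
```

For a table partitioned by year and month, calling findPartitionDirs(base, Arrays.asList("year", "month")) in this sketch lists the base directory and each year=... directory, but never the month=... leaf directories that hold the data files.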



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
