impala-issues mailing list archives

From "Mostafa Mokhtar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IMPALA-4789) Slow metadata loading with many partitions that have inconsistent HDFS path qualification
Date Wed, 22 Mar 2017 18:34:41 GMT

    [ https://issues.apache.org/jira/browse/IMPALA-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936886#comment-15936886 ]

Mostafa Mokhtar commented on IMPALA-4789:
-----------------------------------------

IMPALA-5042 puts the loading of partitions with custom locations on par with the loading of regular partitions.

> Slow metadata loading with many partitions that have inconsistent HDFS path qualification
> -----------------------------------------------------------------------------------------
>
>                 Key: IMPALA-4789
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4789
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 2.8.0
>            Reporter: Alexander Behm
>            Assignee: Alexander Behm
>            Priority: Blocker
>              Labels: performance, regression
>             Fix For: Impala 2.9.0
>
>
> The fix for IMPALA-4172/IMPALA-3653 introduced a performance regression for loading tables that have many partitions with:
> 1. inconsistent HDFS path qualification or
> 2. a custom location (not under the table root dir)
> For the first issue consider a table whose root path is at *'hdfs://localhost:8020/warehouse/tbl/'*.
> A partition with an unqualified location *'/warehouse/tbl/p=1'* will not be recognized as being a descendant of the table root dir by FileSystemUtil.isDescendantPath() because of how Path.equals() behaves, even if *'hdfs://localhost:8020'* is the default filesystem.
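
To make the qualification problem concrete, here is a minimal sketch. It uses java.net.URI as a stand-in for Hadoop's Path, so it is not Impala's code; the class name and the qualify() and isDescendant() helpers are hypothetical. It only illustrates why a comparison that includes scheme and authority rejects an unqualified location, and how qualifying the location against the default filesystem first would let the check succeed.

```java
import java.net.URI;
import java.util.Objects;

// Illustration only: java.net.URI stands in for Hadoop's Path.
class PathQualificationSketch {
  // Hypothetical helper: fills in scheme and authority from the default
  // filesystem when the location does not carry its own.
  static URI qualify(URI location, URI defaultFs) {
    if (location.getScheme() != null) return location;
    return defaultFs.resolve(location.getPath());
  }

  // Hypothetical descendant check that compares scheme and authority
  // before looking at the path itself.
  static boolean isDescendant(URI child, URI ancestor) {
    if (!Objects.equals(child.getScheme(), ancestor.getScheme())) return false;
    if (!Objects.equals(child.getAuthority(), ancestor.getAuthority())) return false;
    String base = ancestor.getPath();
    if (!base.endsWith("/")) base += "/";
    return child.getPath().startsWith(base);
  }

  public static void main(String[] args) {
    URI defaultFs = URI.create("hdfs://localhost:8020");
    URI tblRoot = URI.create("hdfs://localhost:8020/warehouse/tbl/");
    URI partDir = URI.create("/warehouse/tbl/p=1");

    // The unqualified location has no scheme, so the comparison rejects it.
    System.out.println(isDescendant(partDir, tblRoot));                      // false
    // After qualification the same location is recognized as a descendant.
    System.out.println(isDescendant(qualify(partDir, defaultFs), tblRoot));  // true
  }
}
```

Whether qualifying both sides before comparing is the right fix for Impala depends on how table and partition locations are stored, so treat this only as an illustration of the failure mode.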
> Such partitions are incorrectly recognized as having a custom location and are treated specially. The treatment of such partitions is very inefficient, as shown in the following code snippets:
> HdfsTable.loadAllPartitions():
> {code}
> ...
>         if (!dirsToLoad.contains(partDir) &&
>             !FileSystemUtil.isDescendantPath(partDir, tblLocation)) {  // <--- this descendant check fails
>           // This partition has a custom filesystem location. Load its file/block
>           // metadata separately by adding it to the list of dirs to load.
>           dirsToLoad.add(partDir);
>         }
> ...
> {code}
> HdfsTable.loadMetadataAndDiskIds() calls HdfsTable.loadBlockMetadata() once for every location:
> {code}
>   private void loadMetadataAndDiskIds(List<Path> locations,
>       HashMap<Path, List<HdfsPartition>> partsByPath) {
>     LOG.info(String.format("Loading file and block metadata for %s partitions: %s",
>         partsByPath.size(), getFullName()));
>     for (Path location: locations) { loadBlockMetadata(location, partsByPath); }
>     LOG.info(String.format("Loaded file and block metadata for %s partitions: %s",
>         partsByPath.size(), getFullName()));
>   }
> {code}
> HdfsTable.loadBlockMetadata():
> {code}
> ...
>       // Clear the state of partitions under dirPath since they are now updated based
>       // on the current snapshot of files in the directory.
>       // <--- partsByPath has an entry for every partition in the table
>       for (Map.Entry<Path, List<HdfsPartition>> entry: partsByPath.entrySet()) {
>         Path partDir = entry.getKey();
>         if (!FileSystemUtil.isDescendantPath(partDir, dirPath)) continue;
>         for (HdfsPartition partition: entry.getValue()) {
>           partition.setFileDescriptors(new ArrayList<FileDescriptor>());
>         }
>       }
> ...
> {code}
> As a result, isDescendantPath() is called roughly #numLocations * #totalPartitions times, which adds up quickly for tables with many partitions.
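
To give a feel for that growth, the following sketch counts the checks performed by a loop of the same shape: one full scan of the partition map per location. Plain strings stand in for paths and partitions, and the class and method names are hypothetical; it is not the Impala code itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: strings stand in for Path and HdfsPartition.
class DescendantCheckCount {
  // Builds a table in which every partition has its own location, then
  // counts the descendant checks done when each location scans all entries.
  static long countChecks(int numPartitions) {
    Map<String, List<String>> partsByPath = new HashMap<>();
    List<String> locations = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      String dir = "/custom/p=" + i + "/";
      partsByPath.computeIfAbsent(dir, k -> new ArrayList<>()).add("partition " + i);
      locations.add(dir);
    }

    long checks = 0;
    for (String location : locations) {
      for (String partDir : partsByPath.keySet()) {
        checks++;  // one descendant check per (location, partition dir) pair
        if (!partDir.startsWith(location)) continue;
        // The per-partition state would be reset here.
      }
    }
    return checks;
  }

  public static void main(String[] args) {
    System.out.println(countChecks(1000));  // 1000000
  }
}
```

With 1,000 partitions that are all treated as custom, the loop already performs a million checks; multiplying the partition count by ten multiplies the work by a hundred.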
> There are two issues to fix here:
> 1. The bug in recognizing partitions under the root table dir (for inconsistent qualification of table/partition locations)
> 2. The expensive loop for partitions with custom locations (even if legitimately custom)
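
For the second issue, one possible direction (an assumption for illustration, not necessarily the change that was committed) is to look up the partitions for a custom location directly in the map instead of scanning every entry. The sketch below uses strings as path keys and a hypothetical Partition class; it assumes the map keys are already normalized consistently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a hypothetical alternative to scanning all map entries.
class DirectLookupSketch {
  // Hypothetical stand-in for a partition and its file descriptors.
  static class Partition {
    List<String> fileDescriptors = new ArrayList<>(Arrays.asList("stale-file"));
  }

  // Resets only the partitions registered under exactly dirPath and returns
  // how many were reset. This is a single map lookup per location.
  static int resetPartitionsAt(String dirPath, Map<String, List<Partition>> partsByPath) {
    List<Partition> parts = partsByPath.get(dirPath);
    if (parts == null) return 0;
    for (Partition p : parts) {
      p.fileDescriptors = new ArrayList<>();
    }
    return parts.size();
  }

  public static void main(String[] args) {
    Map<String, List<Partition>> partsByPath = new HashMap<>();
    partsByPath.put("/custom/a", new ArrayList<>(Arrays.asList(new Partition())));
    partsByPath.put("/custom/b", new ArrayList<>(Arrays.asList(new Partition())));

    System.out.println(resetPartitionsAt("/custom/a", partsByPath));  // 1
  }
}
```

An exact-match lookup only covers a directory that is itself a partition location. Loading the table root, where many partitions live below one directory, would still need a descendant check or a different grouping.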



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
