impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bharath Vissapragada (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
Date Thu, 01 Dec 2016 00:41:12 GMT
Hello Dimitris Tsirogiannis, Alex Behm,

I'd like you to reexamine a change.  Please visit

to look at the new patch set (#16).

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

This patch improves the block metadata loading (locations and disk
storage IDs) for partitioned and un-partitioned tables in the Catalog

Without this patch:
We loop through each and every file in the table/partition directories
and call getFileBlockLocations() on it to obtain the block metadata.
This results in large number of RPC calls to the Namenode, especially
for tables with large no. of files/partitions.

With this patch:
We move the block metadata querying to use listStatus() call which
accepts a directory as input and fetches the 'BlockLocation' objects
for every file recursively in that directory. This improves the
metadata loading in the following ways.

- For non-partitioned tables, we query all the BlockLocations in a
single RPC call in the base table directory and load the corresponding
disk IDs.

- For partitioned tables, we query the BlockLocations for all the
partitions residing under the base table directories in a single RPC
and then load every partition with non-default partition directory

- REFRESH on a table reloads the block metadata from scratch for
every data file every time. So it can be used as a replacement for
invalidate in situations like HDFS block rebalancing which needs
block metadata update.

Also, this patch does away with VolumeIds returned by the HDFS NN
and uses the new StorageIDs returned by the BlockLocation class.
These StorageIDs are UUID strings and hence are mapped to a
per-node 0-based index as expected by the backend. In the upcoming
versions of Hadoop APIs, getFileBlockStorageLocations() is deprecated
and instead the listStatus() returns BlockLocations with storage IDs
embedded. This patch makes use of this improvement to reduce an
additional RPC to the NN to fetch the storage locations.

Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
M fe/src/main/java/org/apache/impala/catalog/
A fe/src/main/java/org/apache/impala/catalog/
M fe/src/main/java/org/apache/impala/catalog/
M fe/src/main/java/org/apache/impala/catalog/
M fe/src/main/java/org/apache/impala/common/
M fe/src/main/java/org/apache/impala/service/
6 files changed, 365 insertions(+), 338 deletions(-)

  git pull ssh:// refs/changes/48/5148/16
To view, visit
To unsubscribe, visit

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Bharath Vissapragada <>
Gerrit-Reviewer: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Mostafa Mokhtar <>

View raw message