impala-reviews mailing list archives

From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4029: Reduce memory requirements for storing file metadata
Date Wed, 05 Apr 2017 19:23:28 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-4029: Reduce memory requirements for storing file metadata

Patch Set 3:

File common/fbs/CatalogObjects.fbs:

Line 36:   offset: long = 0 (id: 0);
Why a default value? Seems potentially dangerous.

Not for this change, but I'm thinking we don't even need to store the offset. If we know the
block size, then we can derive the offset (assuming the list of file blocks is ordered by
offset). Might be worth adding a TODO or recording that idea somewhere.
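
A rough sketch of the derivation, for illustration only. BlockOffsets and its methods are
hypothetical names, not part of the patch, and the sketch assumes every block except possibly
the last has exactly blockSize bytes, which may not hold for all files.

```java
// Hypothetical helper, not part of the Impala patch.
final class BlockOffsets {
  private BlockOffsets() {}

  // Byte offset of the block at 'blockIndex'. Assumes blocks are ordered by
  // offset and every block before this one has exactly 'blockSize' bytes.
  static long offsetOf(int blockIndex, long blockSize) {
    if (blockIndex < 0 || blockSize <= 0) {
      throw new IllegalArgumentException("invalid block index or block size");
    }
    return Math.multiplyExact((long) blockIndex, blockSize);
  }

  // Length of the block at 'blockIndex' under the same assumption. Only the
  // last block of the file may be shorter than 'blockSize'.
  static long lengthOf(int blockIndex, long blockSize, long fileLength) {
    long start = offsetOf(blockIndex, blockSize);
    if (start >= fileLength) {
      throw new IllegalArgumentException("block starts past the end of the file");
    }
    return Math.min(blockSize, fileLength - start);
  }
}
```

If the fixed-size assumption cannot be guaranteed, the stored offset would still be needed.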

Line 40:   length: long = -1 (id: 1);
Seems redundant, why keep it?

Line 65:   compression: FbCompression (id: 3);
Will FlatBuffers add padding to align members? Ideally, we'd optimize for space and not access

File common/thrift/CatalogObjects.thrift:

Line 218:   1: required binary file_desc_data

Line 296:   10: optional list<string> file_name_prefixes
Is this required for the move to flat buffers? I think we should consider an HdfsPathSet abstraction
that assigns path ids and internally compresses the underlying strings. There's a lot of manual
lookups and stitching in the current code. I don't feel too strongly about whether we should
do that now, or clean up the code later.
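
To make the HdfsPathSet idea a bit more concrete, here is a minimal sketch. The class name
PathSet, its methods, and the choice to deduplicate only the directory part of each path are
all assumptions for illustration; a real abstraction could compress the strings differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a path set that hands out integer path ids and
// stores each distinct directory string only once.
final class PathSet {
  private final Map<String, Integer> dirIds = new HashMap<>();
  private final List<String> dirs = new ArrayList<>();
  private final List<Integer> pathDirIds = new ArrayList<>();
  private final List<String> pathNames = new ArrayList<>();

  // Adds 'path' and returns its id. Ids are assigned in insertion order.
  int add(String path) {
    int slash = path.lastIndexOf('/');
    String dir = path.substring(0, slash + 1);  // empty if there is no '/'
    String name = path.substring(slash + 1);
    Integer dirId = dirIds.get(dir);
    if (dirId == null) {
      dirId = dirs.size();
      dirs.add(dir);
      dirIds.put(dir, dirId);
    }
    pathDirIds.add(dirId);
    pathNames.add(name);
    return pathNames.size() - 1;
  }

  // Reassembles the full path for 'pathId'.
  String get(int pathId) {
    return dirs.get(pathDirIds.get(pathId)) + pathNames.get(pathId);
  }

  // Number of distinct directory strings stored.
  int numDirs() {
    return dirs.size();
  }
}
```

Callers would only see path ids and full paths, so the lookups and stitching stay in one place.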

File fe/src/main/java/org/apache/impala/catalog/

Line 78:   public byte toFbCompression() {

File fe/src/main/java/org/apache/impala/catalog/

Line 126:         locations = fileSystem.getFileBlockLocations(fileStatus, 0, fileStatus.getLen());
I think it would be better for now to keep all the code that fetches information from external
systems in one place (HdfsTable). Splitting up the loading and delegating to several classes
may make sense, but that probably requires significant surgery, and the current fetching code
is very much centralized (we iterate over all files in a table).

The loading code in this patch is more confusing to me than before. The meaning of some verbs
like load/create is less clear.

If you agree with that direction, we may not need a FileDescriptor class at all, and can rely
only on the FB to hold the data. It may still make sense to have a FileDescBuilder which you
can use to construct a FbFileDesc.

Line 227:             loc.getNames().length);
Another case where we might be calling out to the NN.

Line 321:     private static int REPLICA_HOST_CACHE_MASK = 0x8000;
It would be less code and more readable to have a single var set to (1 << 15). The places
that need the complement of the mask can bit-wise invert it, i.e. (x & ~MASK).
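
Roughly what I have in mind, as a sketch. The class ReplicaIdx, the constant name, and the
assumption that bit 15 flags a cached replica while the lower 15 bits hold a host index are
all illustrative guesses, not taken from the patch.

```java
// Hypothetical sketch: one mask constant, inverted where the other bits are needed.
final class ReplicaIdx {
  private ReplicaIdx() {}

  // Single mask. Assumed meaning: bit 15 is set when the replica is cached.
  static final int IS_CACHED_MASK = 1 << 15;

  // Packs a host index (must fit in the lower 15 bits) and the cached flag.
  static int encode(int hostIdx, boolean isCached) {
    if (hostIdx < 0 || hostIdx >= IS_CACHED_MASK) {
      throw new IllegalArgumentException("host index out of range: " + hostIdx);
    }
    return isCached ? (hostIdx | IS_CACHED_MASK) : hostIdx;
  }

  // Clears the cached bit by and-ing with the inverted mask.
  static int hostIndex(int encoded) {
    return encoded & ~IS_CACHED_MASK;
  }

  // Tests the cached bit with the same mask.
  static boolean isCached(int encoded) {
    return (encoded & IS_CACHED_MASK) != 0;
  }
}
```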

File fe/src/main/java/org/apache/impala/catalog/

Line 189:   // List of file name prefixes.
Sorted list of file name prefixes?

Line 309:         Pair<Integer, String> compressedFileName =
How predictable is the compression behavior? Do we iterate over the files in lexicographical
order for both HDFS and S3?
I'm worried about the case where an "invalidate metadata" suddenly leads to a higher memory
requirement even if no files/partitions have changed.

Line 340:    * the suffix is equal to 'fileName'.
Should mention that this function expects the fileNames to be sorted.

Line 346:     String commonPrefix = Strings.commonPrefix(prevFileName, fileName);
I'm wondering if it's better to first check the last common prefix instead of the prefix between
the current and previous file names. In the current version, it seems the list of prefixes
could contain strings that are themselves a prefix of another entry.
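
A sketch of the "check the last prefix first" ordering, assuming file names arrive sorted.
PrefixCompressor, Entry, and the local commonPrefix helper are hypothetical stand-ins (the
helper replaces Guava's Strings.commonPrefix only to keep the example self-contained). It
shows the ordering only and does not by itself rule out overlapping prefixes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical result type: index into the prefix list (-1 if none) plus the suffix.
final class Entry {
  final int prefixId;
  final String suffix;

  Entry(int prefixId, String suffix) {
    this.prefixId = prefixId;
    this.suffix = suffix;
  }
}

// Hypothetical sketch, not the code under review.
final class PrefixCompressor {
  final List<String> prefixes = new ArrayList<>();
  private String prevFileName = "";

  // Expects calls in sorted file name order.
  Entry add(String fileName) {
    // Step 1: reuse the most recently recorded prefix if it still applies.
    if (!prefixes.isEmpty()) {
      int lastId = prefixes.size() - 1;
      String last = prefixes.get(lastId);
      if (fileName.startsWith(last)) {
        prevFileName = fileName;
        return new Entry(lastId, fileName.substring(last.length()));
      }
    }
    // Step 2: otherwise fall back to the prefix shared with the previous name.
    String common = commonPrefix(prevFileName, fileName);
    prevFileName = fileName;
    if (common.isEmpty()) return new Entry(-1, fileName);
    prefixes.add(common);
    return new Entry(prefixes.size() - 1, fileName.substring(common.length()));
  }

  // Simple char-wise common prefix of 'a' and 'b'.
  static String commonPrefix(String a, String b) {
    int n = Math.min(a.length(), b.length());
    int i = 0;
    while (i < n && a.charAt(i) == b.charAt(i)) i++;
    return a.substring(0, i);
  }
}
```

With this ordering, a run of files sharing one prefix adds that prefix once and then reuses it.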


Gerrit-MessageType: comment
Gerrit-Change-Id: I483d3cadc9d459f71a310c35a130d073597b0983
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Bharath Vissapragada <>
Gerrit-Reviewer: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes
