impala-reviews mailing list archives

From "Sailesh Mukil (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5378: Disk IO manager needs to understand ADLS
Date Wed, 31 May 2017 22:38:54 GMT
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5378: Disk IO manager needs to understand ADLS
......................................................................


Patch Set 1:

(2 comments)

> (2 comments)
 > 
 > questions:
 > - what about insert staging for adls (in coordinator.cc)?

ADLS claims to have atomic renames, so we don't need to worry about insert staging the way
we did for S3.

 > - what about hdfs-fs-cache, does that need to be extended?

I'm not sure which cache you mean, so I'll address both. The file handle cache currently
doesn't support caching remote file handles, and we currently don't support SET CACHED for
S3 or ADLS either.

http://gerrit.cloudera.org:8080/#/c/7033/1/be/src/runtime/disk-io-mgr-scan-range.cc
File be/src/runtime/disk-io-mgr-scan-range.cc:

Line 402:   // ADLS uses buffer sizes of 4k. Given that, and the above JNI array allocation overhead
> you mean multiples of 4k?
I should have researched this better; I used 4k based on some misinformation. It looks like
the buffer size used is actually 4MB, according to this:
https://docs.microsoft.com/en-us/java/api/com.microsoft.azure.datalake.store._a_d_l_file_input_stream

I also noticed a Hadoop JIRA that mentions better performance with larger buffer sizes:
https://issues.apache.org/jira/browse/HADOOP-14407


The pro of using a 4MB buffer is that we'd be aligned with ADLS and avoid fragmentation.

The con, however, is that we'd spend considerably more CPU allocating the JNI byte buffer
and doing the memcpy.
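
To put rough numbers on the trade-off, here is a minimal sketch that only counts reads. It
assumes each read allocates and copies one JNI byte array of at most chunk_size bytes.
NumReadCalls and LastReadLen are hypothetical helpers for illustration, not code from the
patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the patch: how many reads it takes to
// cover range_len bytes when each read returns at most chunk_size bytes.
int64_t NumReadCalls(int64_t range_len, int64_t chunk_size) {
  assert(range_len >= 0);
  assert(chunk_size > 0);
  return (range_len + chunk_size - 1) / chunk_size;  // ceiling division
}

// Hypothetical helper: bytes covered by the final, possibly short, read.
int64_t LastReadLen(int64_t range_len, int64_t chunk_size) {
  assert(range_len >= 0);
  assert(chunk_size > 0);
  if (range_len == 0) return 0;
  int64_t rem = range_len % chunk_size;
  return rem == 0 ? chunk_size : rem;
}
```

For an 8MB scan range, NumReadCalls gives 2048 reads with 4k chunks and 2 reads with 4MB
chunks, so the smaller size means many more calls and the larger size means a much bigger
allocation and copy per call.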

What do you think would be better to settle for?


http://gerrit.cloudera.org:8080/#/c/7033/1/be/src/runtime/disk-io-mgr.h
File be/src/runtime/disk-io-mgr.h:

Line 764:   int RemoteADLSDiskId() const { return num_local_disks() + REMOTE_ADLS_DISK_OFFSET; }
> RemoteAdlsDiskId
Done
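
For context, a minimal sketch of how the renamed accessor fits into the disk ID layout,
with the remote queues numbered after the local disks. DiskIdLayout is a hypothetical
stand-in for the relevant part of DiskIoMgr. Only num_local_disks() and
REMOTE_ADLS_DISK_OFFSET appear in the line above; the DFS and S3 entries, their order, and
REMOTE_NUM_DISKS are assumptions here.

```cpp
#include <cassert>

// Hypothetical stand-in for the relevant part of DiskIoMgr.
class DiskIdLayout {
 public:
  // Offsets of the remote queues, counted from the end of the local disks.
  // Everything except REMOTE_ADLS_DISK_OFFSET is an assumption.
  enum {
    REMOTE_DFS_DISK_OFFSET = 0,
    REMOTE_S3_DISK_OFFSET,
    REMOTE_ADLS_DISK_OFFSET,
    REMOTE_NUM_DISKS
  };

  explicit DiskIdLayout(int num_local_disks)
      : num_local_disks_(num_local_disks) {
    assert(num_local_disks >= 0);
  }

  int num_local_disks() const { return num_local_disks_; }
  int num_total_disks() const { return num_local_disks_ + REMOTE_NUM_DISKS; }

  // Remote IDs start right after the last local disk ID.
  int RemoteDfsDiskId() const { return num_local_disks() + REMOTE_DFS_DISK_OFFSET; }
  int RemoteS3DiskId() const { return num_local_disks() + REMOTE_S3_DISK_OFFSET; }
  int RemoteAdlsDiskId() const { return num_local_disks() + REMOTE_ADLS_DISK_OFFSET; }

 private:
  int num_local_disks_;
};
```

Under these assumptions, a node with 4 local disks uses IDs 0 through 3 for local disks
and ID 6 for ADLS.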


-- 
To view, visit http://gerrit.cloudera.org:8080/7033
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I067f053fec941e3631610c5cc89a384f257ba906
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sailesh@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sailesh@cloudera.com>
Gerrit-HasComments: Yes
