impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sailesh Mukil (Code Review)" <>
Subject [Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Date Mon, 28 Mar 2016 21:56:17 GMT
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems

Patch Set 5:

File be/src/exec/hdfs-table-sink.h:

Line 47: (
> remove
File be/src/runtime/

Line 754: HdfsFsCache::HdfsFsMap partition_connection_cache;
> is this per partition or per filesystem? Not clear here what role this play
It's a connection cache for all the filesystems that the partitions connect to. So I guess
it makes more sense to call it filesystem_connection_cache. Changed it.

Line 772: over
> in or on, not over (you partition over a partition function).

Line 772: allows to
> "allow us to" or better "allows writes to tables ..."

Line 773: So we need to open connections to different filesystems as necessary
> this sentence should be where you explain the "one cnxn per fs" requirement
I removed mention of the requirement. See my comment below for explanation.

Line 774: the connections
> "one connection per filesystem"

Line 776: we don't want to open multiple
        :     // connections to the same filesystem. Doing so could cause inconsistencies
in the
        :     // filesystem and return with errors
> this is vague - what could go wrong?
I spent all afternoon trying to recreate the errors that I saw earlier, so that I could write
a better description of what could go wrong if we use multiple connections to a FS. However,
I wasn't able to reproduce the errors, which leads me to believe that the errors I saw before
were probably because of some other bugs which I fixed later.
So I'm removing mention of the "inconsistencies", however, it's still better to cache connections
(so that we're not unnecessarily creating too many connections at once).

Line 772: InsertStmt allows to insert in tables that have partitions over multiple filesystems.
        :     // So we need to open connections to different filesystems as necessary.
        :     // We use a local connection cache and populate it with the connections to the
        :     // filesystems before we do the actual operations (copy, move, delete, etc.).
This is
        :     // because the operations are done in parallel and we don't want to open multiple
        :     // connections to the same filesystem. Doing so could cause inconsistencies
in the
        :     // filesystem and return with errors.
> This comment as a whole would be better before line 754.
I've removed this comment. See CR comment below.

Line 779: partition_connection
> partition_fs_connection (or just fs_connection)

> does this leave behind any cleanup that needs to be done?
No the context of everything that happens in this loop is local. And none of the operations
actually start until after this loop execution is completed. So returning on an error here
would not leave any uncleaned state.

Line 798: 
> this blank line can go

Line 799: bool is_s3_path = false;
        :     if (IsS3APath(part_path.c_str())) is_s3_path = true;
> write this as:

Line 823:  
> indentation (not your change)
isn't it already 2 tabs?

Line 846: !is_s3_path
> move this into the if on line 843, save a level of nesting

Line 864: if (!is_s3_path
> again, combine with above line
File be/src/util/

Line 55: Status::OK();
> combine declaration with line 57

Line 60: if (connection_status.ok()) {
> instead of this indentation, make a method call "AddError(const string& err
File be/src/util/hdfs-bulk-ops.h:

Line 120: void set_local_connection_cache(HdfsFsCache::HdfsFsMap* local_connection_cache)
        :     DCHECK(local_connection_cache != NULL);
        :     local_connection_cache_ = local_connection_cache;
        :   }
> how about just making this a parameter to Execute() - is it needed anywhere
No, I've removed this and moved it to Execute().

Line 137: local
> what does it mean to be 'local'? Who owns this object?
I see the confusion. I've removed mentions of 'local' and renamed this field. It's local with
respect to the process wide global connection cache. But this class doesn't care about that.
Also, it's not owned. The caller should supply the connection cache.
File be/src/util/

Line 76: const char* src, const char* to_check
> does this ever get called with char*s that aren't derived from string objec
Not currently, no.

Line 77: scheme of our filesystem support
> grammar

Line 79: strncmp
> this needs fixing. You need to extract the scheme (by searching for "://" I

Line 82: bool SpansMultipleFilesystems(const char* pathA, const char* pathB) {
       :   return !IsPathOnFilesystem(pathA, pathB);
       : }
> remove this method. Have the caller just call IsPathOnFilesystem().
File be/src/util/hdfs-util.h:

Line 51: to_check
> 'to_check' doesn't really tell me what this is (and the type doesn't help).

Line 51: IsPathOnFilesystem
> All paths are on some filesystem. Call this "FilesystemsMatch" or something
I've accommodated for that now. Currently, our default FS assumption is HDFS. So if no scheme
is present, then we assume it's a HDFS path.
We need to change our defaultFS assumptions once IMPALA-1850 goes in. I'll talk to Anuj about
that and make sure this patch also does away with that assumption.
File common/thrift/ImpalaInternalService.thrift:

Line 390:   // Full path to the base directory for this partition.
> be explicit about what this includes. Does it have the scheme? Is it a path
File fe/src/main/java/com/cloudera/impala/common/

Line 173:       destFs.rename(sourceFile, destFile);
> for simplicity, return here and then save the else { } block below

Line 175: isDestDfs && sameFileSystem
> what you actually mean here is if (!arePathsInSameEncryptionZone(...)) whic
I didn't do it this way earlier because if we called arePathsInSameEncryptionZone() for S3,
it would crash. But I've added a check in arePathsInSameEncryptionZone() to see if the paths
are on HDFS and return false otherwise.

Line 187:           true, true, CONF);
> can you get more of this on the previous line?

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil <>
Gerrit-Reviewer: Henry Robinson <>
Gerrit-Reviewer: Michael Brown <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-Reviewer: Sailesh Mukil <>
Gerrit-Reviewer: Taras Bobrovytsky <>
Gerrit-HasComments: Yes

View raw message