impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sailesh Mukil (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Date Mon, 28 Mar 2016 21:56:17 GMT
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
......................................................................


Patch Set 5:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 47: (
> remove
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

Line 754: HdfsFsCache::HdfsFsMap partition_connection_cache;
> is this per partition or per filesystem? Not clear here what role this play
It's a connection cache for all the filesystems that the partitions connect to. So I guess
it makes more sense to call it filesystem_connection_cache. Changed it.


Line 772: over
> in or on, not over (you partition over a partition function).
Done


Line 772: allows to
> "allow us to" or better "allows writes to tables ..."
Done


Line 773: So we need to open connections to different filesystems as necessary
> this sentence should be where you explain the "one cnxn per fs" requirement
I removed mention of the requirement. See my comment below for explanation.


Line 774: the connections
> "one connection per filesystem"
Done


Line 776: we don't want to open multiple
        :     // connections to the same filesystem. Doing so could cause inconsistencies
in the
        :     // filesystem and return with errors
> this is vague - what could go wrong?
I spent all afternoon trying to recreate the errors that I saw earlier, so that I could write
a better description of what could go wrong if we use multiple connections to a FS. However,
I wasn't able to reproduce the errors, which leads me to believe that the errors I saw before
were probably because of some other bugs which I fixed later.
So I'm removing mention of the "inconsistencies", however, it's still better to cache connections
(so that we're not unnecessarily creating too many connections at once).


Line 772: InsertStmt allows to insert in tables that have partitions over multiple filesystems.
        :     // So we need to open connections to different filesystems as necessary.
        :     // We use a local connection cache and populate it with the connections to the
necessary
        :     // filesystems before we do the actual operations (copy, move, delete, etc.).
This is
        :     // because the operations are done in parallel and we don't want to open multiple
        :     // connections to the same filesystem. Doing so could cause inconsistencies
in the
        :     // filesystem and return with errors.
> This comment as a whole would be better before line 754.
I've removed this comment. See CR comment below.


Line 779: partition_connection
> partition_fs_connection (or just fs_connection)
Done


Line 780: RETURN_IF_ERROR
> does this leave behind any cleanup that needs to be done?
No the context of everything that happens in this loop is local. And none of the operations
actually start until after this loop execution is completed. So returning on an error here
would not leave any uncleaned state.


Line 798: 
> this blank line can go
Done


Line 799: bool is_s3_path = false;
        :     if (IsS3APath(part_path.c_str())) is_s3_path = true;
> write this as:
Done


Line 823:  
> indentation (not your change)
isn't it already 2 tabs?


Line 846: !is_s3_path
> move this into the if on line 843, save a level of nesting
Done


Line 864: if (!is_s3_path
> again, combine with above line
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.cc
File be/src/util/hdfs-bulk-ops.cc:

Line 55: Status::OK();
> combine declaration with line 57
Done


Line 60: if (connection_status.ok()) {
> instead of this indentation, make a method call "AddError(const string& err
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-bulk-ops.h
File be/src/util/hdfs-bulk-ops.h:

Line 120: void set_local_connection_cache(HdfsFsCache::HdfsFsMap* local_connection_cache)
{
        :     DCHECK(local_connection_cache != NULL);
        :     local_connection_cache_ = local_connection_cache;
        :   }
> how about just making this a parameter to Execute() - is it needed anywhere
No, I've removed this and moved it to Execute().


Line 137: local
> what does it mean to be 'local'? Who owns this object?
I see the confusion. I've removed mentions of 'local' and renamed this field. It's local with
respect to the process wide global connection cache. But this class doesn't care about that.
Also, it's not owned. The caller should supply the connection cache.


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.cc
File be/src/util/hdfs-util.cc:

Line 76: const char* src, const char* to_check
> does this ever get called with char*s that aren't derived from string objec
Not currently, no.


Line 77: scheme of our filesystem support
> grammar
Done


Line 79: strncmp
> this needs fixing. You need to extract the scheme (by searching for "://" I
Done


Line 82: bool SpansMultipleFilesystems(const char* pathA, const char* pathB) {
       :   return !IsPathOnFilesystem(pathA, pathB);
       : }
> remove this method. Have the caller just call IsPathOnFilesystem().
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/be/src/util/hdfs-util.h
File be/src/util/hdfs-util.h:

Line 51: to_check
> 'to_check' doesn't really tell me what this is (and the type doesn't help).
Done


Line 51: IsPathOnFilesystem
> All paths are on some filesystem. Call this "FilesystemsMatch" or something
I've accommodated for that now. Currently, our default FS assumption is HDFS. So if no scheme
is present, then we assume it's a HDFS path.
We need to change our defaultFS assumptions once IMPALA-1850 goes in. I'll talk to Anuj about
that and make sure this patch also does away with that assumption.


http://gerrit.cloudera.org:8080/#/c/2574/5/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

Line 390:   // Full path to the base directory for this partition.
> be explicit about what this includes. Does it have the scheme? Is it a path
Done


http://gerrit.cloudera.org:8080/#/c/2574/5/fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
File fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java:

Line 173:       destFs.rename(sourceFile, destFile);
> for simplicity, return here and then save the else { } block below
Done


Line 175: isDestDfs && sameFileSystem
> what you actually mean here is if (!arePathsInSameEncryptionZone(...)) whic
I didn't do it this way earlier because if we called arePathsInSameEncryptionZone() for S3,
it would crash. But I've added a check in arePathsInSameEncryptionZone() to see if the paths
are on HDFS and return false otherwise.


Line 187:           true, true, CONF);
> can you get more of this on the previous line?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2574
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Sailesh Mukil <sailesh@cloudera.com>
Gerrit-Reviewer: Henry Robinson <henry@cloudera.com>
Gerrit-Reviewer: Michael Brown <mikeb@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sailesh@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message