impala-dev mailing list archives

From "Sailesh Mukil (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-2.6.0_5.8.0) IMPALA-3577, IMPALA-3486: Partitions on multiple filesystems breaks with S3_SKIP_INSERT_STAGING
Date Mon, 23 May 2016 23:26:54 GMT
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3577, IMPALA-3486: Partitions on multiple filesystems breaks with S3_SKIP_INSERT_STAGING
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3146/7/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 298:   RETURN_IF_ERROR(HdfsFsCache::instance()->GetConnection(
> it's not obvious why this is correct.  What's wrong with the old location, 
The problem arises as follows. In BuildHdfsFileNames(), the temporary file name and the final
file name get their schemes from different sources:

tmp_hdfs_file_name_prefix:
https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-table-sink.cc#L258

where staging_dir_ gets its scheme from the base dir:
https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-table-sink.cc#L137

If we explicitly specify a partition location, final_hdfs_file_name_prefix gets its scheme from that location:
https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-table-sink.cc#L271


So, the tmp file name gets its location from the base table and the final file name gets its
location from the user-specified location (if specified).
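
To make the mismatch concrete, here is a minimal sketch of the name building described above. It is not the actual BuildHdfsFileNames() code: PartitionInfo, FileNamePrefixes, BuildPrefixes() and the "/_staging" directory name are hypothetical stand-ins used only for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for what the sink knows about one partition.
struct PartitionInfo {
  std::string base_table_dir;     // e.g. "hdfs://nn/warehouse/t"
  std::string explicit_location;  // empty if the user did not specify one
  std::string partition_name;     // e.g. "p=1"
};

// Hypothetical holder for the two prefixes discussed in the comment.
struct FileNamePrefixes {
  std::string tmp_prefix;
  std::string final_prefix;
};

FileNamePrefixes BuildPrefixes(const PartitionInfo& p) {
  assert(!p.base_table_dir.empty());
  FileNamePrefixes out;
  // The staging dir always lives under the base table dir, so the tmp
  // prefix inherits the base table's scheme ("/_staging" is an assumed name).
  const std::string staging_dir = p.base_table_dir + "/_staging";
  out.tmp_prefix = staging_dir + "/" + p.partition_name + "/";
  // The final prefix follows the explicit partition location when one is
  // given, so its scheme can differ from the base table's.
  if (!p.explicit_location.empty()) {
    out.final_prefix = p.explicit_location + "/";
  } else {
    out.final_prefix = p.base_table_dir + "/" + p.partition_name + "/";
  }
  return out;
}
```

For a table at "hdfs://nn/warehouse/t" with a partition located at "s3a://bucket/p1", this sketch yields a tmp prefix on HDFS and a final prefix on S3, which is the situation the rest of this comment is about.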

However, in previous patchsets, we got the connection at L389 (just after the call to BuildHdfsFileNames()),
and at that point we can only get it based on either tmp_hdfs_file_name_prefix or final_hdfs_file_name_prefix.

If we choose to get a connection to 'tmp_hdfs_file_name_prefix' for a table on HDFS with a
partition on S3, and skip insert staging for S3, we will be trying to write to S3 with a connection
to HDFS. (Because tmp_hdfs_file_name_prefix always points to the base table.)

If we choose to get a connection to 'final_hdfs_file_name_prefix' for a table on HDFS with
a partition on S3, and we do not skip insert staging for S3, we will be trying to write to
HDFS (as the staging dir will be on HDFS) with a connection to S3.

So the only option I saw was to get the connection to "current_file_name" as that is the final
file we end up writing to.
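
A rough sketch of that choice, under stated assumptions: SchemeOf(), CurrentFileName() and ConnectionSchemeFor() are hypothetical helpers, "s3a" is assumed as the S3 scheme, and the real code resolves an actual connection through HdfsFsCache rather than returning a scheme string.

```cpp
#include <cassert>
#include <string>

// Returns the scheme of a path ("hdfs" for "hdfs://nn/x"), or "" if none.
std::string SchemeOf(const std::string& path) {
  const std::string::size_type pos = path.find("://");
  return pos == std::string::npos ? std::string() : path.substr(0, pos);
}

// Picks the prefix we actually write to: the final location when it is on S3
// and staging is skipped for S3, otherwise the staging (tmp) location.
std::string CurrentFileName(const std::string& tmp_prefix,
                            const std::string& final_prefix,
                            bool skip_s3_staging) {
  const bool final_on_s3 = SchemeOf(final_prefix) == "s3a";
  return (final_on_s3 && skip_s3_staging) ? final_prefix : tmp_prefix;
}

// Resolves the filesystem from the path being written, not from either
// prefix chosen in advance, so the connection always matches the write.
std::string ConnectionSchemeFor(const std::string& tmp_prefix,
                                const std::string& final_prefix,
                                bool skip_s3_staging) {
  const std::string current =
      CurrentFileName(tmp_prefix, final_prefix, skip_s3_staging);
  assert(!current.empty());
  return SchemeOf(current);
}
```

With a tmp prefix on HDFS and a final prefix on S3, this picks an S3 connection when staging is skipped and an HDFS connection when it is not, covering both failure cases described above.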


-- 
To view, visit http://gerrit.cloudera.org:8080/3146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib13b610eb9efb68c83894786cea862d7eae43aa7
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-2.6.0_5.8.0
Gerrit-Owner: Sailesh Mukil <sailesh@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sailesh@cloudera.com>
Gerrit-HasComments: Yes
