hive-issues mailing list archives

From "Vihang Karajgaonkar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
Date Fri, 30 Sep 2016 19:49:20 GMT

    [ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536870#comment-15536870 ]

Vihang Karajgaonkar commented on HIVE-14864:
--------------------------------------------

Unfortunately, the documentation of ContentSummary.getLength() only says that it returns the
length :)

The implementation of getContentSummary() in FileSystem.java, shown below, suggests that
getLength() actually returns the sum of the lengths of all files within that directory,
including files in nested subdirectories, because the method recurses into each one.

{noformat}
  public ContentSummary getContentSummary(Path f) throws IOException {
    FileStatus status = getFileStatus(f);
    if (status.isFile()) {
      // f is a file
      return new ContentSummary(status.getLen(), 1, 0);
    }
    // f is a directory
    long[] summary = {0, 0, 1};
    for(FileStatus s : listStatus(f)) {
      ContentSummary c = s.isDirectory() ? getContentSummary(s.getPath()) :
                                     new ContentSummary(s.getLen(), 1, 0);
      summary[0] += c.getLength();
      summary[1] += c.getFileCount();
      summary[2] += c.getDirectoryCount();
    }
    return new ContentSummary(summary[0], summary[1], summary[2]);
  }
{noformat}
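
To make the recursion concrete, here is a minimal sketch that mirrors the same logic on a tiny in-memory tree. It does not use Hadoop: the Node class and the summarize() method are hypothetical stand-ins for FileStatus and getContentSummary(), and the file sizes are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of the recursion; none of this is Hadoop API. */
public class ContentSummarySketch {

    /** Stand-in for a FileStatus: a directory carries length 0 of its own. */
    static final class Node {
        final boolean directory;
        final long len;
        final List<Node> children = new ArrayList<>();

        private Node(boolean directory, long len) {
            this.directory = directory;
            this.len = len;
        }

        static Node file(long len) {
            return new Node(false, len);
        }

        static Node dir(Node... kids) {
            Node d = new Node(true, 0L);
            for (Node k : kids) {
                d.children.add(k);
            }
            return d;
        }
    }

    /** Returns {length, fileCount, directoryCount}, following the quoted recursion. */
    static long[] summarize(Node n) {
        if (!n.directory) {
            return new long[] {n.len, 1, 0};
        }
        long[] summary = {0, 0, 1}; // the directory counts itself once
        for (Node child : n.children) {
            long[] c = summarize(child);
            summary[0] += c[0];
            summary[1] += c[1];
            summary[2] += c[2];
        }
        return summary;
    }

    public static void main(String[] args) {
        // src/ contains one file plus a subdirectory holding two more files.
        Node src = Node.dir(Node.file(100), Node.dir(Node.file(250), Node.file(50)));
        long[] s = summarize(src);
        System.out.println("own length = " + src.len);
        System.out.println("summary length = " + s[0] + ", files = " + s[1] + ", dirs = " + s[2]);
    }
}
```

In this model the directory's own length is 0 while the summarized length is 400 across 3 files and 2 directories, which is the gap between getLen() and getLength() that the issue is about.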

These are the relevant constructors for ContentSummary. The three-argument form sets
spaceConsumed to length and leaves both quotas at -1.
{noformat}
  /** Constructor */
  public ContentSummary(long length, long fileCount, long directoryCount) {
    this(length, fileCount, directoryCount, -1L, length, -1L);
  }

  public ContentSummary(
      long length, long fileCount, long directoryCount, long quota,
      long spaceConsumed, long spaceQuota) {
    this.length = length;
    this.fileCount = fileCount;
    this.directoryCount = directoryCount;
    this.quota = quota;
    this.spaceConsumed = spaceConsumed;
    this.spaceQuota = spaceQuota;
  }
{noformat}
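
Since getLength() appears to carry the recursive total, the size check in FileUtils.java could compare that value against HIVE_EXEC_COPYFILE_MAXSIZE instead of the directory's own length. The following is only a sketch of that decision, without Hadoop or Hive on the classpath: shouldRunDistCp() is a hypothetical helper, and the threshold and sizes are illustrative values, not taken from the issue.

```java
/** Hypothetical model of the distcp size check; not Hive's FileUtils code. */
public class DistCpCheckSketch {

    /**
     * Decides whether a distributed copy should be launched.
     * totalLength is assumed to come from srcFS.getContentSummary(src).getLength().
     */
    static boolean shouldRunDistCp(String scheme, long totalLength, long maxSize) {
        // Constant-first equals keeps the check safe if the scheme is null.
        return "hdfs".equals(scheme) && totalLength > maxSize;
    }

    public static void main(String[] args) {
        long maxSize = 32L * 1024 * 1024;      // illustrative threshold
        long dirOwnLen = 0L;                   // what getFileStatus(src).getLen() yields for a directory
        long dirTotalLen = 64L * 1024 * 1024;  // assumed recursive total for the same directory

        // With the directory's own length the check never fires.
        System.out.println(shouldRunDistCp("hdfs", dirOwnLen, maxSize));
        // With the recursive total it fires as intended.
        System.out.println(shouldRunDistCp("hdfs", dirTotalLen, maxSize));
    }
}
```

One thing to keep in mind with this approach: getContentSummary() walks the whole tree, so it costs more than a single getFileStatus() call on deep directories.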


> Distcp is not called from MoveTask when src is a directory
> ----------------------------------------------------------
>
>                 Key: HIVE-14864
>                 URL: https://issues.apache.org/jira/browse/HIVE-14864
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Vihang Karajgaonkar
>            Assignee: Vihang Karajgaonkar
>
> In FileUtils.java the following code does not get executed even when src directory size
> is greater than HIVE_EXEC_COPYFILE_MAXSIZE because srcFS.getFileStatus(src).getLen() returns 0
> when src is a directory. We should use srcFS.getContentSummary(src).getLength() instead.
> {noformat}
>     /* Run distcp if source file/dir is too big */
>     if (srcFS.getUri().getScheme().equals("hdfs") &&
>         srcFS.getFileStatus(src).getLen() > conf.getLongVar(HiveConf.ConfVars.HIVE_EXEC_COPYFILE_MAXSIZE))
>     {
>       LOG.info("Source is " + srcFS.getFileStatus(src).getLen() + " bytes. (MAX: " +
>           conf.getLongVar(HiveConf.ConfVars.HIVE_EXEC_COPYFILE_MAXSIZE) + ")");
>       LOG.info("Launch distributed copy (distcp) job.");
>       HiveConfUtil.updateJobCredentialProviders(conf);
>       copied = shims.runDistCp(src, dst, conf);
>       if (copied && deleteSource) {
>         srcFS.delete(src, true);
>       }
>     }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
