hive-issues mailing list archives

From "Abdullah Yousufi (JIRA)" <>
Subject [jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation
Date Mon, 15 Aug 2016 23:48:20 GMT


Abdullah Yousufi commented on HIVE-14165:

So I did try the listFiles() optimization locally and modified Hive to call the function on
the root directory of a partitioned table. While this does give a speedup for a select * query
on a partitioned table, this approach is not really extensible to queries that do partition
elimination, since in those cases it makes sense to just pass in the relevant partitions,
as Hive currently does.
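
To make the trade-off concrete, here is a rough sketch that uses the local filesystem as a stand-in. It is not Hive or Hadoop code: `Files.walk` only plays the role of a recursive listFiles() call on the table root, and the class name, helper names, and the `ds=<n>/part-0` layout are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Local-filesystem stand-in, NOT Hive or Hadoop code.
class PartitionListingSketch {

    // One recursive listing from the table root: returns the files of every partition.
    static List<Path> listFromRoot(Path tableRoot) throws IOException {
        try (Stream<Path> paths = Files.walk(tableRoot)) {
            return paths.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
    }

    // One listing per selected partition: returns only the files the query needs.
    static List<Path> listSelectedPartitions(List<Path> partitions) throws IOException {
        List<Path> files = new ArrayList<>();
        for (Path partition : partitions) {
            try (Stream<Path> paths = Files.list(partition)) {
                paths.filter(Files::isRegularFile).sorted().forEach(files::add);
            }
        }
        return files;
    }

    // Builds a hypothetical table layout: <root>/ds=<n>/part-0 for n in [0, partitionCount).
    static Path createSampleTable(int partitionCount) throws IOException {
        Path root = Files.createTempDirectory("table");
        for (int i = 0; i < partitionCount; i++) {
            Path partition = Files.createDirectory(root.resolve("ds=" + i));
            Files.write(partition.resolve("part-0"), new byte[] {1});
        }
        return root;
    }

    public static void main(String[] args) throws IOException {
        Path root = createSampleTable(4);
        // select *: every partition is needed, so a single root listing covers it.
        System.out.println("root listing: " + listFromRoot(root).size() + " files");
        // Partition elimination kept only ds=2: a root listing would still return all files.
        List<Path> pruned = Collections.singletonList(root.resolve("ds=2"));
        System.out.println("pruned listing: " + listSelectedPartitions(pruned).size() + " files");
    }
}
```

With partition elimination, the root listing still enumerates files from partitions the query never reads, which is why passing in only the relevant partitions remains the better fit there.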

I'm thinking it might make sense to remove the following list call in Hive in the case of
S3 partitioned tables, since the listing for the split computation is going to happen later
anyway in Hadoop's
if (fs.exists(currPath)) {
  for (FileStatus fStat : listStatusUnderPath(fs, currPath)) {
    if (fStat.getLen() > 0) {
      return true;
    }
  }
}

My question is whether it sounds good to remove this check. It seems that errors may be returned if the partition directory does not have any files,
but is there a better way to handle that?
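
For reference, the effect of the check can be sketched on the local filesystem as below. This is only an illustration under stated assumptions, not the Hive implementation: the class and method names are hypothetical, and `java.nio.file` stands in for the Hadoop FileSystem calls. A missing or empty directory simply yields false here rather than an error.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Local-filesystem stand-in, NOT the Hive implementation.
class NonEmptyPartitionSketch {

    // Returns true if the directory exists and holds at least one file with length > 0.
    // A missing or empty directory yields false instead of an error.
    static boolean hasNonEmptyFile(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.anyMatch(p -> Files.isRegularFile(p) && p.toFile().length() > 0);
        }
    }

    public static void main(String[] args) throws IOException {
        Path empty = Files.createTempDirectory("ds-empty");
        Path full = Files.createTempDirectory("ds-full");
        Files.write(full.resolve("part-0"), new byte[] {1, 2, 3});
        System.out.println("empty partition has data: " + hasNonEmptyFile(empty));
        System.out.println("full partition has data: " + hasNonEmptyFile(full));
    }
}
```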

> Enable faster S3 Split Computation
> ----------------------------------
>                 Key: HIVE-14165
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Abdullah Yousufi
>            Assignee: Abdullah Yousufi
> Split size computation may be improved by the optimizations for listFiles() in HADOOP-13208

