hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <>
Subject [jira] [Commented] (HIVE-15546) Optimize Utilities.getInputPaths()
Date Thu, 05 Jan 2017 22:33:58 GMT


Sahil Takiar commented on HIVE-15546:

The code states:

If any input path points to an empty table or partition a dummy file in the scratch dir is
instead created and added to the list. This is needed to avoid special casing the operator
pipeline for these cases.

I am going to try simply removing this code and see what failures pop up.

> Optimize Utilities.getInputPaths()
> ----------------------------------
>                 Key: HIVE-15546
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
> When running on blobstores (like S3) where metadata operations (like listStatus) are costly, Utilities.getInputPaths() can add significant overhead when setting up the input paths for an MR / Spark / Tez job.
> The method performs a listStatus on all input paths in order to check if the path is empty. If the path is empty, a dummy file is created for the given partition. This is all done sequentially. This can be really slow when there are a lot of empty partitions. Even when all partitions have input data, this can take a long time.
> We should either:
> (1) Just remove the logic to check if each input path is empty, and handle any edge cases
> (2) Multi-thread the listStatus calls
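
A rough sketch of what option (2) could look like, for illustration only. This is not the Hive implementation: the class ParallelEmptyCheck, the method findEmptyPaths, and the isEmpty predicate are hypothetical names, and the predicate stands in for whatever listStatus-based check the real code performs. Only java.util.concurrent from the JDK is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

public class ParallelEmptyCheck {

    // Hypothetical helper (not part of Hive): returns the paths for which
    // isEmpty reports true. The checks run concurrently on a fixed-size pool
    // instead of one after another; input order is preserved in the result.
    public static <P> List<P> findEmptyPaths(List<P> paths, Predicate<P> isEmpty, int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // Submit one emptiness check per input path.
            List<Future<Boolean>> results = new ArrayList<>();
            for (P path : paths) {
                Callable<Boolean> check = () -> isEmpty.test(path);
                results.add(pool.submit(check));
            }
            // Collect results in submission order; get() blocks until each check finishes.
            List<P> empty = new ArrayList<>();
            for (int i = 0; i < paths.size(); i++) {
                if (results.get(i).get()) {
                    empty.add(paths.get(i));
                }
            }
            return empty;
        } finally {
            pool.shutdown();
        }
    }
}
```

With a blobstore, the predicate would wrap the expensive metadata call, so the total wall-clock time is bounded roughly by the slowest batch of calls rather than the sum of all of them. The pool size is an assumption to tune; too many threads may run into request throttling on the store.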
