hive-issues mailing list archives

From "Thomas Poepping (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15546) Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
Date Thu, 19 Jan 2017 00:03:26 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829016#comment-15829016 ]

Thomas Poepping commented on HIVE-15546:
----------------------------------------

I see how that could make sense -- just have the executor treat the empty partition as it
would any other, by getting all files and parsing them. The only difference is that, for an
empty partition, an empty file is used.

Seems fine to me. I also took a look at the RB, no problems there. Non-binding +1 from me.

> Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-15546
>                 URL: https://issues.apache.org/jira/browse/HIVE-15546
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>         Attachments: HIVE-15546.1.patch, HIVE-15546.2.patch, HIVE-15546.3.patch, HIVE-15546.4.patch, HIVE-15546.5.patch
>
>
> When running on blobstores (like S3) where metadata operations (like listStatus) are
> costly, Utilities.getInputPaths() can add significant overhead when setting up the input paths
> for an MR / Spark / Tez job.
> The method performs a listStatus on all input paths in order to check if the path is
> empty. If the path is empty, a dummy file is created for the given partition. This is all
> done sequentially. This can be really slow when there are a lot of empty partitions. Even
> when all partitions have input data, this can take a long time.
> We should either:
> (1) Just remove the logic to check if each input path is empty, and handle any edge cases
> accordingly.
> (2) Multi-thread the listStatus calls
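
Option (2) could be sketched roughly as follows. This is not the code from the attached patches: it is a minimal illustration that assumes a local filesystem, uses java.nio.file.Files.list as a stand-in for Hadoop's listStatus, and introduces a hypothetical class ParallelEmptyCheck with hypothetical methods isEmpty and checkEmpty. Each directory listing is submitted to a fixed-size thread pool so the calls overlap instead of running one after another.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hive: checks many directories for emptiness in parallel.
public class ParallelEmptyCheck {

    // One listing call per directory; Files.list stands in for FileSystem.listStatus.
    public static boolean isEmpty(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return !entries.findAny().isPresent();
        }
    }

    // Lists every directory on a fixed-size pool and maps each one to "is it empty?".
    // The result keeps the order of the input list.
    public static Map<Path, Boolean> checkEmpty(List<Path> dirs, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per directory so the listing calls run concurrently.
            List<Future<Boolean>> futures = new ArrayList<>();
            for (Path dir : dirs) {
                futures.add(pool.submit(() -> isEmpty(dir)));
            }
            // Collect in submission order; get() rethrows any listing failure
            // wrapped in an ExecutionException.
            Map<Path, Boolean> result = new LinkedHashMap<>();
            for (int i = 0; i < dirs.size(); i++) {
                result.put(dirs.get(i), futures.get(i).get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would pass the partition directories and a pool size, then create the dummy file only for entries mapped to true. The pool size is a tuning choice: on a blobstore the calls are latency-bound, so more threads than cores may be reasonable, but that is an assumption to validate rather than a recommendation from this thread.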



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
