hive-issues mailing list archives

From "Hive QA (JIRA)" <>
Subject [jira] [Commented] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs
Date Wed, 28 Nov 2018 00:37:00 GMT


Hive QA commented on HIVE-20330:

Here are the results of testing the latest attachment:

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:green}SUCCESS:{color} +1 due to 15553 tests passed

Test results:
Console output:
Test logs:

Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase

This message is automatically generated.

ATTACHMENT ID: 12949747 - PreCommit-HIVE-Build

> HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs
> -------------------------------------------------------------------------------------
>                 Key: HIVE-20330
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>         Attachments: HIVE-20330.0.patch, HIVE-20330.1.patch, HIVE-20330.2.patch, HIVE-20330.3.patch, HIVE-20330.4.patch, HIVE-20330.5.patch, HIVE-20330.6.patch
> While running performance tests on Pig (0.12 and 0.17) we've observed a huge performance drop in a workload that has multiple inputs from HCatLoader.
> The reason is that for a particular MR job with multiple Hive tables as input, Pig calls {{setLocation}} on each {{LoadFunc}} ({{HCatLoader}}) instance, but only one table's information (an {{InputJobInfo}} instance) gets tracked in the JobConf, under the config key {{HCatConstants.HCAT_KEY_JOB_INFO}}.
> Any such call overwrites preexisting values, and thus only the last table's information will be considered when Pig calls {{getStatistics}} to estimate the required reducer count.
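
The overwrite described above can be illustrated with a toy model. This is only a sketch, not Hive or Pig code: a plain dict stands in for the JobConf, `JOB_INFO_KEY` is a placeholder for whatever string {{HCatConstants.HCAT_KEY_JOB_INFO}} resolves to, and `set_location` is a hypothetical stand-in for what each loader instance records.

```python
# Toy model only: a dict stands in for the JobConf, and the key below is a
# placeholder, not the real value of HCatConstants.HCAT_KEY_JOB_INFO.
JOB_INFO_KEY = "hcat.job.info.placeholder"


def set_location(job_conf, table_name, size_bytes):
    """Mimic one loader instance recording its input table in the shared conf."""
    # Every loader writes under the same key, so an earlier entry is replaced.
    job_conf[JOB_INFO_KEY] = {"table": table_name, "size_bytes": size_bytes}


job_conf = {}
set_location(job_conf, "big_table", 256 * 1024**3)  # 256GB input
set_location(job_conf, "small_table", 1024**2)      # 1MB input, replaces the entry above

# Only the table registered last is still visible to later readers of the conf.
tracked = job_conf[JOB_INFO_KEY]
print(tracked["table"])
```

Under this model, anything that later reads the single key sees only the last table, regardless of how many inputs the job has.
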
> In cases when there are 2 input tables, 256GB and 1MB in size respectively, Pig will query the size information from HCat for both of them, but it will see either 1MB+1MB=2MB or 256GB+256GB=0.5TB, depending on input order in the execution plan's DAG.
> It should of course see 256.00097GB in total and use 257 reducers by default accordingly.
> In unlucky cases this will be seen as 2MB and 1 reducer will have to struggle with the actual 256.00097GB...
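
The arithmetic in the description can be checked with a short sketch. It assumes roughly 1GB of input per reducer in binary units, which is what the 257 figure implies; `estimate_reducers` and `BYTES_PER_REDUCER` are hypothetical names for illustration, not Pig's actual estimator or settings.

```python
import math

GB = 1024**3
MB = 1024**2
BYTES_PER_REDUCER = GB  # assumption: about 1GB of input per reducer


def estimate_reducers(input_sizes):
    """Return ceil(total input bytes / bytes per reducer), with a minimum of 1."""
    total = sum(input_sizes)
    return max(1, math.ceil(total / BYTES_PER_REDUCER))


big, small = 256 * GB, 1 * MB

correct = estimate_reducers([big, small])        # both real sizes are counted
small_last = estimate_reducers([small, small])   # 1MB table's info seen twice
big_last = estimate_reducers([big, big])         # 256GB table's info seen twice

print(correct, small_last, big_last)
```

With these assumptions the correct total gives 257 reducers, while the two faulty readings give 1 and 512, so the job is either badly under-provisioned or over-provisioned depending on input order.
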

This message was sent by Atlassian JIRA
