hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marta Kuczora <kuczo...@cloudera.com>
Subject Re: Review Request 60432: HIVE-16845: INSERT OVERWRITE a table with dynamic partitions on S3 fails with NPE
Date Thu, 13 Jul 2017 15:00:55 GMT


> On July 11, 2017, 11:14 p.m., Sahil Takiar wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java
> > Lines 305-308 (patched)
> > <https://reviews.apache.org/r/60432/diff/2/?file=1763797#file1763797line305>
> >
> >     Is the if statement necessary? Is there a problem with always doing this?
> >     
> >     I think it would make more sense to always do this. Right now the code assumes
`mvTask.getWork() == mrAnvMvTask.getChildTasks().get(0).getWork()` which doesn't seem like
a smart idea. If `mvAndMvTask` is the tasks that is going to be added to `resTask` then it
would make more sense to just use the `MoveWork` from `mvAndMvTask.getChildTasks.get(0)`
> 
> Marta Kuczora wrote:
>     Thanks a lot for the review Sahil!
>     I absolutely agree with you that it would be better to get the move task from mergeAndMoveMoveTask
always. The reason why I added the if is that I wasn't 100% sure if there cannot be any use-case
when this approach is not working. So I wanted to be cautious and only change the behavior
for the blobstore optimization use-case and leave it as it was for the other use-cases. But
if you think that this caution is not necessary, I an happy to remove the if condition. 
>     Do you think it would make sense to upload a patch where I always use the child task
and see if all tests are successful in the pre-commit run?

Updated the patch accordingly and the pre-commit tests were successful except for 5 tests
which are failing for a long time.


- Marta


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60432/#review180254
-----------------------------------------------------------


On July 13, 2017, 2:58 p.m., Marta Kuczora wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60432/
> -----------------------------------------------------------
> 
> (Updated July 13, 2017, 2:58 p.m.)
> 
> 
> Review request for hive and Sergio Pena.
> 
> 
> Bugs: HIVE-16845
>     https://issues.apache.org/jira/browse/HIVE-16845
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The following steps lead to the NPE in the ConditionalResolverMergeFiles.generateActualTasks
method:
> 
> In the GenMapRedUtils.createCondTask method, the tasks for the merge, move and "merge
and move" use cases are created and set as task list to the ConditionalWork. Originally the
moveOnlyMoveTask and the mergeAndMoveMoveTask was created from the same moveWork, which was
the dummyWork created like this in the createMRWorkForMergingFiles method:
> 
>     MoveWork dummyMv = new MoveWork(null, null, null,
>          new LoadFileDesc(fsInputDesc.getFinalDirName(), finalName, true, null, null),
false);
> 
> 
> Then in the ConditionalResolverMergeFiles.generateActualTasks method we get these tasks
and use them to create result "resTsks" list.
> 
> For the "merge and move" use case, the code looks like this:
> 
>     if (toMove.size() > 0) {
>         resTsks.add(mrAndMvTask);
> 
>         MoveWork mvWork = (MoveWork) mvTask.getWork();
>         LoadFileDesc lfd = mvWork.getLoadFileWork();
>         
>         ...
>         
>         LoadMultiFilesDesc lmfd = new LoadMultiFilesDesc(toMove,
>             targetDirs, lfd.getIsDfsDir(), lfd.getColumns(), lfd.getColumnTypes());
>         mvWork.setLoadFileWork(null);
>         mvWork.setLoadTableWork(null);
>         mvWork.setMultiFilesDesc(lmfd);
>       }
> 
> It adds the mrAndMvTask task to the resTsks list and modifies the move work to move all
necessary files in one-step. The mrAndMvTask contains a move task as child task, which is
the same as the mvWork work. 
> 
> With the blobstore optimization on, the moveOnlyMoveTask task is created from a different
move work, not from the dummyMoveWork as before:
> 
>     MoveWork workForMoveOnlyTask;
>     if (shouldMergeMovePaths) {
>       workForMoveOnlyTask = mergeMovePaths(condInputPath, moveTaskToLink.getWork());
>     } else {
>       workForMoveOnlyTask = dummyMoveWork;
>     }
> 
>     ...
> 
>     Task<? extends Serializable> mergeOnlyMergeTask = TaskFactory.get(mergeWork,
conf);
>     Task<? extends Serializable> moveOnlyMoveTask = TaskFactory.get(workForMoveOnlyTask,
conf);
>     Task<? extends Serializable> mergeAndMoveMergeTask = TaskFactory.get(mergeWork,
conf);
>     Task<? extends Serializable> mergeAndMoveMoveTask = TaskFactory.get(dummyMoveWork,
conf);
> 
> Because of this the mvWork in the ConditionalResolverMergeFiles.generateActualTasks method
will also be different. It has the LoadTableDesc variable set and not the LoadFileDesc, that
causes the NPE.
> 
> When the blobstore optimization is on and the move work is changed, we should use the
child move task of the mrAndMvTask in the generateActualTasks method, instead of the mvTask.
Not just to avoid the NPE, but because this is the correct move task for the "merge and move"
use case.
> 
> 
> Diffs
> -----
> 
>   HIVE-16845.3.patch PRE-CREATION 
>   itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_merge_move.q
PRE-CREATION 
>   itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_merge_only.q
PRE-CREATION 
>   itests/hive-blobstore/src/test/queries/clientpositive/insert_overwrite_dynamic_partitions_move_only.q
PRE-CREATION 
>   itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_merge_move.q.out
PRE-CREATION 
>   itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_merge_only.q.out
PRE-CREATION 
>   itests/hive-blobstore/src/test/results/clientpositive/insert_overwrite_dynamic_partitions_move_only.q.out
PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 4266569

> 
> 
> Diff: https://reviews.apache.org/r/60432/diff/3/
> 
> 
> Testing
> -------
> 
> Created q tests for the move-only, merge-only and move-and-merge use cases. The tests
do an "insert overwiew" with blobstore optimizations on and off.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message