hadoop-hive-dev mailing list archives

From "Ashish Thusoo (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-318) [Hive] union all queries broken - all kinds of problems
Date Mon, 16 Mar 2009 19:25:50 GMT

    [ https://issues.apache.org/jira/browse/HIVE-318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682418#action_12682418 ]

Ashish Thusoo commented on HIVE-318:
------------------------------------

Looked at this in a lot more detail with Namit. The following are the review comments:

1. The state maintained in the union operator context can be moved to the ParseContext to
be consistent with the model that we have today.
2. The init and state code can be moved to Operator.java and the reset logic can be refactored
to work on those states. There is no need for another reinit state. Init after close should
be transparently allowed.
3. We can change the plan to generate two different file sink operators on the parents of
the union operator while breaking the plan into map/reduce jobs. If we follow that strategy, we
can undo the changes to FileSinkOperator and remove the special-case code.
4. Please check the indentation in UnionProcessor.java.
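
To illustrate comment 2, here is a minimal sketch of keeping init/close state at the operator level so that repeated calls are harmless and init after close just works. SketchOperator, its State enum, and the counters are hypothetical names for illustration only; this is not the code in Operator.java or in the patch, and it ignores details such as waiting for all parents before closing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an operator; not Hive's actual Operator class. */
class SketchOperator {
    /** Lifecycle state kept at the operator level (assumed names). */
    enum State { UNINIT, INIT, CLOSED }

    private State state = State.UNINIT;
    private final List<SketchOperator> children = new ArrayList<>();

    // Counters exist only so the behaviour can be observed in a test.
    int initCount = 0;
    int closeCount = 0;

    void addChild(SketchOperator child) {
        children.add(child);
    }

    State getState() {
        return state;
    }

    /** Initializes this operator once, even if several parents call it. */
    void initialize() {
        if (state == State.INIT) {
            return; // already initialized through another parent
        }
        // Reached from UNINIT or CLOSED, so init after close needs no extra state.
        state = State.INIT;
        initCount++;
        for (SketchOperator child : children) {
            child.initialize();
        }
    }

    /** Closes this operator once; further calls are no-ops. */
    void close() {
        if (state != State.INIT) {
            return; // never initialized, or already closed
        }
        state = State.CLOSED;
        closeCount++;
        for (SketchOperator child : children) {
            child.close();
        }
    }
}
```

With two parents feeding one union-like child, initializing both parents initializes the child once, and a later initialize() after close() simply moves it back to INIT.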

> [Hive] union all queries broken - all kinds of problems
> -------------------------------------------------------
>
>                 Key: HIVE-318
>                 URL: https://issues.apache.org/jira/browse/HIVE-318
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>            Priority: Blocker
>         Attachments: hive.318.2.patch, hive.318.3.patch, hive.318.4.patch, hive.318.patch
>
>
> 1. Map-only job : same input
>    Hangs because the mapper tries to open the same file twice, and the Hadoop filesystem complains.
>    Fix: Only initialize once - keep state at the Operator level for this. Should
>    do the same for Close.
> 2. Map-only job : different inputs
>    Loss of data due to rename.
>    Fix: change rename to move files to the directory.
> 3. Map-only job in subquery + RedSink: works currently
> 4. 2 variables: so 4 sub-cases
>    Number of sub-queries having map-reduce jobs. (1/2)
>    Operator after Union (RS/FS)
>    
> a.   Number of sub-queries having map-reduce jobs. 1
>      Operator after Union: RS
>      Can be done in 2MR - really difficult with current infrastructure.
>      Should do with 3 MR jobs now - break on top of UNION. 
>      Future optimization: move operators between Union and RS before Union.
> b.   Number of sub-queries having map-reduce jobs. 2
>      Operator after Union: RS
>      Needs 3MR - Should do with 3 MR jobs - break on top of UNION. 
>      Future optimization: move operators between Union and RS before Union.
> c.   Number of sub-queries having map-reduce jobs. 1
>      Operator after Union: FS
>      Can be done in 1MR - really difficult with current infrastructure.
>      Can be easily done with 2 MR by removing UNION and cloning operators between Union
>      and FS.
>      Should do with 3 MR jobs now - break on top of UNION.
>      Followup optimization: 2MR should be able to handle this.
> d.   Number of sub-queries having map-reduce jobs. 2
>      Operator after Union: FS
>      Can be easily done with 2 MR by removing UNION and cloning operators between Union
>      and FS.
>      Should do with 3 MR jobs now - break on top of UNION.
>      Followup optimization: 2MR should be able to handle this.
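
On case 2 of the description above (loss of data due to rename), one plausible reading is that each union branch renames its own output directory onto the same final directory, so one branch's output displaces the other's. The sketch below shows the alternative of moving the individual files into the directory. It uses java.nio.file on a local filesystem purely as an analogy; the real fix would go through the Hadoop filesystem API, and MoveIntoDirectory and moveFiles are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; a local-filesystem analogy, not Hive code. */
class MoveIntoDirectory {
    /**
     * Moves every entry of srcDir into destDir and returns how many were moved.
     * Entries already in destDir are left alone, so several branches can
     * contribute files to the same destination.
     */
    static int moveFiles(Path srcDir, Path destDir) throws IOException {
        Files.createDirectories(destDir);

        // Snapshot the listing first so we do not move files while iterating.
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(srcDir)) {
            for (Path f : stream) {
                files.add(f);
            }
        }

        for (Path f : files) {
            // Without REPLACE_EXISTING, a name clash throws
            // FileAlreadyExistsException instead of silently overwriting.
            Files.move(f, destDir.resolve(f.getFileName()));
        }
        return files.size();
    }
}
```

This assumes the branches write distinctly named files; if two branches could produce the same file name, the move fails loudly rather than dropping data.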

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

