hadoop-hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-964) handle skewed keys for a join in a separate job
Date Fri, 15 Jan 2010 00:12:54 GMT

    [ https://issues.apache.org/jira/browse/HIVE-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800445#action_12800445
] 

Ning Zhang commented on HIVE-964:
---------------------------------

Some more comments:

1) RowContainer.java:134 and 207: can you define an enum in HiveConf and use that instead of
the string here?
2) RowContainer.java:147: the if condition should always be true because of the assertion on
line 144, so the if should be removed. Also, setSerDe doesn't need to set dummyRow, since the
caller (e.g., CommonJoinOperator) constructs the dummy row and passes it in through add().
Please take a look at add(), line 165.
3) Please move the variable declarations in lines 171-177 to the beginning of the class, where
most variables are declared, and add a brief comment on each of them.
4) The firstCalled boolean should be cleared in add(); otherwise the following sequence of
calls may give wrong results: add, first, add, next, next.
5) In first(), closeWriter() and closeReader() are called on every call to first(). This may
cause bad performance when the RowContainer is iterated many times and there is no
6) InputFormat in line 204. It could be very expensive if the RowContainer is iterated many
times.
7) Can you rename the variable originalReadBlock to firstBlock, which is easier to understand?
8) In nextBlock, Writable val is a new instance of serde for every new block. Can we reuse
the serde?
9) The key is inserted as the first element of each row before spillBlock and after nextBlock.
This is too expensive given that the row is an ArrayList. Zheng suggested using
UnionStructObjectInspector to handle the key and value separately.
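
To make point 4 concrete, here is a minimal sketch of the call sequence add, first, add, next, next.
SketchRowContainer is a hypothetical in-memory stand-in written only for this illustration; it is
not the RowContainer in the patch and does no spilling. It assumes that next() should refuse to
continue an iteration once add() has run since the last first(); the real class may prefer a
different way of handling that case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not Hive's RowContainer.
class SketchRowContainer<T> {
    private final List<T> rows = new ArrayList<>();
    // True only between a call to first() and the next call to add().
    private boolean firstCalled = false;
    // Index of the row that the next call to next() will return.
    private int cursor = 0;

    void add(T row) {
        rows.add(row);
        // Clearing the flag here is the change suggested in point 4:
        // any add() invalidates the iteration that is in progress.
        firstCalled = false;
    }

    // Restarts the iteration and returns the first row, or null if empty.
    T first() {
        firstCalled = true;
        cursor = 0;
        return next();
    }

    // Returns the next row, or null when the rows are exhausted.
    T next() {
        if (!firstCalled) {
            // Assumed behavior for this sketch: fail fast instead of
            // silently returning rows from a stale iteration.
            throw new IllegalStateException("call first() before next()");
        }
        return cursor < rows.size() ? rows.get(cursor++) : null;
    }
}
```

With this sketch, the sequence add, first, add, next fails at the call to next() instead of
continuing from a cursor that was set before the second add(). Calling first() again restarts the
iteration over both rows.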


> handle skewed keys for a join in a separate job
> -----------------------------------------------
>
>                 Key: HIVE-964
>                 URL: https://issues.apache.org/jira/browse/HIVE-964
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: He Yongqiang
>         Attachments: hive-964-2009-12-17.txt, hive-964-2009-12-28-2.patch, hive-964-2009-12-29-4.patch,
hive-964-2010-01-08.patch, hive-964-2010-01-13-2.patch
>
>
> The skewed keys can be written to a temporary table or file, and a follow-up conditional
> task can be used to perform the join on those keys.
> As a first step, JDBM can be used for those keys.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

