hive-dev mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-3997) Use distributed cache to cache/localize dimension table & filter it in map task setup
Date Fri, 15 Feb 2013 07:49:13 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13579018#comment-13579018 ]

Gopal V commented on HIVE-3997:
-------------------------------

All the map tasks run in a single wave, but in some of them the hashtable generation that
precedes the map-side work takes 2x as long as it did on the client node.

This is probably CPU starvation in the map tasks, caused by too many tasks running in parallel
on each node. I couldn't find a way to tune the map count per node down from 12 (the current
value), because the NodeManager does not seem to have a tunable for it (?).
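
One indirect lever, assuming a YARN cluster whose scheduler accounts only for memory and
rounds each request up to a multiple of its minimum allocation, is the per-map memory
request: asking for more memory per map task lowers how many fit on a node at once. The
sketch below is only that arithmetic. The function name and all numbers are hypothetical
and are not taken from the cluster discussed here.

```python
import math


def containers_per_node(node_memory_mb, container_request_mb, min_allocation_mb=1024):
    """Estimate how many map containers fit on one node at once (memory only)."""
    # Assumption: the scheduler rounds each request up to a multiple of its
    # minimum allocation before granting it.
    granted_mb = math.ceil(container_request_mb / min_allocation_mb) * min_allocation_mb
    # Whole containers only; leftover memory stays unused.
    return node_memory_mb // granted_mb


if __name__ == "__main__":
    # Hypothetical node offering 12 GB to containers.
    for request_mb in (1024, 1536, 2048):
        print(request_mb, containers_per_node(12 * 1024, request_mb))
```

Under these assumptions, doubling the per-map request halves the number of concurrent maps
per node, at the cost of leaving memory idle if the tasks do not actually need it.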
                
> Use distributed cache to cache/localize dimension table & filter it in map task setup
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-3997
>                 URL: https://issues.apache.org/jira/browse/HIVE-3997
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Gopal V
>            Assignee: Gopal V
>
> Hive clients are not always co-located with the Hadoop/HDFS cluster.
> This means that dimension table filtering, when done on the client side, becomes very
> slow. Not only that, the conversion of the small tables into hashtables has to be done
> every single time a query is run with different filters on the big table.
> That entire hashtable has to be part of the job, which involves even more HDFS writes
> from the far client side.
> Using the distributed cache also has the advantage that the localized files can be kept
> between jobs instead of firing off an HDFS read for every query.
> Moving the operator pipeline for the hash generation into the map task itself has perhaps
> a few cons.
> The map task might OOM because of this change, and recovery will take longer because
> every map attempt has to fail first, instead of the fallback being decided on the client.
> The client has no idea how much memory the hashtable needs and has to rely on the on-disk
> sizes (compressed sizes, perhaps) to determine whether it needs to fall back to a
> reduce-join instead.
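
To make the proposal concrete, here is a minimal, language-neutral sketch of the two ideas
in the description: building the filtered hashtable once in map task setup from an already
localized file, and a client-side guess at whether the table fits in memory. It is not
Hive's operator pipeline. The names `build_hashtable`, `MapJoinTask` and `should_map_join`,
the CSV layout, and the expansion factor are all assumptions made for illustration.

```python
import csv


def build_hashtable(lines, key_col, predicate):
    """Filter dimension rows and index the survivors by join key."""
    table = {}
    for row in csv.reader(lines):
        if predicate(row):
            table.setdefault(row[key_col], []).append(row)
    return table


class MapJoinTask:
    """Hypothetical map task that joins big-table rows against a small table."""

    def __init__(self, dim_lines, key_col, predicate):
        self.dim_lines = dim_lines  # stands in for the localized cache file
        self.key_col = key_col
        self.predicate = predicate
        self.table = None

    def setup(self):
        # Runs once per task, before any big-table row is processed.
        self.table = build_hashtable(self.dim_lines, self.key_col, self.predicate)

    def map(self, fact_row, fact_key_col):
        # Probe the in-memory hashtable; emit one joined row per match.
        for dim_row in self.table.get(fact_row[fact_key_col], []):
            yield fact_row + dim_row


def should_map_join(on_disk_bytes, expansion_factor, heap_budget_bytes):
    """Client-side guess: does the in-memory hashtable fit the task heap?"""
    # The expansion factor is a made-up fudge for decompression and object
    # overhead; the client only sees on-disk sizes.
    return on_disk_bytes * expansion_factor <= heap_budget_bytes


if __name__ == "__main__":
    dim = ["1,US,active", "2,DE,inactive", "3,FR,active"]
    task = MapJoinTask(dim, 0, lambda r: r[2] == "active")
    task.setup()
    print(list(task.map(["100", "3", "9.99"], 1)))
    print(should_map_join(10 * 2**20, 10, 256 * 2**20))
```

The filter runs inside `setup`, so a query with a different predicate only changes what each
task keeps in memory; the localized file itself can be reused as is. The size check shows
why the fallback decision is fragile: it depends entirely on how well the guessed expansion
factor matches the real in-memory footprint.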
