hadoop-hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-968) map join may lead to very large files
Date Sat, 05 Dec 2009 05:00:20 GMT

    [ https://issues.apache.org/jira/browse/HIVE-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786340#action_12786340 ]

Ning Zhang commented on HIVE-968:
---------------------------------

That's true. But even though the hash table is accessed randomly, I'm not sure MRU will
help performance here. This class is meant to be a simple wrapper around the HashMap
data structure, and MRU adds memory overhead as well as CPU cost. I'm assuming that in most
cases the threshold will not be reached. In those cases MRU is not useful and just wastes
resources.
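
To make the trade-off concrete, here is a minimal sketch of a threshold-bounded wrapper around HashMap. It is not the code from the attached patches: the class name, the method names, and the idea of a secondary "overflow" store (a plain map standing in for something slower, such as a disk-backed table) are all assumptions made for illustration. While the number of keys stays below the threshold, every lookup is a plain HashMap access, so any MRU bookkeeping would only add overhead on that path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the wrapper class from HIVE-968.patch.
class ThresholdMapSketch<K, V> {
    private final int threshold;                         // max keys held in the primary map
    private final Map<K, V> inMemory = new HashMap<>();  // fast path
    // Assumption: stands in for a slower secondary store (e.g. disk-backed).
    private final Map<K, V> overflow = new HashMap<>();

    ThresholdMapSketch(int threshold) {
        this.threshold = threshold;
    }

    void put(K key, V value) {
        if (inMemory.containsKey(key)) {
            inMemory.put(key, value);      // update in place
        } else if (overflow.containsKey(key)) {
            overflow.put(key, value);      // key already spilled; update it there
        } else if (inMemory.size() < threshold) {
            inMemory.put(key, value);      // still under the threshold
        } else {
            overflow.put(key, value);      // threshold reached: spill new keys
        }
    }

    V get(K key) {
        if (inMemory.containsKey(key)) {
            return inMemory.get(key);      // common case: one HashMap lookup
        }
        return overflow.get(key);          // null if the key is absent
    }

    // True once at least one key has gone to the secondary store.
    boolean hasSpilled() {
        return !overflow.isEmpty();
    }
}
```

In this sketch an MRU cache could only pay off after hasSpilled() becomes true, which is the case assumed above to be rare.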

> map join may lead to very large files
> -------------------------------------
>
>                 Key: HIVE-968
>                 URL: https://issues.apache.org/jira/browse/HIVE-968
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Ning Zhang
>         Attachments: HIVE-968.patch, HIVE-968_2.patch
>
>
> If the table under consideration is a very large file, it may lead to very large files
> on the mappers.
> The job may never complete, and the files will never be cleaned from the tmp directory.
> It would be great if the table can be placed in the distributed cache, but minimally
> the following should be added:
> If the table (source) being joined leads to a very big file, it should just throw an
> error.
> New configuration parameters can be added to limit the number of rows or the size
> of the table.
> The mapper should not be tried 4 times, but it should fail immediately.
> I can't think of any better way for the mapper to communicate with the client than for
> it to write to some well-known
> HDFS file - the client can read the file periodically (while polling), and if it sees an
> error it can just kill the job, with
> appropriate error messages indicating to the client why the job died.
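
The row-limit idea in the quoted description could be sketched roughly as follows. This is an illustration only: RowLimitGuard, RowLimitExceededException, and the maxRows parameter are hypothetical names, not Hive classes or configuration keys, and the sketch does not cover how the failure would be reported back to the client.

```java
// Hypothetical sketch of a row-count limit for the joined (source) table.
class RowLimitExceededException extends RuntimeException {
    RowLimitExceededException(String message) {
        super(message);
    }
}

class RowLimitGuard {
    private final long maxRows;   // assumed to come from a configuration parameter
    private long rowsSeen = 0;

    RowLimitGuard(long maxRows) {
        this.maxRows = maxRows;
    }

    // Call once per row loaded from the source table.
    void onRow() {
        rowsSeen++;
        if (rowsSeen > maxRows) {
            // Fail right away with a message that explains why.
            throw new RowLimitExceededException(
                "source table exceeded the configured limit of " + maxRows + " rows");
        }
    }

    long rowsSeen() {
        return rowsSeen;
    }
}
```

A caller would invoke onRow() for each row it loads and treat the exception as fatal rather than retrying the task.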

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

