hadoop-hive-dev mailing list archives

From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-968) map join may lead to very large files
Date Thu, 10 Dec 2009 00:03:18 GMT

    [ https://issues.apache.org/jira/browse/HIVE-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788402#action_12788402 ]

Ning Zhang commented on HIVE-968:

Discussed with Namit offline. Below are the updates:
1) The System.out.println() call prints debugging info when an assertion fails.
I'll change it to use LOG.error().
2) o.setObj() works because whenever an object is returned by get(), it is guaranteed to be in
the main memory cache. So setObj() sets the object in the MRUItem, which changes the HashMap value.
It is a performance optimization, and I will add more comments in the code.
3) There was one issue in the put() code path when the key is not in main memory but is in the
persistent hash. Fixed that and added a unit test for that case.
4) Changed the JDBM TransactionManager to delete the log file if NO_TRANSACTION is set.
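
For readers without the patch at hand, points 2) and 3) can be illustrated with a minimal sketch. This is not the code from HIVE-968: the class TwoTierMap, the Item holder (standing in for MRUItem), and the plain HashMap used in place of the JDBM-backed persistent hash are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical two-tier map: a bounded in-memory MRU cache backed by a "persistent" tier. */
class TwoTierMap<K, V> {
    /** Mutable holder, standing in for the MRUItem mentioned above (assumption). */
    static final class Item<V> {
        V obj;
        Item(V obj) { this.obj = obj; }
    }

    private final int maxInMemory;
    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<K, Item<V>> memory = new LinkedHashMap<>(16, 0.75f, true);
    // Stand-in for the persistent hash; a real one would live on disk.
    private final Map<K, V> persistent = new HashMap<>();

    TwoTierMap(int maxInMemory) {
        if (maxInMemory < 1) throw new IllegalArgumentException("maxInMemory must be >= 1");
        this.maxInMemory = maxInMemory;
    }

    /** After get() returns a non-null value, the key is guaranteed to be in the memory tier. */
    V get(K key) {
        Item<V> item = memory.get(key);
        if (item != null) return item.obj;
        if (!persistent.containsKey(key)) return null;
        V value = persistent.remove(key);
        promote(key, value);
        return value;
    }

    void put(K key, V value) {
        Item<V> item = memory.get(key);
        if (item != null) {
            item.obj = value; // update in place through the holder, no second map insert
            return;
        }
        // The key may exist only in the persistent tier: drop the stale copy
        // there so the key never lives in both tiers with different values.
        persistent.remove(key);
        promote(key, value);
    }

    /** Insert into the memory tier, spilling the least recently used entry if full. */
    private void promote(K key, V value) {
        if (memory.size() >= maxInMemory) {
            Map.Entry<K, Item<V>> eldest = memory.entrySet().iterator().next();
            K evictedKey = eldest.getKey();
            persistent.put(evictedKey, eldest.getValue().obj);
            memory.remove(evictedKey);
        }
        memory.put(key, new Item<>(value));
    }

    int size() { return memory.size() + persistent.size(); }
}
```

The case from point 3) is a put() on a key that was spilled earlier: without the persistent.remove(key) step, the old value would survive in the persistent tier and could resurface after a later eviction.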

Will upload the patch shortly after the unit tests finish.

> map join may lead to very large files
> -------------------------------------
>                 Key: HIVE-968
>                 URL: https://issues.apache.org/jira/browse/HIVE-968
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Ning Zhang
>         Attachments: HIVE-968.patch, HIVE-968_2.patch, HIVE-968_3.patch
> If the table under consideration is a very large file, it may lead to very large files on the mappers.
> The job may never complete, and the files will never be cleaned from the tmp directory.

> It would be great if the table can be placed in the distributed cache, but minimally the following should be added:
> If the table (source) being joined leads to a very big file, it should just throw an
> New configuration parameters can be added to limit the number of rows or for the size of the table.
> The mapper should not be tried 4 times, but it should fail immediately.
> I can't think of any better way for the mapper to communicate with the client, but for it to write in some well known
> hdfs file - the client can read the file periodically (while polling), and if it sees an error can just kill the job, but with
> appropriate error messages indicating to the client why the job died.
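
The behaviour requested in the issue description (a row limit, fail-fast, and an error file the client polls) could look roughly like the sketch below. It is only an illustration under stated assumptions: MapJoinGuard, its method names, and the row limit are hypothetical, and a local file accessed through java.nio stands in for the well-known HDFS file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical guard for the in-memory side of a map join. */
class MapJoinGuard {
    /** Thrown when the joined table exceeds the configured row limit. */
    static final class TableTooLargeException extends RuntimeException {
        TableTooLargeException(String message) { super(message); }
    }

    private final long maxRows; // hypothetical configuration parameter
    private long rowsSeen = 0;

    MapJoinGuard(long maxRows) { this.maxRows = maxRows; }

    /** Call once per loaded row; fails on the first row past the limit. */
    void countRow() {
        rowsSeen++;
        if (rowsSeen > maxRows) {
            throw new TableTooLargeException(
                "map join table exceeds the limit of " + maxRows + " rows");
        }
    }

    /** Mapper side: record why the task failed at an agreed location. */
    static void reportError(Path marker, String reason) {
        try {
            Files.write(marker, reason.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Client side: call on each polling round; empty means no error was reported. */
    static Optional<String> pollError(Path marker) {
        if (!Files.exists(marker)) return Optional.empty();
        try {
            return Optional.of(new String(Files.readAllBytes(marker), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the mapper would catch TableTooLargeException, call reportError(), and exit; the client would call pollError() while polling job status and, on a non-empty result, kill the job and show the message. Suppressing task retries is left out, since that is a job configuration matter rather than something this class can do.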

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
