hive-dev mailing list archives

From "Ning Zhang (JIRA)" <>
Subject [jira] Commented: (HIVE-968) map join may lead to very large files
Date Sat, 05 Dec 2009 01:57:20 GMT


Ning Zhang commented on HIVE-968:

I tested the performance of committing on every update versus committing once per 100
updates, and there is no difference I can notice. That's why I removed it.
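
For reference, a minimal sketch of the two strategies being compared, assuming a
JDBM-style persistent map with an explicit commit() (PersistentMap and BatchedWriter
are illustrative names, not Hive's API):

{code}
// Hypothetical sketch, not Hive code: a persistent map that must be
// committed explicitly to flush pending updates to disk.
interface PersistentMap<K, V> {
  void put(K key, V value);
  void commit();
}

class BatchedWriter<K, V> {
  private static final int COMMIT_INTERVAL = 100; // commit once per 100 updates
  private final PersistentMap<K, V> map;
  private int pending = 0;

  BatchedWriter(PersistentMap<K, V> map) { this.map = map; }

  void put(K key, V value) {
    map.put(key, value);
    // The alternative is simply calling map.commit() here on every put;
    // in testing the two showed no noticeable difference.
    if (++pending >= COMMIT_INTERVAL) {
      map.commit();
      pending = 0;
    }
  }

  void close() { map.commit(); } // flush whatever is left
}
{code}

Since batching measured the same as committing every time, the simpler always-commit
path was kept.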

For MRU, I'm not sure whether it will lead to better performance or not. The use case in
MapJoin is that the data is read sequentially, so every entry is read once for each tuple
in the LHS. MRU therefore seems useless there.
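
For concreteness, this is roughly what an MRU-evicting cache would look like (MruCache
is a hypothetical sketch, not Hive code); with a one-pass sequential access pattern each
key is touched once, so recency carries no signal and no such policy can help:

{code}
import java.util.Iterator;
import java.util.LinkedHashMap;

// Hypothetical sketch of an MRU cache: evicts the most recently used
// entry when full.
class MruCache<K, V> {
  private final int capacity;
  // accessOrder=true orders entries from least- to most-recently accessed.
  private final LinkedHashMap<K, V> map = new LinkedHashMap<K, V>(16, 0.75f, true);

  MruCache(int capacity) { this.capacity = capacity; }

  V get(K key) {
    return map.get(key); // a hit moves the entry to the most-recent end
  }

  void put(K key, V value) {
    if (!map.containsKey(key) && map.size() >= capacity) {
      // Walk to the tail to find the most recently used key.
      // (A real implementation would track the tail directly.)
      K mru = null;
      for (Iterator<K> it = map.keySet().iterator(); it.hasNext(); ) {
        mru = it.next();
      }
      map.remove(mru);
    }
    map.put(key, value);
  }
}
{code}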

> map join may lead to very large files
> -------------------------------------
>                 Key: HIVE-968
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Ning Zhang
>         Attachments: HIVE-968.patch, HIVE-968_2.patch
> If the table under consideration is a very large file, it may lead to very large files
> on the mappers.
> The job may never complete, and the files will never be cleaned from the tmp directory.

> It would be great if the table could be placed in the distributed cache, but minimally
> the following should be added:
> If the table (source) being joined leads to a very big file, it should just throw an error.
> New configuration parameters can be added to limit the number of rows or the size of
> the table.
> The mapper should not be retried 4 times; it should fail immediately.
> I can't think of any better way for the mapper to communicate with the client than for it
> to write to some well-known HDFS file - the client can read the file periodically (while
> polling), and if it sees an error it can just kill the job, with appropriate error
> messages indicating to the client why the job died.
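
Putting the pieces above together, a hypothetical sketch of the row-limit check and the
well-known-error-file protocol (MapJoinGuard, MAX_ROWS, and ERROR_FILE are made-up names;
none of this is Hive code):

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch only: names are illustrative, not part of Hive.
public class MapJoinGuard {
  static final long MAX_ROWS = 1000000L;            // assumed config value
  static final Path ERROR_FILE =                    // assumed well-known location
      new Path("/tmp/hive_mapjoin_error");

  private long rowCount = 0;

  // Mapper side: called once per row of the table being joined.
  void checkRowLimit(Configuration conf) throws IOException {
    if (++rowCount > MAX_ROWS) {
      // Record the reason in the well-known HDFS file so the polling
      // client can report why the job died, then fail the task.
      FileSystem fs = FileSystem.get(conf);
      FSDataOutputStream out = fs.create(ERROR_FILE, true);
      try {
        out.writeUTF("map join exceeded row limit of " + MAX_ROWS);
      } finally {
        out.close();
      }
      throw new IOException("map join exceeded row limit of " + MAX_ROWS);
    }
  }

  // Client side: poll periodically; a non-null return means the job
  // should be killed with this message.
  static String pollForError(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    if (!fs.exists(ERROR_FILE)) {
      return null;
    }
    FSDataInputStream in = fs.open(ERROR_FILE);
    try {
      return in.readUTF();
    } finally {
      in.close();
    }
  }
}
{code}

The client would call pollForError() on its polling interval and kill the job as soon as
it returns a message, rather than waiting out the task retries.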

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
