hive-dev mailing list archives

From "Liyin Tang (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
Date Tue, 23 Nov 2010 00:16:13 GMT

     [ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liyin Tang updated HIVE-1797:
-----------------------------

    Status: Patch Available  (was: Open)

> Compressed the hashtable dump file before put into distributed cache
> --------------------------------------------------------------------
>
>                 Key: HIVE-1797
>                 URL: https://issues.apache.org/jira/browse/HIVE-1797
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Liyin Tang
>            Assignee: Liyin Tang
>         Attachments: hive-1797.patch, hive-1797_3.patch
>
>
> Clearly, the size of the small table is the performance bottleneck for map join,
> because it determines both the memory usage and the size of the dumped hashtable file.
> That means there are two bounds on map-join performance:
> 1)	The memory usage of the local task and the mapred task
> 2)	The dumped hashtable file size for the distributed cache
> The reason the test case in the last email spends most of its execution time on
> initialization is that it hits the second bound.
> Since we have already bounded the memory usage, one thing we can do is make sure the
> performance never hits the second bound before it hits the first.
> Assuming the heap size is 1.6G and the small table file is 15M compressed (75M
> uncompressed), the local task can hold roughly 1.5M unique rows in memory.
> The dumped file will then be roughly 150M, which is too large to put into the
> distributed cache.
>  
> From experiments, we can basically conclude that when the dumped file size is smaller
> than 30M, the distributed cache works well and all the mappers are initialized in a
> short time (less than 30 secs).
> One easy implementation is to compress the hashtable file.
> I used gzip to compress the hashtable file, and its size dropped from 100M to 13M.
> After several tests, all the mappers were initialized in less than 23 secs.
> But this solution adds some decompression overhead to each mapper, and mappers on the
> same machine will do duplicate decompression work.
> Maybe in the future, we can have the distributed cache support this directly.
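
The fix described above amounts to wrapping the hashtable dump streams in gzip. A
minimal sketch of the idea in Java follows; the class and method names here are
hypothetical illustrations, not the actual patch, which modifies Hive's hashtable
sink and map-join operators:

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class HashTableDumpCompression {

        // Write the serialized hashtable through gzip so the file shipped via
        // the distributed cache is much smaller (100M -> 13M in the tests above).
        static void dumpCompressed(Serializable hashTable, File out) throws IOException {
            ObjectOutputStream oos = new ObjectOutputStream(
                    new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(out))));
            try {
                oos.writeObject(hashTable);
            } finally {
                oos.close();
            }
        }

        // Each mapper pays the decompression cost here when it loads the table.
        static Object loadCompressed(File in) throws IOException, ClassNotFoundException {
            ObjectInputStream ois = new ObjectInputStream(
                    new GZIPInputStream(new BufferedInputStream(new FileInputStream(in))));
            try {
                return ois.readObject();
            } finally {
                ois.close();
            }
        }
    }

The trade-off noted above shows up in loadCompressed(): every mapper decompresses the
same file independently, so mappers on the same machine repeat the work.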

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

