hive-dev mailing list archives

From "He Yongqiang (JIRA)" <>
Subject [jira] Updated: (HIVE-1797) Compress the hashtable dump file before putting it into the distributed cache
Date Wed, 24 Nov 2010 21:54:16 GMT


He Yongqiang updated HIVE-1797:

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed! Thanks Liyin!

> Compress the hashtable dump file before putting it into the distributed cache
> ------------------------------------------------------------------------------
>                 Key: HIVE-1797
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Liyin Tang
>            Assignee: Liyin Tang
>         Attachments: hive-1797.patch, hive-1797_3.patch
> Clearly, the size of the small table is the performance bottleneck for map join,
> because the small table's size affects both the memory usage and the dumped hashtable file size.
> That means there are two boundaries on map join performance:
> 1)	the memory usage of the local task and the mapred task
> 2)	the dumped hashtable file size for the distributed cache
> The reason the test case in the last email spends most of its execution time on initialization is that it hits the second boundary.
> Since we have already bounded the memory usage, one thing we can do is ensure the performance never hits the second boundary before it hits the first.
> Assuming the heap size is 1.6G and the small table file is 15M compressed (75M uncompressed),
> the local task can hold roughly 1.5M unique rows in memory.
> The dumped file size will be roughly 150M, which is too large to put into the distributed cache.
> From experiments, we can basically conclude that when the dumped file size is below a certain threshold,
> the distributed cache works well and all the mappers are initialized in a short time (less than 30 secs).
> One easy implementation is to compress the hashtable file (a sketch follows after this quoted description).
> I used gzip to compress the hashtable file, and the file size dropped from 100M to 13M.
> In several tests, all the mappers were initialized in less than 23 secs.
> But this solution adds some decompression overhead to each mapper,
> and mappers on the same machine will do duplicated decompression work.
> Maybe in the future, we can let the distributed cache support this directly.
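
For illustration, here is a minimal sketch in plain Java of the gzip approach described above. The class and file names are hypothetical, and this is not the actual HIVE-1797 patch, which wires compression into Hive's hashtable dump/load path; the sketch only shows the compress-before-caching / decompress-in-mapper shape of the idea.

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // Sketch of the technique from the report: gzip the hashtable dump
    // before shipping it via the distributed cache, and gunzip it on the
    // mapper side before loading the hashtable. All names are hypothetical.
    public class HashTableDumpCompression {

        // Local-task side: compress the dumped hashtable file. The
        // compressed file is what would be registered with the
        // distributed cache.
        public static void compress(File dump, File gz) throws IOException {
            try (InputStream in = new BufferedInputStream(new FileInputStream(dump));
                 OutputStream out = new GZIPOutputStream(
                         new BufferedOutputStream(new FileOutputStream(gz)))) {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        }

        // Mapper side: decompress the cached file before loading the
        // hashtable. This is the per-mapper overhead the report mentions;
        // mappers on the same machine each repeat this work.
        public static void decompress(File gz, File dump) throws IOException {
            try (InputStream in = new GZIPInputStream(
                         new BufferedInputStream(new FileInputStream(gz)));
                 OutputStream out = new BufferedOutputStream(new FileOutputStream(dump))) {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        }

        public static void main(String[] args) throws IOException {
            compress(new File("hashtable.dump"), new File("hashtable.dump.gz"));
            decompress(new File("hashtable.dump.gz"), new File("hashtable.copy"));
        }
    }

The trade-off is the one noted above: gzip shrinks the file shipped through the distributed cache (100M to 13M in the reported test) at the cost of a gunzip pass in every mapper.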

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
