hive-dev mailing list archives

From "Liyin Tang (JIRA)" <>
Subject [jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
Date Sat, 20 Nov 2010 19:38:15 GMT


Liyin Tang updated HIVE-1797:

    Attachment: hive-1797_2.patch

In this patch, all the hashtable dump files are compressed and packaged into a single tar.gz archive,
and that archive is put into the distributed cache. The distributed cache decompresses the archive
for the mappers. If multiple mappers run on the same machine, the distributed cache decompresses
the archive only once.

Please review.
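As a rough illustration of this flow (a minimal sketch, not the patch itself; the class name, helper methods, and paths are illustrative assumptions), the dump files can be packed with Apache Commons Compress and registered via DistributedCache.addCacheArchive, which Hadoop unpacks once per task node:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.net.URI;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;
import org.apache.commons.compress.utils.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HashTableArchiver {

  // Pack every hashtable dump file into one tar.gz archive.
  public static void tarGzDumpFiles(File[] dumpFiles, File archive) throws Exception {
    try (TarArchiveOutputStream tar = new TarArchiveOutputStream(
        new GzipCompressorOutputStream(new FileOutputStream(archive)))) {
      for (File f : dumpFiles) {
        tar.putArchiveEntry(new TarArchiveEntry(f, f.getName()));
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(f))) {
          IOUtils.copy(in, tar);
        }
        tar.closeArchiveEntry();
      }
    }
  }

  // Upload the archive to HDFS and register it as a cache *archive*:
  // Hadoop unpacks cache archives on the task node once per node, so
  // mappers sharing a machine do not each decompress the hashtables.
  public static void shipArchive(Configuration conf, File archive, Path hdfsDir) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path dst = new Path(hdfsDir, archive.getName());
    fs.copyFromLocalFile(new Path(archive.getAbsolutePath()), dst);
    DistributedCache.addCacheArchive(new URI(dst.toString()), conf);
  }
}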

> Compressed the hashtable dump file before put into distributed cache
> --------------------------------------------------------------------
>                 Key: HIVE-1797
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Liyin Tang
>            Assignee: Liyin Tang
>         Attachments: hive-1797.patch
> Clearly, the size of the small table is the performance bottleneck for map join,
> because the size of the small table affects both the memory usage and the dumped hashtable file size.
> That means there are 2 boundaries on map join performance:
> 1) The memory usage of the local task and the mapred task
> 2) The dumped hashtable file size for the distributed cache
> The reason the test case in the last email spends most of its execution time on initialization
> is that it hits the second boundary.
> Since we have already bounded the memory usage, one thing we can do is make sure performance
> never hits the second boundary before it hits the first one.
> Assuming the heap size is 1.6 G and the small table file is 15M compressed (75M uncompressed),
> the local task can roughly hold 1.5M unique rows in memory.
> Roughly, the dumped file size will be 150M, which is too large to put into the distributed cache.
> From experiments, we can basically conclude that when the dumped file size is below a certain threshold,
> the distributed cache works well and all the mappers are initialized in a short time (less than 30 secs).
> One easy implementation is to compress the hashtable file (see the sketch below).
> I used gzip to compress the hashtable file, and the file size dropped from 100M to 13M.
> After several tests, all the mappers were initialized in less than 23 secs.
> But this solution adds some decompression overhead to each mapper.
> Mappers on the same machine will do duplicated decompression work.
> Maybe in the future, we can let the distributed cache support this.
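
For reference, the simpler per-file approach described in the quoted text amounts to gzipping the dump before shipping it and ungzipping it in each mapper. A minimal sketch with java.util.zip (class and file names are illustrative, not from the patch):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipHashTable {

  // Compress the hashtable dump before adding it to the distributed cache
  // (e.g. ~100M -> ~13M in the experiment described above).
  public static void compress(String dumpFile, String gzFile) throws IOException {
    try (InputStream in = new FileInputStream(dumpFile);
         OutputStream out = new GZIPOutputStream(new FileOutputStream(gzFile))) {
      copy(in, out);
    }
  }

  // Each mapper pays this decompression cost when it loads the hashtable;
  // mappers on the same machine repeat the work, which is the drawback noted above.
  public static void decompress(String gzFile, String dumpFile) throws IOException {
    try (InputStream in = new GZIPInputStream(new FileInputStream(gzFile));
         OutputStream out = new FileOutputStream(dumpFile)) {
      copy(in, out);
    }
  }

  private static void copy(InputStream in, OutputStream out) throws IOException {
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
  }
}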

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
