hive-dev mailing list archives

From "Liyin Tang (JIRA)" <j...@apache.org>
Subject [jira] Created: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
Date Thu, 18 Nov 2010 00:41:14 GMT
Compressed the hashtable dump file before put into distributed cache
--------------------------------------------------------------------

                 Key: HIVE-1797
                 URL: https://issues.apache.org/jira/browse/HIVE-1797
             Project: Hive
          Issue Type: Improvement
          Components: Query Processor
    Affects Versions: 0.7.0
            Reporter: Liyin Tang
            Assignee: Liyin Tang


Clearly, the size of the small table is the performance bottleneck for map join,
because it determines both the memory usage and the size of the dumped hashtable file.
That means there are 2 bounds on map join performance:
1)	The memory usage of the local task and the mapred task
2)	The size of the dumped hashtable file put into the distributed cache

The test case in the last email spends most of its execution time on initialization
because it hits the second bound.
Since we have already bounded the memory usage, one thing we can do is make sure the second
bound is never hit before the first one.

Assuming the heap size is 1.6 GB and the small table file is 15 MB compressed (75 MB uncompressed),
the local task can hold roughly 1.5M unique rows in memory.
The dumped hashtable file will then be roughly 150 MB, which is too large to put into the
distributed cache.
 
From experiments, we can basically conclude that when the dumped file size is smaller than
30 MB, the distributed cache works well and all the mappers are initialized in a short time
(less than 30 secs).

One easy implementation is to compress the hashtable file.
Using gzip, the hashtable file is compressed from 100 MB down to 13 MB.
After several tests, all the mappers were initialized in less than 23 secs.
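As a rough sketch of this step (class and method names here are hypothetical, not Hive's actual code), the local task could stream the dump file through the JDK's GZIPOutputStream before adding it to the distributed cache:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper (illustrative, not Hive's API): gzip-compress the
// local task's hashtable dump file before it is shipped to the cache.
public class HashTableDumpCompressor {

    /** Stream the dump file at inPath through gzip into outPath. */
    public static void compress(String inPath, String outPath) throws IOException {
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = new BufferedInputStream(new FileInputStream(inPath));
             OutputStream out = new GZIPOutputStream(new FileOutputStream(outPath))) {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```

Since the dump contains many serialized rows with repeated keys and structure, a generic stream compressor like gzip already gets a large reduction (100 MB to 13 MB in the test above) without any format changes.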

But this solution adds some decompression overhead to each mapper,
and mappers on the same machine will do duplicate decompression work.
Maybe in the future the distributed cache can support decompression natively.
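The mapper-side counterpart could look like the following sketch (again with hypothetical names; in practice this would run once per mapper during initialization, which is where the duplicated work comes from):

```java
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical mapper-side helper (illustrative, not Hive's API): inflate
// the gzipped hashtable dump fetched from the distributed cache so the
// mapper can load it. Every mapper on a machine repeats this step.
public class HashTableDumpInflater {

    /** Decompress the gzip file at gzPath into outPath. */
    public static void inflate(String gzPath, String outPath) throws IOException {
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = new GZIPInputStream(new FileInputStream(gzPath));
             OutputStream out = new BufferedOutputStream(new FileOutputStream(outPath))) {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```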


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

