hadoop-pig-dev mailing list archives

From "Yan Zhou (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
Date Tue, 31 Aug 2010 23:36:55 GMT

     [ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1501:
--------------------------

    Release Note: 
This feature saves HDFS space used to store the intermediate data generated by Pig and may
improve query execution speed. In general, the more intermediate data a query generates, the
greater the storage savings and speedup.

This feature introduces no backward compatibility issues.

Two Java properties control the behavior:

pig.tmpfilecompression (default: false) determines whether temporary files are compressed.

pig.tmpfilecompression.codec specifies which compression codec to use when
pig.tmpfilecompression is true. Currently, Pig accepts only "gz" and "lzo" as values. Because
LZO is licensed under the GPL, Hadoop may need to be configured separately to use the LZO codec.
Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.


An example is the following "test.pig" script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views'
    using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent:long, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar \
    -Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 \
    -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo \
    org.apache.pig.Main ./test.pig
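
Since only "gz" and "lzo" are accepted as codec values, a launch wrapper can reject anything
else before the job starts. The following is a minimal sketch, not part of Pig: the function
name pig_tmp_compression_opts is hypothetical, and it only assembles the two -D options shown
in the launch command above.

```shell
#!/bin/sh
# Sketch only: pig_tmp_compression_opts is a hypothetical helper, not part of Pig.
# It prints the two -D options from the release note for the given codec and
# rejects anything other than "gz" or "lzo", the only values the note lists.
pig_tmp_compression_opts() {
    case "$1" in
        gz|lzo)
            printf '%s %s\n' "-Dpig.tmpfilecompression=true" \
                "-Dpig.tmpfilecompression.codec=$1"
            ;;
        *)
            echo "unsupported codec: '$1' (expected gz or lzo)" >&2
            return 1
            ;;
    esac
}

# Usage, with the same paths as the launch command above:
#   java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar \
#       $(pig_tmp_compression_opts gz) org.apache.pig.Main ./test.pig
```

The command substitution is left unquoted on purpose so that the shell splits the output into
two separate -D arguments.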


> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch,
>                      PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as reducer output
> in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

