hadoop-pig-dev mailing list archives

From "Yan Zhou (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
Date Thu, 26 Aug 2010 15:15:55 GMT

     [ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1501:

    Status: Patch Available  (was: Open)

This feature reduces the HDFS space used to store Pig's intermediate data and can potentially
improve query execution speed. In general, the more intermediate data a query generates, the
greater the storage savings and speedup.

There are no backward compatibility issues as a result of this feature.

An example is the following "test.pig" script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info,
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar -Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig
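
To compare runs with and without intermediate-file compression, the same command line can be generated from a small helper. This is only a sketch: the function name build_pig_cmd and its arguments are hypothetical, the paths and the two -D properties are copied from the command above, and it assumes that passing -Dpig.tmpfilecompression=false behaves like leaving the property unset.

```shell
#!/bin/sh
# Sketch: print the java command line for test.pig with or without
# intermediate-file compression. build_pig_cmd is a hypothetical helper;
# the classpath, library path and -D properties come from the message above.
build_pig_cmd() {
    compress="$1"      # "true" or "false"
    codec="${2:-}"     # e.g. "lzo"; only used when compress is "true"
    opts="-Dpig.tmpfilecompression=${compress}"
    if [ "$compress" = "true" ]; then
        opts="${opts} -Dpig.tmpfilecompression.codec=${codec}"
    fi
    # echo joins its arguments with single spaces, giving one command line.
    echo "java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar" \
         "-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32" \
         "${opts} org.apache.pig.Main ./test.pig"
}

# Print both variants so they can be run (for example under `time`) and compared.
build_pig_cmd true lzo
build_pig_cmd false
```

The helper only prints the commands; running each printed line under `time` on the same input would give a rough wall-clock comparison.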

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>         Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
> We would like to understand how compressing map results as well as reducer output
> in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
