pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-4679) Performance degradation due to InputSizeReducerEstimator since PIG-3754
Date Wed, 16 Sep 2015 18:12:46 GMT

     [ https://issues.apache.org/jira/browse/PIG-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-4679:
----------------------------
    Attachment: PIG-4679-1.patch

> Performance degradation due to InputSizeReducerEstimator since PIG-3754
> -----------------------------------------------------------------------
>
>                 Key: PIG-4679
>                 URL: https://issues.apache.org/jira/browse/PIG-4679
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>             Fix For: 0.16.0
>
>         Attachments: PIG-4679-0.patch, PIG-4679-1.patch
>
>
> When the input includes a non-HDFS location (for example, a JOIN involving both HBase
> tables and intermediate temp files), the Pig 0.14 ReducerEstimator returns a total input
> size of -1 (unknown), whereas Pig 0.12.1 returned the sum of the temp file sizes as the
> total size. Because -1 is returned as the input size, Pig ends up using only one reducer
> for the job.
> STEPS TO REPRODUCE:
> 1.	Create an HBase table with enough data, using the PerformanceEvaluation tool to generate the data:
> {code:java}
> hbase org.apache.hadoop.hbase.PerformanceEvaluation --presplit=20 --rows=1000000 sequentialWrite 10
> {code}
> 2.	Dump the table data into a file that we can then use in a Pig JOIN. The following Pig script generates the data file:
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS (row_key: chararray, data: chararray);
> STORE A INTO 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|');
> {code}
> 3.	Check the file size to make sure that it exceeds 1,000,000,000 bytes, which is Pig's default bytes-per-reducer setting:
> {code:java}
> $ hdfs dfs -count hdfs:///tmp/re_test/test_table_data
> QA:           1           41        10280000000 hdfs:///tmp/re_test/test_table_data
> PROD:         1           57        10280000000 hdfs:///tmp/re_test/test_table_data
> {code}
> 4.	Run a Pig script that joins the HBase table with the data file. QA and PROD will use a different number of reducers: QA (176243) should run 1 reducer and PROD (176258) should run 11 reducers (10,280,000,000 / 1,000,000,000, rounded up).
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS (row_key: chararray, data: chararray);
> B = LOAD 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|') AS (row_key: chararray, data: chararray);
> C = JOIN A BY row_key, B BY row_key;
> STORE C INTO 'hdfs:///tmp/re_test/test_table_data_join' USING PigStorage('|');
> {code}
> Pig 0.12.1 ran 11 reducers; Pig 0.13+ runs only 1 reducer.
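
The reducer counts in the description can be reproduced with a small sketch. This is not Pig's actual code: the function names, the `skip_unknown` flag, and the use of -1 as a list entry for the HBase input are assumptions made for illustration. It only models the two behaviors reported above, namely summing the sizes that are known (as described for 0.12.1) versus treating the whole input as unknown as soon as one location has no measurable size (as described for 0.14).

```python
import math

# Default bytes per reducer, as stated in the report.
BYTES_PER_REDUCER = 1_000_000_000
# Marker for an input whose size cannot be determined (e.g. an HBase table).
UNKNOWN = -1


def total_input_size(sizes, skip_unknown):
    """Sum input sizes in bytes.

    skip_unknown=True  models the reported 0.12.1 behavior: ignore unknown
                       inputs and sum the rest.
    skip_unknown=False models the reported 0.14 behavior: any unknown input
                       makes the total unknown.
    """
    known = [s for s in sizes if s != UNKNOWN]
    if len(known) < len(sizes) and not skip_unknown:
        return UNKNOWN
    return sum(known)


def estimate_reducers(total_size, bytes_per_reducer=BYTES_PER_REDUCER):
    """Return the reducer count; an unknown total falls back to one reducer."""
    if total_size == UNKNOWN:
        return 1
    return max(1, math.ceil(total_size / bytes_per_reducer))


# The JOIN from step 4: one HBase input (unknown size) plus the 10,280,000,000-byte file.
join_inputs = [UNKNOWN, 10_280_000_000]

print(estimate_reducers(total_input_size(join_inputs, skip_unknown=True)))   # 11
print(estimate_reducers(total_input_size(join_inputs, skip_unknown=False)))  # 1
```

With the sizes from step 3, the first call yields the 11 reducers seen on PROD and the second yields the single reducer seen on QA.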



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
