pig-dev mailing list archives

From Ángel Álvarez (JIRA) <j...@apache.org>
Subject [jira] [Commented] (PIG-4512) No performance improvement using OrcStorage
Date Tue, 28 Apr 2015 09:47:06 GMT

    [ https://issues.apache.org/jira/browse/PIG-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516745#comment-14516745 ]

Ángel Álvarez commented on PIG-4512:
------------------------------------

I've sorted the data as Daniel suggested, and this is what I've got:

Loader           T1      T2      T3      T4    Average (ms)
HCatLoader    48134   46217   55369   54358        51019.50
OrcStorage    44290   49200   49984   50767        48560.25
PigStorage    19307   24092   20952   24774        22281.25

On average, OrcStorage is only about 2.5 seconds faster than HCatLoader. Curiously,
PigStorage is the clear winner by far. Splitting the file before importing it into
Hive, however, does not seem to have any significant influence.
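
As a cross-check, the averages in the table can be recomputed with a short Python sketch. Only the timings come from the table; the names timings_ms, averages_ms and saving_ms are illustrative and not part of the original test.

```python
from statistics import mean

# Wall-clock times in ms for runs T1..T4, copied from the table above.
timings_ms = {
    "HCatLoader": [48134, 46217, 55369, 54358],
    "OrcStorage": [44290, 49200, 49984, 50767],
    "PigStorage": [19307, 24092, 20952, 24774],
}

# Arithmetic mean of the four runs for each loader.
averages_ms = {name: mean(runs) for name, runs in timings_ms.items()}


def saving_ms(baseline, candidate):
    """Average time saved by `candidate` relative to `baseline`, in ms."""
    return averages_ms[baseline] - averages_ms[candidate]


for name, avg in averages_ms.items():
    print(f"{name:<12}{avg:>10.2f} ms")

print(f"OrcStorage vs HCatLoader: "
      f"{saving_ms('HCatLoader', 'OrcStorage'):.2f} ms saved on average")
```

With these figures, saving_ms('HCatLoader', 'OrcStorage') gives 2459.25 ms, which is the "2 or 3 seconds" gap mentioned above.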

On the other hand, predicate pushdown is enabled in Hive by default (https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties):

   hive.optimize.ppd
   Default Value: true
   Added In: Hive 0.4.0
   Whether to enable predicate pushdown (PPD). 

So, if I run more or less the same operation in Hive:

export HADOOP_OPTS="-Dhive.execution.engine=tez"
hive -e "select uri,count(*) from nasadata_orc where uri=='test' group by uri;"

The one-row result is returned in only 14048.25 ms (on average). Does this mean my test in
Pig is not using predicate pushdown?

> No performance improvement using OrcStorage
> -------------------------------------------
>
>                 Key: PIG-4512
>                 URL: https://issues.apache.org/jira/browse/PIG-4512
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>         Environment: Hortonworks 2.2, Pig 0.14.0, Hive 0.14.0, Tez
>            Reporter: Ángel Álvarez
>            Priority: Minor
>
> I've been doing some tests with Pig & Hive, trying to gain some performance using
> the OrcStorage class and its "Predicate Push Down" loader. I followed these steps:
> 1. Download a dataset
> ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz
> 2. Create a new larger file by copying the same original file multiple times.
> cat NASA_access_log_Aug95 NASA_access_log_Aug95 ... > NASA
> 3. Add a new line in the data file
> echo 'slppp6.intermind.net - - [01/Aug/1995:00:00:11 -0400] "GET test HTTP/1.0" 200 9202' >> NASA
> and split the file into different parts
> split -l 1000000 NASA NASA.
> 4. Create the ORC table in Hive
> DROP TABLE nasadata_txt;
> DROP TABLE nasadata_orc;
> CREATE TABLE nasadata_txt(ip VARCHAR(50), user_identifier VARCHAR(50), user_id VARCHAR(50), date_time VARCHAR(50), zone VARCHAR(10), method VARCHAR(5), uri VARCHAR(200), version VARCHAR(10), status DECIMAL(3,0), size DECIMAL(10,0)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE;
> CREATE TABLE nasadata_orc(ip VARCHAR(50), user_identifier VARCHAR(50), user_id VARCHAR(50), date_time VARCHAR(50), zone VARCHAR(10), method VARCHAR(5), uri VARCHAR(200), version VARCHAR(10), status DECIMAL(3,0), size DECIMAL(10,0)) STORED AS ORC;
> -- Load into Text table
> LOAD DATA LOCAL INPATH 'NASA.*' INTO TABLE nasadata_txt;
> -- Copy to ORC table
> INSERT OVERWRITE TABLE nasadata_orc SELECT * FROM nasadata_txt;
> 5. Execute this Pig script
> rmf /tmp/pruebaPPD;
> A = LOAD '/apps/hive/warehouse/nasadata_orc' using OrcStorage() as (ip,user_identifier,user_id,date_time,zone,method,uri,version,status,size);
> A = foreach A generate ip,uri,status;
> A = filter A by uri == 'test';
> A = group A by uri;
> A = foreach A generate group,COUNT(*);
> store A into '/tmp/pruebaPPD' using PigStorage(';');
> 6. Execute the previous script, replacing OrcStorage with org.apache.hive.hcatalog.pig.HCatLoader.
> I can't see any difference in performance between OrcStorage and HCatLoader. Is
> there anything wrong with what I'm doing? Do I have to set any property?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
