Mailing-List: contact dev-help@pig.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@pig.apache.org
Date: Tue, 28 Apr 2015 09:47:06 +0000 (UTC)
From: Ángel Álvarez (JIRA) <jira@apache.org>
To: pig-dev@hadoop.apache.org
Message-ID: <JIRA.12822239.1429516232000.5490.1430214426953@Atlassian.JIRA>
In-Reply-To: <JIRA.12822239.1429516232000@Atlassian.JIRA>
References: <JIRA.12822239.1429516232000@Atlassian.JIRA>
 <JIRA.12822239.1429516232201@arcas>
Subject: [jira] [Commented] (PIG-4512) No performance improvement using
 OrcStorage
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/PIG-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516745#comment-14516745 ]

Ángel Álvarez commented on PIG-4512:
------------------------------------

I've sorted the data as Daniel suggested, and this is what I've got:

              T1      T2      T3      T4      Average (ms)
HCatLoader    48134   46217   55369   54358   51019.50
OrcStorage    44290   49200   49984   50767   48560.25
PigStorage    19307   24092   20952   24774   22281.25

OrcStorage beats HCatLoader by only about 2.5 seconds on average. Curiously,
PigStorage is the clear winner by far, at less than half the time of either.
Splitting the file before importing it into Hive, however, does not seem to
have any significant influence.
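
As a cross-check on the averages in the table, here is a minimal Python sketch that recomputes them from the four runs and reports the gap between HCatLoader and OrcStorage. The names `timings_ms` and `summarize` are only illustrative; the numbers are copied from the table.

```python
from statistics import mean

# Wall-clock timings in milliseconds for runs T1..T4, copied from the table.
timings_ms = {
    "HCatLoader": [48134, 46217, 55369, 54358],
    "OrcStorage": [44290, 49200, 49984, 50767],
    "PigStorage": [19307, 24092, 20952, 24774],
}


def summarize(timings):
    """Return a dict mapping each loader name to its average time in ms."""
    return {name: mean(runs) for name, runs in timings.items()}


averages = summarize(timings_ms)
for name, avg in averages.items():
    print(f"{name}: {avg:.2f} ms")

# Average gap between HCatLoader and OrcStorage, in milliseconds.
orc_gain_ms = averages["HCatLoader"] - averages["OrcStorage"]
print(f"OrcStorage is {orc_gain_ms:.2f} ms faster than HCatLoader on average")
```

With only four runs per loader and run-to-run differences of several seconds, a gap of this size should be read with some caution.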

On the other hand, predicate pushdown is enabled in Hive by default
(https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties):

   hive.optimize.ppd
   Default Value: true
   Added In: Hive 0.4.0
   Whether to enable predicate pushdown (PPD).

So, if I run more or less the same operation in Hive:

export HADOOP_OPTS="-Dhive.execution.engine=tez"
hive -e "select uri,count(*) from nasadata_orc where uri=='test' group by uri;"

The one-row result is obtained in only 14048.25 ms (on average). Does this
mean my test in Pig is not using predicate pushdown?
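
For anyone who wants to repeat the comparison, averages like these can be collected with a small timing harness. The following is only a sketch under assumptions: the helper names `time_command_ms` and `average_ms` are hypothetical, and it measures end-to-end wall-clock time of the whole command (including client and JVM startup), which may differ from how the figures in this message were taken.

```python
import subprocess
import sys
import time


def time_command_ms(argv, runs=4):
    """Run the command `argv` `runs` times; return wall-clock times in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def average_ms(samples):
    """Arithmetic mean of a non-empty list of timings."""
    return sum(samples) / len(samples)


# Example call (not executed here), using the Hive query from this message:
#   samples = time_command_ms(
#       ["hive", "-e",
#        "select uri,count(*) from nasadata_orc where uri=='test' group by uri;"])
#   print(f"{average_ms(samples):.2f} ms")
```

Setting `HADOOP_OPTS` in the environment before starting the script carries over to the child process, since `subprocess.run` inherits the parent environment by default.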

> No performance improvement using OrcStorage
> -------------------------------------------
>
>                 Key: PIG-4512
>                 URL: https://issues.apache.org/jira/browse/PIG-4512
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>         Environment: Hortonworks 2.2, Pig 14.0, Hive 0.14.0, Tez
>            Reporter: Ángel Álvarez
>            Priority: Minor
>
> I've been doing some tests with Pig & Hive, trying to gain some performance
> using the OrcStorage class and its "Predicate Push Down" loader. I've
> followed these steps:
> 1. Download a dataset
> ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz
> 2. Create a new larger file by copying the same original file multiple times.
> cat NASA_access_log_Aug95 NASA_access_log_Aug95 ... > NASA
> 3. Add a new line in the data file
> echo 'slppp6.intermind.net - - [01/Aug/1995:00:00:11 -0400] "GET test HTTP/1.0" 200 9202' >> NASA
> and split the file into different parts
> split -l 1000000 NASA NASA.
> 4. Create the ORC table in Hive
> DROP TABLE nasadata_txt;
> DROP TABLE nasadata_orc;
> CREATE TABLE nasadata_txt(ip VARCHAR(50), user_identifier VARCHAR(50), user_id VARCHAR(50),date_time VARCHAR(50),zone VARCHAR(10),method VARCHAR(5),uri VARCHAR(200),version VARCHAR(10),status DECIMAL(3,0),size DECIMAL(10,0)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE;
> CREATE TABLE nasadata_orc(ip VARCHAR(50), user_identifier VARCHAR(50), user_id VARCHAR(50),date_time VARCHAR(50),zone VARCHAR(10),method VARCHAR(5),uri VARCHAR(200),version VARCHAR(10),status DECIMAL(3,0),size DECIMAL(10,0)) STORED AS ORC;
> -- Load into Text table
> LOAD DATA LOCAL INPATH 'NASA.*' INTO TABLE nasadata_txt;
> -- Copy to ORC table
> INSERT OVERWRITE TABLE nasadata_orc SELECT * FROM nasadata_txt;
> 5. Execute this Pig script
> rmf /tmp/pruebaPPD;
> A = LOAD '/apps/hive/warehouse/nasadata_orc' using OrcStorage() as (ip,user_identifier,user_id,date_time,zone,method,uri,version,status,size);
> A = foreach A generate ip,uri,status;
> A = filter A by uri == 'test';
> A = group A by uri;
> A = foreach A generate group,COUNT(*);
> store A into '/tmp/pruebaPPD' using PigStorage(';');
> 6. Execute the previous script, replacing OrcStorage with
> org.apache.hive.hcatalog.pig.HCatLoader.
> I can't see any difference in performance between using OrcStorage and
> HCatLoader. Is there anything wrong in what I'm doing? Do I have to set any
> property?

