Date: Tue, 28 Apr 2015 09:47:06 +0000 (UTC)
From: Ángel Álvarez (JIRA)
To: pig-dev@hadoop.apache.org
Reply-To: dev@pig.apache.org
Subject: [jira] [Commented] (PIG-4512) No performance improvement using OrcStorage

    [ https://issues.apache.org/jira/browse/PIG-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516745#comment-14516745 ]

Ángel Álvarez commented on PIG-4512:
------------------------------------

I've sorted the data as Daniel suggested, and this is what I've got:

              T1      T2      T3      T4      Average
HCatLoader    48134   46217   55369   54358   = 51019.5 ms
OrcStorage    44290   49200   49984   50767   = 48560.25 ms
PigStorage    19307   24092   20952   24774   = 22281.25 ms

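For anyone who wants to double-check the arithmetic, here is a minimal sketch that recomputes the averages from the four runs. Python is used only for illustration, and the names timings_ms, averages_ms and orc_gain_ms are made up for this example; the numbers are the ones reported in the table.

```python
# Timings in milliseconds for runs T1..T4, copied from the table above.
timings_ms = {
    "HCatLoader": [48134, 46217, 55369, 54358],
    "OrcStorage": [44290, 49200, 49984, 50767],
    "PigStorage": [19307, 24092, 20952, 24774],
}

# Arithmetic mean of the four runs for each loader.
averages_ms = {name: sum(runs) / len(runs) for name, runs in timings_ms.items()}

# Difference between the HCatLoader and OrcStorage averages.
orc_gain_ms = averages_ms["HCatLoader"] - averages_ms["OrcStorage"]

for name, avg in averages_ms.items():
    print(f"{name}: {avg} ms")
print(f"OrcStorage gain over HCatLoader: {orc_gain_ms} ms")
```

The gap between HCatLoader and OrcStorage comes out at roughly 2.5 seconds, while PigStorage takes less than half the time of either.
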
OrcStorage improves on HCatLoader by no more than 2 or 3 seconds on average. Curiously, PigStorage is the clear winner (by far). Splitting the file before importing it into Hive, however, does not seem to have any significant influence.

On the other hand, predicate pushdown is enabled in Hive by default (https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties):

hive.optimize.ppd
Default Value: true
Added In: Hive 0.4.0
Whether to enable predicate pushdown (PPD).

So, if I try to do more or less the same operation in Hive

export HADOOP_OPTS="-Dhive.execution.engine=tez"
hive -e "select uri,count(*) from nasadata_orc where uri=='test' group by uri;"

The one-row result is obtained in only 14048.25 ms (on average). Does this mean my test in Pig is not using Predicate Pushdown?


> No performance improvement using OrcStorage
> -------------------------------------------
>
>                 Key: PIG-4512
>                 URL: https://issues.apache.org/jira/browse/PIG-4512
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>         Environment: Hortonworks 2.2, Pig 14.0, Hive 0.14.0, Tez
>            Reporter: Ángel Álvarez
>            Priority: Minor
>
> I've been doing some tests with Pig & Hive, trying to gain some performance using the OrcStorage class and its "Predicate Push Down" loader. I've followed the following steps:
> 1. Download a dataset
> ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz
> 2. Create a new larger file by copying the same original file multiple times.
> cat NASA_access_log_Aug95 NASA_access_log_Aug95 ... > NASA
> 3. Add a new line in the data file
> echo 'slppp6.intermind.net - - [01/Aug/1995:00:00:11 -0400] "GET test HTTP/1.0" 200 9202' >> NASA
> and split the file into different parts
> split -l 1000000 NASA NASA.
> 4.
Create the ORC table in Hive
> DROP TABLE nasadata_txt;
> DROP TABLE nasadata_orc;
> CREATE TABLE nasadata_txt(ip VARCHAR(50), user_identifier VARCHAR(50), user_id VARCHAR(50),date_time VARCHAR(50),zone VARCHAR(10),method VARCHAR(5),uri VARCHAR(200),version VARCHAR(10),status DECIMAL(3,0),size DECIMAL(10,0)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE;
> CREATE TABLE nasadata_orc(ip VARCHAR(50), user_identifier VARCHAR(50), user_id VARCHAR(50),date_time VARCHAR(50),zone VARCHAR(10),method VARCHAR(5),uri VARCHAR(200),version VARCHAR(10),status DECIMAL(3,0),size DECIMAL(10,0)) STORED AS ORC;
> -- Load into Text table
> LOAD DATA LOCAL INPATH 'NASA.*' INTO TABLE nasadata_txt;
> -- Copy to ORC table
> INSERT OVERWRITE TABLE nasadata_orc SELECT * FROM nasadata_txt;
> 5. Execute this pig script
> rmf /tmp/pruebaPPD;
> A = LOAD '/apps/hive/warehouse/nasadata_orc' using OrcStorage() as (ip,user_identifier,user_id,date_time,zone,method,uri,version,status,size);
> A = foreach A generate ip,uri,status;
> A = filter A by uri == 'test';
> A = group A by uri;
> A = foreach A generate group,COUNT(*);
> store A into '/tmp/pruebaPPD' using PigStorage(';');
> 6. Execute the previous script replacing OrcStorage by org.apache.hive.hcatalog.pig.HCatLoader.
> I can't see any difference in performance between using OrcStorage and HCatLoader. Is there anything wrong in what I'm doing? Do I have to set any property?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)