Date: Thu, 6 Oct 2011 16:01:29 +0000 (UTC)
From: "Vincent BARAT (Updated) (JIRA)"
To: pig-dev@hadoop.apache.org
Reply-To: dev@pig.apache.org
Message-ID: <1194363452.3854.1317916889961.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1127558571.7304.1315561028802.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (PIG-2271) PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x

     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent BARAT updated PIG-2271:
-------------------------------

    Description:
I'm using the 0.9.1 official release.
My input data are read from a text file 'activity' (provided as an attachment):
{code}
00,1239698069000,  <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}
The following script works correctly:
{code}
-- load input data
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
-- group input data
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
-- store grouped activities in a temporary file
STORE activities INTO 'tmp' USING PigStorage();
-- reload grouped activities from the temporary file
activities = LOAD 'tmp' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
-- store grouped activities again in an output file
STORE activities INTO 'output' USING PigStorage();
{code}
After running this script, the 'output' file contains the correct result:
{code}
00	{(1239698069000,)}
01	{(1239698505000,b),(1239698369000,a)}
02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
04	{(1239698417000,c)}
{code}
But the issue occurs when I use BinStorage() instead of PigStorage() to store / reload my temporary files.
In that case the 'output' file is incomplete: the bag for sid 00 is missing.
{code}
00
01	{(1239698505000,b),(1239698369000,a)}
02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
04	{(1239698417000,c)}
{code}
The failing script is the following:
{code}
-- load input data
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
-- group input data
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
-- store grouped activities in a temporary file (BinStorage instead of PigStorage)
STORE activities INTO 'tmp' USING BinStorage();
-- reload grouped activities from the temporary file (BinStorage instead of PigStorage)
activities = LOAD 'tmp' USING BinStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
-- store grouped activities again in an output file
STORE activities INTO 'output' USING PigStorage();
{code}
So the issue seems to lie in the way BinStorage() stores or loads bags.

  was:
I'm using the 0.9.1 official release.
My input data are read from a text file 'activity' (provided as attachment):
{code}
00,1239698069000,  <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}
My first script is working correctly:
{code}
-- load input data
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);
-- group input data
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);
-- store grouped activities in a temporary file
STORE activities INTO 'tmp1' USING PigStorage();
-- reload grouped activities from the temporary file
activities = LOAD 'tmp1' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
-- store grouped activities again in another temporary file
STORE activities INTO 'tmp2' USING PigStorage();
{code}
The issue occurs when I use BinStorage() or PigStorage(',') instead of PigStorage() to store / reload my temporary files.


> PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x
> ------------------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Vincent BARAT
>            Priority: Critical
>         Attachments: activity
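
A minimal sketch for isolating the affected rows, meant to be appended to the failing script right after the reload from 'tmp'. It assumes the lost bag comes back as null, which the empty second column for sid 00 suggests but does not prove; the alias "missing" is hypothetical and not part of the original report:
{code}
-- keep only the rows whose bag did not survive the store / reload round trip
-- (assumption: the lost bag is loaded as null rather than as an empty bag)
missing = FILTER activities BY acts IS NULL;
-- print whatever rows match, so they can be compared with the input file
DUMP missing;
{code}
If nothing is printed, the bag may instead be reloaded as an empty bag; filtering with IsEmpty(acts) in place of the null test would help tell the two cases apart.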