pig-dev mailing list archives

From "Vincent BARAT (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-2271) PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x
Date Thu, 06 Oct 2011 16:01:29 GMT

     [ https://issues.apache.org/jira/browse/PIG-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent BARAT updated PIG-2271:
-------------------------------

    Description: 
I'm using the 0.9.1 official release.

My input data are read from a text file 'activity' (provided as attachment):

{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}

The following script works correctly:

{code}
-- load input data
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);

-- group input data
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);

-- store grouped activities in a temporary file
STORE activities INTO 'tmp' USING PigStorage();

-- reload grouped activities from the temporary file
activities = LOAD 'tmp' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });

-- store grouped activities again in an output file
STORE activities INTO 'output' USING PigStorage();
{code}

After running this script, the 'output' file contains a correct result:

{code}
00	{(1239698069000,)}
01	{(1239698505000,b),(1239698369000,a)}
02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
04	{(1239698417000,c)}
{code}

The issue occurs when I use BinStorage() instead of PigStorage() to store and reload the
temporary file. In that case the 'output' file is incomplete (the bag for sid 00 is missing):

{code}
00	
01	{(1239698505000,b),(1239698369000,a)}
02	{(1239698413000,b),(1239698553000,c),(1239698313000,a)}
03	{(1239698316000,a),(1239698516000,c),(1239698416000,b),(1239698621000,d)}
04	{(1239698417000,c)}
{code}

The failing script is the following (identical to the first one, except that the temporary
file is stored and reloaded with BinStorage()):

{code}
-- load input data
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);

-- group input data
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);

-- store grouped activities in a temporary file
STORE activities INTO 'tmp' USING BinStorage();

-- reload grouped activities from the temporary file
activities = LOAD 'tmp' USING BinStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });

-- store grouped activities again in an output file
STORE activities INTO 'output' USING PigStorage();
{code}

So the issue seems to lie in the way BinStorage() stores or loads bags.
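
One possible way to tell the store side from the load side, given here only as a sketch: reload the 'tmp' file written with BinStorage() once without a schema and once with the declared schema, then compare the row for sid 00. The relation names 'untyped' and 'typed' are illustrative and are not part of the scripts above.

{code}
-- reload the temporary file without declaring a schema
-- ('untyped' is an illustrative name)
untyped = LOAD 'tmp' USING BinStorage();
DUMP untyped;

-- reload the same file with the schema used in the failing script
-- ('typed' is an illustrative name)
typed = LOAD 'tmp' USING BinStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long, name:chararray) });
DESCRIBE typed;
DUMP typed;
{code}

If the bag for sid 00 appears in 'untyped' but not in 'typed', the data was presumably written correctly and is lost when the declared schema is applied at load time. If it is missing in both, the store side is the more likely culprit.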


  was:
I'm using the 0.9.1 official release.

My input data are read from a text file 'activity' (provided as attachment):

{code}
00,1239698069000, <- this is the line that is not correctly handled
01,1239698505000,b
01,1239698369000,a
02,1239698413000,b
02,1239698553000,c
02,1239698313000,a
03,1239698316000,a
03,1239698516000,c
03,1239698416000,b
03,1239698621000,d
04,1239698417000,c
{code}

My first script is working correctly:

{code}
-- load input data
activities = LOAD 'activity' USING PigStorage(',') AS (sid:chararray, timestamp:long, name:chararray);

-- group input data
activities = GROUP activities BY sid;
activities = FOREACH activities GENERATE group, activities.(timestamp, name);

-- store grouped activities in a temporary file
STORE activities INTO 'tmp1' USING PigStorage();

-- reload grouped activities from the temporary file
activities = LOAD 'tmp1' USING PigStorage() AS (sid:chararray, acts:bag { act:tuple (timestamp:long,
name:chararray) });

-- store grouped activities again in another temporary file
STORE activities INTO 'tmp2' USING PigStorage();
{code}

The issue occurs when I use BinStorage() or PigStorage(',') instead of PigStorage() to store
/ reload my temporary files.


    
> PIG regression BinStorage/PigStorage between 0.8.1 and 0.9.x
> ------------------------------------------------------------
>
>                 Key: PIG-2271
>                 URL: https://issues.apache.org/jira/browse/PIG-2271
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Vincent BARAT
>            Priority: Critical
>         Attachments: activity
>


        
