pig-dev mailing list archives

From "Olga Natkovich (JIRA)" <j...@apache.org>
Subject [jira] Assigned: (PIG-1711) Document BinStorage behaviour
Date Tue, 18 Jan 2011 16:59:44 GMT

     [ https://issues.apache.org/jira/browse/PIG-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Olga Natkovich reassigned PIG-1711:

    Assignee: Corinne Chandel  (was: Olga Natkovich)

Here is what we need to document:

Pig uses BinStorage to store and load data generated between Map-Reduce jobs. Occasionally,
users also store their own data using BinStorage. Because this is a proprietary binary format, the
original data is never in BinStorage; it is always derived from some other data.

We have seen several examples of users doing something like this:

a = load 'b.txt' as (id, f);
b = group a by id;
store b into 'g' using BinStorage();

And then later:

a = load 'g/part*' using BinStorage() as (id, d:bag{t:(v, s)});
b = foreach a generate (double)id, flatten(d);
dump b;

There is a problem with this sequence of events. The first script does not define data types
and, as a result, the data is stored as a bytearray and a bag of tuples with two bytearrays each.
The second script attempts to cast the bytearray to double; however, since the data originated
from a different loader, Pig has no way to know the format of the bytearray or how to cast
it to a different type. Pig 0.9 addresses this issue in two ways:

    * By giving a meaningful error message when the second script is executed: "ERROR 1118:
Cannot convert bytes load from BinStorage"
    * By allowing the user to provide a converter to use during casting. 

a = load 'g/part*' using BinStorage('Utf8StorageConverter') as (id, d:bag{t:(v, s)});
b = foreach a generate (double)id, flatten(d);
dump b;
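
Independently of these two changes, users can avoid the problem by declaring field types in
the first script, so that typed values rather than bytearrays are written to BinStorage. This
follows Alan's comment in the issue description quoted below. A minimal, untested sketch,
assuming for illustration that id is numeric and f is text in b.txt (the double and chararray
types are assumptions, not part of the original report):

{code}
-- First script: declare types at load time so typed values are stored.
-- The types below are assumed for illustration.
a = load 'b.txt' as (id:double, f:chararray);
b = group a by id;
store b into 'g' using BinStorage();

-- Second script: declare the same types when reading the data back.
-- id was stored as a double, so no conversion from bytearray is needed.
a = load 'g/part*' using BinStorage() as (id:double, d:bag{t:(v:double, s:chararray)});
b = foreach a generate id, flatten(d);
dump b;
{code}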

> Document BinStorage behaviour 
> ------------------------------
>                 Key: PIG-1711
>                 URL: https://issues.apache.org/jira/browse/PIG-1711
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 0.6.0, 0.7.0
>            Reporter: Viraj Bhat
>            Assignee: Corinne Chandel
>             Fix For: 0.9.0
> We need to document some features of BinStorage that can cause indeterminate results.
> I have a Pig script of this type:
> {code}
> raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
> --filter out null columns
> A = filter raw by col1#'bcookie' is not null;
> B = foreach A generate col1#'bcookie'  as reqcolumn;
> describe B;
> --B: {reqcolumn: bytearray}
> X = limit B 5;
> dump X;
> B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
> describe B;
> --B: {convertedcol: chararray}
> X = limit B 5;
> dump X;
> {code}
> The first dump produces:
> (36co9b55onr8s)
> (36co9b55onr8s)
> (36hilul5oo1q1)
> (36hilul5oo1q1)
> (36l4cj15ooa8a)
> The second dump produces:
> ()
> ()
> ()
> ()
> ()
> So we need to write correct documentation on why this happens. One good explanation seems
> to be:
> According to Alan:
> BinStorage should not track data lineage. In the case where Pig is using BinStorage (or
> whatever) for moving data between MR jobs, Pig can figure out the correct cast function
> to use and apply it. For cases such as the one here, where users are storing data using BinStorage
> and then reading it in a separate Pig Latin script (and thus losing the type information),
> it is the user's responsibility to correctly cast the data before storing it in BinStorage.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
