pig-dev mailing list archives

From "Dmitriy V. Ryaboy (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-2541) Automatic record provenance (source tagging) for PigStorage
Date Sat, 18 Feb 2012 22:31:59 GMT

    [ https://issues.apache.org/jira/browse/PIG-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211147#comment-13211147 ]

Dmitriy V. Ryaboy commented on PIG-2541:

Prashant, I am thinking of the case where the loaded schema is something like (a:int, b:int)
but, due to loading with "using PigStorage('\t', '-useSchema -pig.source.tagging=true')", the
schema expected by the user is (a:int, b:int, source_tag:chararray). Since the loader doesn't
report this modified schema, the user won't be able to access the new field. I suspect the regression
wasn't caught because you tested the two options only separately, never combined.
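
To make the mismatch concrete, here is a minimal sketch in Python rather than Pig. It is a toy model, not PigStorage code: the function names, the `loader_fixed` flag, and the file name `part-00000` are all hypothetical and exist only to illustrate the point. It shows a tuple that carries three values while the reported schema still names only two.

```python
# Hypothetical model of the mismatch; none of these names are Pig APIs.
DECLARED_SCHEMA = [("a", "int"), ("b", "int")]  # schema found alongside the data
TAG_FIELD = ("source_tag", "chararray")         # field added by source tagging


def load_row(line, source, tagging):
    """Split a tab-separated line and optionally append the source tag."""
    fields = line.split("\t")
    return fields + [source] if tagging else fields


def reported_schema(tagging, loader_fixed):
    """Schema the loader advertises to the rest of the script."""
    if tagging and loader_fixed:
        return DECLARED_SCHEMA + [TAG_FIELD]
    return list(DECLARED_SCHEMA)


row = load_row("1\t2", "part-00000", tagging=True)
broken = reported_schema(tagging=True, loader_fixed=False)
fixed = reported_schema(tagging=True, loader_fixed=True)

# The tuple has three values, but the unfixed schema names only two,
# so a reference to source_tag by name cannot be resolved.
print(len(row), [name for name, _ in broken])
print(len(row), [name for name, _ in fixed])
```

In this model the fix is simply that the advertised schema grows by one field whenever tagging is on, which is the behaviour the comment asks the loader to report.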

This should "just work" on the storage side, as opposed to the loader side; I don't think there's
a problem there as long as the loader side is fixed.

Regarding position of the tag -- I really think putting it at the beginning is better. As
I described above, putting it at the end leads to straight-up unpredictable results in some
circumstances; avoiding that situation takes precedence (in my mind) over the convenience of
leaving existing scripts unmodified (they will need to be modified anyway to take advantage of
this... so in for a penny, in for a pound).
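
One way the end position could become unpredictable, assumed here for illustration since the circumstances are described in an earlier comment not included in this message, is input whose rows do not all have the same number of fields. The Python sketch below uses hypothetical helper names and a made-up file name; it only compares where the tag lands under each placement.

```python
# Hypothetical helpers; this is a toy comparison, not PigStorage code.
def tag_at_end(fields, source):
    """Append the source tag after the data fields."""
    return fields + [source]


def tag_at_start(fields, source):
    """Put the source tag before the data fields."""
    return [source] + fields


SOURCE = "part-00000"       # made-up file name
rows = [["1", "2"], ["3"]]  # assumption: the second row is missing a field

# Index of the tag in each output tuple under the two placements.
end_positions = {len(tag_at_end(r, SOURCE)) - 1 for r in rows}
start_positions = {tag_at_start(r, SOURCE).index(SOURCE) for r in rows}

print(sorted(end_positions))    # tag index varies with row width
print(sorted(start_positions))  # tag index is the same for every row
```

With the tag first, a positional reference to it stays valid regardless of row width; with the tag last, the same reference can point at data in one row and at the tag in another.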
> Automatic record provenance (source tagging) for PigStorage
> -----------------------------------------------------------
>                 Key: PIG-2541
>                 URL: https://issues.apache.org/jira/browse/PIG-2541
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.9.1
>            Reporter: Richard Ding
>            Assignee: Prashant Kommireddi
>         Attachments: PIG-2541.patch
> There is a lot of interest in knowing where the data comes from when loading from a
> directory (or a set of directories). One can do it manually (see https://cwiki.apache.org/confluence/display/PIG/FAQ).
> But it would be more convenient for users if we implemented this in PigStorage with a command
> line option (e.g., pig.source.tagging=true/false) to turn it on/off. By default it will be
