pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Leo Heska (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PIG-2613) Pig substitutes/mangles "upper ASCII" characters (values > 127)
Date Mon, 26 Mar 2012 10:58:27 GMT

     [ https://issues.apache.org/jira/browse/PIG-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Leo Heska updated PIG-2613:
---------------------------

    Attachment: DummyDataTS.txt
    
> Pig substitutes/mangles "upper ASCII" characters (values > 127)
> ---------------------------------------------------------------
>
>                 Key: PIG-2613
>                 URL: https://issues.apache.org/jira/browse/PIG-2613
>             Project: Pig
>          Issue Type: Bug
>          Components: data, parser
>    Affects Versions: 0.8.1
>         Environment: linux
>            Reporter: Leo Heska
>         Attachments: DummyDataTS.txt
>
>
> Create small/dummy input file that contains ASCII 254 (decimal) characters. These are
often represented as the Thorn character. A sample line looks like this:
>    1þ4þaaaþbbbþcccþdddþ7þ8þ9
> but your browser may not render that correctly. Hex representation of that sample line:
>    31FE34FE616161FE626262FE636363FE646464FE37FE38FE390D0A
> or, with spaces added for your convenience in reading:
>    31 FE 34 FE 61 61 61 FE 62 62 62 FE 63 63 63 FE 64 64 64 FE 37 FE 38 FE 39 0D 0A
> You can see that this is just a sample line of plain ASCII numerals and lower-case letters,
separated by the FE (hex) or 254 (decimal) code point.
> Now load, like this:
>    dummyts = load '/test/DummyDataTS.txt' using PigStorage(',') as (line:chararray);
> A dump 
>    dump dummyts;
>    
> shows this:
>    (1�4�aaa�bbb�ccc�ddd�7�8�9)
> The problem does not seem to be with the dump. I have written a UDF that counts characters
in the line and returns TRUE if the character count is correct. When I do this:
>  
>    fd = filter dummyts by CountRight(line, 254, 8);
> which is saying "validate that there are 8 instances of the ASCII 254 code point/character"
I get no results. When I do this:
>    fd1 = filter dummyts by CountRight(line, 97, 3);
> which says "validate that there are three instances of the 'a' (ASCII 97) character the
results are perfect.
> It looks like something in Pig's load is changing instances of ASCII 254 to the following
three characters:
>    �

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

Mime
View raw message