hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Viraj Bhat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1187) UTF-8 (international code) breaks with loader when load with schema is specified
Date Thu, 14 Jan 2010 19:29:54 GMT

    [ https://issues.apache.org/jira/browse/PIG-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800315#action_12800315
] 

Viraj Bhat commented on PIG-1187:
---------------------------------

Hi Jeff,
 This is specific to the data we are using and it looks like parser failed when it is trying
to interpret some characters. As such we have tested this with Chinese characters and it works.
Viraj

> UTF-8 (international code) breaks with loader when load with schema is specified
> --------------------------------------------------------------------------------
>
>                 Key: PIG-1187
>                 URL: https://issues.apache.org/jira/browse/PIG-1187
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>             Fix For: 0.6.0
>
>
> I have a set of Pig statements which dump an international dataset.
> {code}
> INPUT_OBJECT = load 'internationalcode';
> describe INPUT_OBJECT;
> dump INPUT_OBJECT;
> {code}
> Sample output
> (756a6196-ebcd-4789-ad2f-175e5df65d55,{(labelAaÂâÀ),(labelあいうえお1),(labelஜார்க2),(labeladfadf)})
> It works and dumps results but when I use a schema for loading it fails.
> {code}
> INPUT_OBJECT = load 'internationalcode' AS (object_id:chararray, labels: bag {T: tuple(label:chararray)});
> describe INPUT_OBJECT;
> {code}
> The error message is as follows:2010-01-14 02:23:27,320 FATAL org.apache.hadoop.mapred.Child:
Error running child : org.apache.pig.data.parser.TokenMgrError: Error: Bailing out of infinite
loop caused by repeated empty string matches at line 1, column 21.
> 	at org.apache.pig.data.parser.TextDataParserTokenManager.TokenLexicalActions(TextDataParserTokenManager.java:620)
> 	at org.apache.pig.data.parser.TextDataParserTokenManager.getNextToken(TextDataParserTokenManager.java:569)
> 	at org.apache.pig.data.parser.TextDataParser.jj_ntk(TextDataParser.java:651)
> 	at org.apache.pig.data.parser.TextDataParser.Tuple(TextDataParser.java:152)
> 	at org.apache.pig.data.parser.TextDataParser.Bag(TextDataParser.java:100)
> 	at org.apache.pig.data.parser.TextDataParser.Datum(TextDataParser.java:382)
> 	at org.apache.pig.data.parser.TextDataParser.Parse(TextDataParser.java:42)
> 	at org.apache.pig.builtin.Utf8StorageConverter.parseFromBytes(Utf8StorageConverter.java:68)
> 	at org.apache.pig.builtin.Utf8StorageConverter.bytesToBag(Utf8StorageConverter.java:76)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:845)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:159)
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message