hadoop-pig-dev mailing list archives

From "Viraj Bhat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
Date Wed, 29 Apr 2009 01:18:30 GMT

    [ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703937#action_12703937 ]

Viraj Bhat commented on PIG-774:
--------------------------------

Daniel,
 Thanks again for your patch. I worked with Pradeep and changed the parser code to invoke the behavior you suggested, then filed this JIRA, PIG-774.
Here is one problem I faced.
Suppose I have a script like the following, known as chinese_data.pig:
{code}
rmf chineseoutput;
%default querystring 'myquery';
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');

--dump I;

J = filter I by $0 == '$querystring';
--J = filter I by $0 == '       歌手    香港情牽女人心演唱會';

--store J into 'chineseoutput';
dump J;
{code}

I have a parameter file known as "nextgen_paramfile" which defines the $querystring variable:

{code}
querystring="   歌手    香港情牽女人心演唱會"
{code}

I run the above script and parameter file as:
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}

I get the following error:
================================================================================
2009-04-29 01:05:14,979 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2009-04-29 01:05:16,328 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
2009-04-29 01:05:16,907 [main] INFO  org.apache.pig.PigServer - Create a new graph.
2009-04-29 01:05:17,794 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 7, column 33.  Encountered: "\u6b4c" (27468), after : "
================================================================================
I realized it had something to do with the commented line in the Pig script:
{code}
--J = filter I by $0 == '       歌手    香港情牽女人心演唱會';
{code}
Why is that so? I am attaching the pig_* log to this JIRA.
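The code point in the lexical error matches the first Chinese character of that comment: '歌' is U+6B4C, decimal 27468. Here is a minimal Java sketch (not Pig's parser code, just an illustration of the suspected charset mismatch) showing how that character's UTF-8 bytes fall apart when decoded with a single-byte charset:

```java
import java.nio.charset.StandardCharsets;

public class LexerErrorDemo {
    public static void main(String[] args) {
        // '歌' is the first Chinese character in the commented line.
        char ge = '歌';
        // U+6B4C == 27468, exactly the value in: Encountered: "\u6b4c" (27468)
        System.out.println((int) ge);

        // In UTF-8 the character occupies three bytes (E6 AD 8C).
        byte[] bytes = String.valueOf(ge).getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);

        // If a tokenizer decodes those bytes with a single-byte charset,
        // it sees three unrelated characters instead of one.
        String mangled = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(mangled.length());
    }
}
```

Since comments go through the same lexer as the rest of the script, a comment containing multi-byte characters is enough to trigger the error.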

Additionally, I found that the parameter substitution happens correctly when I run the script with the dry-run option:
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
{code}
The substituted file, chinese_data.pig.substituted, is correct.
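If the substituted file is correct on disk, the remaining suspect is how the parser opens the script. A hedged Java sketch (an assumed reader setup for illustration, not the actual code in utf8_parser-1.patch) of why the charset chosen when opening the stream matters:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ScriptReadDemo {
    public static void main(String[] args) throws IOException {
        // Stand-in for the bytes of a UTF-8 Pig script on disk.
        byte[] script = "J = filter I by $0 == '歌手';".getBytes(StandardCharsets.UTF_8);

        // Opening the stream with an explicit UTF-8 decoder round-trips cleanly.
        BufferedReader utf8 = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(script), StandardCharsets.UTF_8));
        System.out.println(utf8.readLine().contains("歌手")); // true

        // Relying on a single-byte charset mangles the multi-byte sequences,
        // which is the sort of input the lexer then rejects.
        BufferedReader latin1 = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(script), StandardCharsets.ISO_8859_1));
        System.out.println(latin1.readLine().contains("歌手")); // false
    }
}
```

This would explain why the -r dry run (which only writes the substituted file) succeeds while parsing the same content fails.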

Viraj

> Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-774
>                 URL: https://issues.apache.org/jira/browse/PIG-774
>             Project: Pig
>          Issue Type: Bug
>          Components: grunt, impl
>    Affects Versions: 0.0.0
>            Reporter: Viraj Bhat
>            Priority: Critical
>             Fix For: 0.0.0
>
>         Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, utf8_parser-1.patch
>
>
> I created a very small test case in which I did the following.
> 1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this DFS file as an input for the tests.
> 2) Created a parameter file which also contained the same query string as in Step 1.
> 3) Created a Pig script which takes in the parameterized query string and a hard-coded Chinese string.
> ================================================================
> Pig script: chinese_data.pig
> ================================================================
> {code}
> rmf chineseoutput;
> I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
> J = filter I by $0 == '$querystring';
> --J = filter I by $0 == ' 歌手    香港情牽女人心演唱會';
> store J into 'chineseoutput';
> dump J;
> {code}
> =================================================================
> Parameter file: nextgen_paramfile
> =================================================================
> queryid=20090311
> querystring='   歌手    香港情牽女人心演唱會'
> =================================================================
> Input file: /user/viraj/chinese.txt
> =================================================================
> shell$ hadoop fs -cat /user/viraj/chinese.txt
>         歌手    香港情牽女人心演唱會
> =================================================================
> I ran the above set of inputs in the following ways:
> Run 1:
> =================================================================
> {code}
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
> {code}
> =================================================================
> 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2009-04-22 01:31:40,700 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2009-04-22 01:31:50,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2009-04-22 01:31:50,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> =================================================================
> Run 2: removed the parameter substitution in the Pig script and instead used the following statement.
> =================================================================
> {code}
> J = filter I by $0 == ' 歌手    香港情牽女人心演唱會';
> {code}
> =================================================================
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig
> =================================================================
> 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2009-04-22 01:35:27,399 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2009-04-22 01:35:32,415 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2009-04-22 01:35:32,415 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> =================================================================
> In both cases:
> =================================================================
> {code}
> shell $ hadoop fs -ls /user/viraj/chineseoutput
> Found 2 items
> drwxr-xr-x   - viraj supergroup          0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs
> -rw-r--r--   3 viraj supergroup          0 2009-04-22 01:37 /user/viraj/chineseoutput/part-00000
> {code}
> =================================================================
> Additionally tried the dry-run option to figure out if the parameter substitution was occurring properly.
> =================================================================
> {code}
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
> {code}
> =================================================================
> {code}
> shell$ file chinese_data.pig.substituted 
> chinese_data.pig.substituted: ASCII text
> shell$ cat chinese_data.pig.substituted 
> {code}
> {code}
> rmf chineseoutput;
> I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
> J = filter I by $0 == ' ??????  ??????????????????????????????';
> store J into 'chineseoutput';
> {code}
> =================================================================
> This issue has to do with the parser not handling UTF-8 data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

