hadoop-pig-dev mailing list archives

From "Daniel Dai (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
Date Mon, 27 Apr 2009 07:13:30 GMT

    [ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702656#action_12702656
] 

Daniel Dai edited comment on PIG-774 at 4/27/09 12:12 AM:
----------------------------------------------------------

As Alan said, adding an option to QueryParser.jjt and ParamLoader.jj will do the trick. We probably
do not need to hardcode "UTF8" into getBytes. If the OS encoding is UTF-8 (LANG=UTF-8),
getBytes generates a byte array in the OS encoding, which is UTF-8. If the OS uses a native encoding
(LANG=GB2312), getBytes generates a byte array in that native encoding, and JavaCharStream then
interprets the input stream as the native encoding as well, so everything works.
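A minimal, self-contained sketch of that default-charset behavior (illustrative Java, not Pig code; the sample string is arbitrary):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "歌手"; // two CJK characters
        // With no argument, getBytes encodes with the platform default
        // charset, i.e. whatever the LANG/locale settings select:
        byte[] osBytes = s.getBytes(Charset.defaultCharset());
        // With an explicit charset the result is the same on every OS;
        // each of these two characters takes 3 bytes in UTF-8:
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8Bytes.length); // prints 6
        // On a UTF-8 locale, osBytes and utf8Bytes are identical, which
        // is why the pipeline works transparently there.
    }
}
```

The point is that hardcoding "UTF8" only matters when the platform default can differ from what the downstream consumer (here, the JavaCC-generated char stream) assumes.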

One thing I want to point out: on a UTF-8 OS, everything is fine. However, on a legacy system
with a native encoding, PigStorage treats all input/output files as UTF-8, which is reasonable because
all data files come from or go to the Hadoop backend, where UTF-8 is highly desirable. However,
these input/output files cannot be read with "vi" on an OS with a native encoding, since most applications
(e.g. vi, cat) interpret input files using the OS encoding. In addition, if we do a Pig dump on
such an OS, we will see a garbled UTF-8 output stream. Script files and parameter files are
local, and most users will edit them with vi, so we should interpret script files and parameter files
using the OS encoding.
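The viewer mismatch described above can be reproduced in isolation; ISO-8859-1 stands in here for a non-UTF-8 native OS charset (a sketch, not the actual PigStorage code path):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "歌手    香港情牽女人心演唱會";
        // PigStorage writes the data as UTF-8 bytes...
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // ...but a tool like vi or cat on a non-UTF-8 locale decodes
        // those bytes with the OS charset, producing mojibake:
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals(original)); // prints false
        // Decoding with the charset that produced the bytes is lossless:
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(original)); // prints true
    }
}
```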

utf8_parser-1.patch is a preliminary patch. Viraj, can you give it a try?

We also need to fix jline; it currently does not handle multibyte characters well.
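As an aside, the runs of '?' in Viraj's dry-run output below are the classic symptom of encoding through a charset that cannot represent the characters: String.getBytes substitutes the charset's default replacement byte, which is '?' for ASCII-family charsets (standalone sketch):

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        String s = "歌手";
        // getBytes replaces every character the target charset cannot
        // encode with that charset's default replacement byte ('?'):
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints ??
    }
}
```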

> Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-774
>                 URL: https://issues.apache.org/jira/browse/PIG-774
>             Project: Pig
>          Issue Type: Bug
>          Components: grunt, impl
>    Affects Versions: 0.0.0
>            Reporter: Viraj Bhat
>            Priority: Critical
>             Fix For: 0.0.0
>
>         Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, utf8_parser-1.patch
>
>
> I created a very small test case in which I did the following.
> 1) Created a UTF-8 file containing a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests.
> 2) Created a parameter file which also contained the same query string as in Step 1.
> 3) Created a Pig script which takes in the parameterized query string and a hard-coded Chinese string.
> ================================================================
> Pig script: chinese_data.pig
> ================================================================
> {code}
> rmf chineseoutput;
> I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
> J = filter I by $0 == '$querystring';
> --J = filter I by $0 == ' 歌手    香港情牽女人心演唱會';
> store J into 'chineseoutput';
> dump J;
> {code}
> =================================================================
> Parameter file: nextgen_paramfile
> =================================================================
> queryid=20090311
> querystring='   歌手    香港情牽女人心演唱會'
> =================================================================
> Input file: /user/viraj/chinese.txt
> =================================================================
> shell$ hadoop fs -cat /user/viraj/chinese.txt
>         歌手    香港情牽女人心演唱會
> =================================================================
> I ran the above set of inputs in the following ways:
> Run 1:
> =================================================================
> {code}
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
> {code}
> =================================================================
> 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2009-04-22 01:31:40,700 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2009-04-22 01:31:50,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2009-04-22 01:31:50,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> =================================================================
> Run 2: removed the parameter substitution from the Pig script and instead used the following statement.
> =================================================================
> {code}
> J = filter I by $0 == ' 歌手    香港情牽女人心演唱會';
> {code}
> =================================================================
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig
> =================================================================
> 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2009-04-22 01:35:27,399 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2009-04-22 01:35:32,415 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2009-04-22 01:35:32,415 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> =================================================================
> In both cases:
> =================================================================
> {code}
> shell $ hadoop fs -ls /user/viraj/chineseoutput
> Found 2 items
> drwxr-xr-x   - viraj supergroup          0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs
> -rw-r--r--   3 viraj supergroup          0 2009-04-22 01:37 /user/viraj/chineseoutput/part-00000
> {code}
> =================================================================
> Additionally, I tried the dry-run option to check whether the parameter substitution was occurring properly.
> =================================================================
> {code}
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
> {code}
> =================================================================
> {code}
> shell$ file chinese_data.pig.substituted 
> chinese_data.pig.substituted: ASCII text
> shell$ cat chinese_data.pig.substituted 
> {code}
> {code}
> rmf chineseoutput;
> I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
> J = filter I by $0 == ' ??????  ??????????????????????????????';
> store J into 'chineseoutput';
> {code}
> =================================================================
> This issue has to do with the parser not handling UTF-8 data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

