Date: Mon, 18 May 2009 12:09:45 -0700 (PDT)
From: "Daniel Dai (JIRA)"
To: pig-dev@hadoop.apache.org
Subject: [jira] Resolved: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly

     [ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai resolved PIG-774.
----------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.0.0)
                   0.3.0

Yes, the patch is committed. Thanks.

> Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-774
>                 URL: https://issues.apache.org/jira/browse/PIG-774
>             Project: Pig
>          Issue Type: Bug
>          Components: grunt, impl
>    Affects Versions: 0.0.0
>            Reporter: Viraj Bhat
>            Assignee: Daniel Dai
>            Priority: Critical
>             Fix For: 0.3.0
>
>         Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch
>
>
> I created a very small test case in which I did the following.
> 1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests.
> 2) Created a parameter file which also contained the same query string as in Step 1.
> 3) Created a Pig script which takes in the parameterized query string and also contains the same Chinese string hard-coded.
> =================================================================
> Pig script: chinese_data.pig
> =================================================================
> {code}
> rmf chineseoutput;
> I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
> J = filter I by $0 == '$querystring';
> --J = filter I by $0 == ' 歌手 香港情牽女人心演唱會';
> store J into 'chineseoutput';
> dump J;
> {code}
> =================================================================
> Parameter file: nextgen_paramfile
> =================================================================
> queryid=20090311
> querystring=' 歌手 香港情牽女人心演唱會'
> =================================================================
> Input file: /user/viraj/chinese.txt
> =================================================================
> shell$ hadoop fs -cat /user/viraj/chinese.txt
> 歌手 香港情牽女人心演唱會
> =================================================================
> I ran the above set of inputs in the following ways:
> Run 1:
> =================================================================
> {code}
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
> {code}
> =================================================================
> 2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> =================================================================
> Run 2: removed the parameter substitution from the Pig script and used the following statement instead.
> =================================================================
> {code}
> J = filter I by $0 == ' 歌手 香港情牽女人心演唱會';
> {code}
> =================================================================
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig
> =================================================================
> 2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
> =================================================================
> In both cases the job reports success but the output is empty (part-00000 is 0 bytes):
> =================================================================
> {code}
> shell$ hadoop fs -ls /user/viraj/chineseoutput
> Found 2 items
> drwxr-xr-x   - viraj supergroup          0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs
> -rw-r--r--   3 viraj supergroup          0 2009-04-22 01:37 /user/viraj/chineseoutput/part-00000
> {code}
> =================================================================
> I also tried the dry-run option (-r) to check whether the parameter substitution was occurring properly.
> =================================================================
> {code}
> java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
> {code}
> =================================================================
> {code}
> shell$ file chinese_data.pig.substituted
> chinese_data.pig.substituted: ASCII text
> shell$ cat chinese_data.pig.substituted
> {code}
> {code}
> rmf chineseoutput;
> I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
> J = filter I by $0 == ' ?????? ??????????????????????????????';
> store J into 'chineseoutput';
> {code}
> =================================================================
> This issue has to do with the parser not handling UTF-8 data.
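
A note on the dry-run output quoted in the report: the substituted script shows 6 and then 30 question marks, which matches the UTF-8 byte counts of the two Chinese words (2 and 10 characters at 3 bytes each). That pattern is what one would expect if each UTF-8 byte were treated as a separate character and then written out as ASCII. The sketch below reproduces the pattern in Python under that assumption. It is only an illustration: the helper names `mangle` and `is_ascii_only` are hypothetical, and the byte-widening step is a guess at the mechanism, not a description of Pig's actual code path or of the committed patch.

```python
# Illustration only: reproduces the '?' pattern from the dry-run output.
# The mechanism modelled here is an assumption, not Pig's actual code.

QUERY = "' 歌手 香港情牽女人心演唱會'"


def mangle(text: str) -> str:
    """Simulate a pipeline that is not UTF-8 aware (assumed behaviour)."""
    raw = text.encode("utf-8")
    # Assumption: every byte is widened to one character (Latin-1 style),
    # so a 3-byte Chinese character becomes 3 separate characters.
    widened = raw.decode("latin-1")
    # Assumption: characters outside ASCII are replaced by '?' on output.
    return widened.encode("ascii", errors="replace").decode("ascii")


def is_ascii_only(data: bytes) -> bool:
    """Rough equivalent of `file` reporting 'ASCII text'."""
    return all(b < 0x80 for b in data)


if __name__ == "__main__":
    print(mangle(QUERY))
    print(is_ascii_only(QUERY.encode("utf-8")))
    print(is_ascii_only(mangle(QUERY).encode("ascii")))
```

Under these assumptions the mangled string is ASCII-only, consistent with `file` reporting "ASCII text" for chinese_data.pig.substituted, and it can no longer equal the UTF-8 value stored in chinese.txt, which would explain the empty filter output.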