hadoop-common-user mailing list archives

From NOMURA Yoshihide <y.nom...@jp.fujitsu.com>
Subject Re: MapReduce with multi-languages
Date Fri, 11 Jul 2008 05:36:06 GMT
Mr. Taeho Kang,

I need to analyze text in different character encodings too,
and I have suggested supporting an encoding configuration option in
TextInputFormat.

For now, though, I think you should convert your text files to UTF-8
before running the job.
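Re-encoding ahead of time only needs the file's actual charset to be named explicitly. A minimal sketch in plain Java (independent of Hadoop; ISO-8859-1 is just an illustrative source encoding, and the same idea could be applied to the raw bytes of a Text value inside a mapper):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReencodeToUtf8 {

    // Decode raw bytes with the source charset, then re-encode as UTF-8.
    static byte[] toUtf8(byte[] raw, String sourceEncoding) {
        String decoded = new String(raw, Charset.forName(sourceEncoding));
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1: the single 0xE9 byte is not valid UTF-8.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
        // After conversion the accented character is two bytes (0xC3 0xA9).
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // prints "café"
    }
}
```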


Taeho Kang:
> Dear Hadoop User Group,
> What are elegant ways to do mapred jobs on text-based data encoded with
> something other than UTF-8?
> It looks like Hadoop assumes the text data is always in UTF-8 and handles
> it that way - encoding and decoding with UTF-8.
> And whenever the data is not UTF-8 encoded, problems arise.
> Here is what I'm thinking of to clear up the situation... please correct
> me and advise if my approaches look bad!
> (1) Re-encode the original data with UTF-8?
> (2) Replace the part of source code where UTF-8 encoder and decoder are
> used?
> Or have any of you had trouble running a map-red job on data in
> multiple languages?
> Any suggestions or advice are welcome and appreciated!
> Regards,
> Taeho

NOMURA Yoshihide:
     Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
     Tel: 044-754-2675 (Ext: 7106-6916)
     Fax: 044-754-2570 (Ext: 7108-7060)
     E-Mail: [y.nomura@jp.fujitsu.com]
