hadoop-common-user mailing list archives

From "Taeho Kang" <tka...@gmail.com>
Subject MapReduce with multi-languages
Date Tue, 08 Jul 2008 08:39:32 GMT
Dear Hadoop User Group,

What are elegant ways to run MapReduce jobs on text data encoded in
something other than UTF-8?

It looks like Hadoop assumes text data is always in UTF-8 and handles it
that way, both encoding and decoding as UTF-8.
Whenever the data is not UTF-8 encoded, problems arise.
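
To make the problem concrete, here is a minimal sketch in plain Java (no Hadoop classes) of what happens when bytes in another encoding are decoded as UTF-8. The class name, the helper decodeAsUtf8, and the choice of ISO-8859-1 as the sample encoding are illustrative assumptions, not anything taken from Hadoop itself.

```java
import java.nio.charset.StandardCharsets;

public class Utf8AssumptionDemo {
    // Hypothetical helper: decodes raw bytes as UTF-8, the way a
    // UTF-8-only text type would treat its input.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "cafe" with an accented e, encoded as ISO-8859-1: the last
        // character is the single byte 0xE9, which is not valid UTF-8 here.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decodeAsUtf8(latin1);
        String right = new String(latin1, StandardCharsets.ISO_8859_1);

        // The UTF-8 decode substitutes U+FFFD for the byte it cannot
        // interpret, so the original character is lost.
        System.out.println("contains U+FFFD: " + wrong.contains("\uFFFD"));
        System.out.println("matches original: " + wrong.equals(right));
    }
}
```

Once the replacement character has been substituted, the original byte cannot be recovered from the decoded string, so the correct charset has to be applied before or instead of the UTF-8 decode.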

Here is what I'm thinking of to resolve the situation. Please correct me
and advise if my approaches look bad!

(1) Re-encode the original data as UTF-8?
(2) Replace the parts of the source code where the UTF-8 encoder and
decoder are used?
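
Both options come down to decoding the raw bytes with the charset they were actually written in. Below is a rough sketch in plain Java, assuming the source charset is known in advance. The class CharsetSketch and its methods are hypothetical names. The (buffer, length) signature is chosen on the assumption that the bytes come from something like Hadoop's Text.getBytes() and Text.getLength(), where only the first getLength() bytes of the backing array are valid.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
    // Option (2) style: decode the first `length` bytes of `buf` with the
    // real source charset instead of UTF-8. Bad input is reported rather
    // than silently replaced, so corrupt records are noticed.
    static String decode(byte[] buf, int length, Charset source) {
        try {
            return source.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(buf, 0, length))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                    "input is not valid " + source.name(), e);
        }
    }

    // Option (1) style: re-encode a record as UTF-8 bytes, for example in
    // a preprocessing pass before the data is handed to the job.
    static byte[] toUtf8(byte[] buf, int length, Charset source) {
        return decode(buf, length, source).getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, CharsetSketch.decode(bytes, bytes.length, Charset.forName("EUC-KR")) would decode Korean text, assuming that charset is available in the running JVM. Re-encoding up front keeps the job code unchanged, while decoding inside the job avoids a second copy of the data.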

Or has anyone had trouble running MapReduce jobs on multi-language
data?

Any suggestions and advice are welcome and appreciated!

Regards,

Taeho
