hadoop-user mailing list archives

From Ajay Srivastava <Ajay.Srivast...@guavus.com>
Subject Re: Non utf-8 chars in input
Date Tue, 11 Sep 2012 08:21:54 GMT

I guess the problem is that the Text class uses UTF-8 encoding and one cannot set a different
encoding on it.
I have not seen any other Text-like class that supports other encodings; otherwise I would have
written my own custom input format class.
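A common workaround (not a fix to Text itself) is to keep the default input format but decode the raw bytes yourself in the mapper: Text stores the line's bytes as-is, and the U+FFFD replacement only happens when they are interpreted as UTF-8 (e.g. via toString()), so calling value.getBytes()/getLength() and decoding with the file's real charset recovers the original characters. A minimal sketch in plain Java, assuming the input is ISO-8859-1 (the Hadoop mapper boilerplate is omitted; the charset is an example, substitute your file's actual encoding):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Decode raw line bytes with an explicit charset instead of assuming UTF-8.
    // In a mapper this would be: decode(value.getBytes(), value.getLength(), cs)
    static String decode(byte[] raw, int len, Charset cs) {
        return new String(raw, 0, len, cs);
    }

    public static void main(String[] args) {
        // 0xE9 is 'é' in ISO-8859-1, but an invalid lone byte in UTF-8.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        // Decoding as UTF-8 substitutes U+FFFD for the invalid byte --
        // the same behavior the thread complains about.
        String asUtf8 = decode(raw, raw.length, StandardCharsets.UTF_8);

        // Decoding with the file's real charset recovers the character.
        String asLatin1 = decode(raw, raw.length, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8);    // caf\uFFFD (replacement char)
        System.out.println(asLatin1);  // café
    }
}
```

The same idea applied inside map(): avoid value.toString() entirely and decode the bytes once per record with the correct Charset.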

Thanks for your inputs.

Ajay Srivastava

On 11-Sep-2012, at 1:31 PM, Joshi, Rekha wrote:

> Actually, even if that works, it does not seem an ideal solution.
> I think format and encoding are distinct concerns, and enforcing a format
> should not enforce an encoding. That means there should be a way to pass
> the encoding as a user choice on construction,
> e.g. TextInputFormat("your-encoding").
> But I do not see that in the API, so even if I extend
> InputFormat/RecordReader, I will not be able to add a
> setEncoding() feature to my file format. Having that would be a good solution.
> Thanks
> Rekha
> On 11/09/12 12:37 PM, "Joshi, Rekha" <Rekha_Joshi@intuit.com> wrote:
>> Hi Ajay,
>> Try SequenceFileAsBinaryInputFormat?
>> Thanks
>> Rekha
>> On 11/09/12 11:24 AM, "Ajay Srivastava" <Ajay.Srivastava@guavus.com>
>> wrote:
>>> Hi,
>>> I am using the default input format class for reading input from text
>>> files, but the input file has some non-UTF-8 characters.
>>> I guess that TextInputFormat is the default input format class and that
>>> it replaces these non-UTF-8 chars with "\uFFFD". If I do not want this
>>> behavior and need the actual chars in my mapper, what would be the
>>> correct input format class?
>>> Regards,
>>> Ajay Srivastava
