hadoop-user mailing list archives

From "Joshi, Rekha" <Rekha_Jo...@intuit.com>
Subject Re: Non utf-8 chars in input
Date Tue, 11 Sep 2012 08:01:36 GMT
Actually even if that works, it does not seem an ideal solution.

I think format and encoding are distinct, and enforcing a format must not
enforce an encoding. That means it should be possible to pass the
encoding as a user choice on construction,
e.g. TextInputFormat("your-encoding").
But I do not see that in the API, so even if I extend
InputFormat/RecordReader, I will not be able to offer a
setEncoding() feature on my file format. Having that would be a good
solution.
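
To make the idea concrete, here is a minimal sketch in plain Java (no
Hadoop classes) of what a user-chosen encoding would change. The class
EncodingSketch and its decode() helper are hypothetical names, not part
of any Hadoop API. The sketch only shows that the same bytes decoded as
UTF-8 produce "\uFFFD", while decoding them with the charset they were
actually written in recovers the original character.

```java
import java.nio.charset.Charset;

public class EncodingSketch {

    /**
     * Decodes the first `length` bytes of `buffer` with the named charset.
     * Hypothetical helper: it stands in for what a setEncoding()-style
     * option would do when turning a raw record into a String.
     */
    static String decode(byte[] buffer, int length, String charsetName) {
        return new String(buffer, 0, length, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "café" written in ISO-8859-1: the lone byte 0xE9 is not valid UTF-8.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        // Decoding as UTF-8 substitutes the replacement character U+FFFD.
        String asUtf8 = decode(raw, raw.length, "UTF-8");

        // Decoding with the charset the file was written in keeps the char.
        String asLatin1 = decode(raw, raw.length, "ISO-8859-1");

        System.out.println("UTF-8 result contains U+FFFD: "
                + (asUtf8.indexOf('\uFFFD') >= 0));
        System.out.println("ISO-8859-1 result is 'caf' + U+00E9: "
                + asLatin1.equals("caf\u00E9"));
    }
}
```

As far as I can tell, the Text value handed to a mapper still holds the
raw bytes, so the same decoding could be applied in the mapper to
value.getBytes() with value.getLength() as the length, instead of
calling value.toString(). I have not verified this against every Hadoop
version, so treat it as an assumption.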

Thanks
Rekha

On 11/09/12 12:37 PM, "Joshi, Rekha" <Rekha_Joshi@intuit.com> wrote:

>Hi Ajay,
>
>Try SequenceFileAsBinaryInputFormat?
>
>
>Thanks
>Rekha
>
>On 11/09/12 11:24 AM, "Ajay Srivastava" <Ajay.Srivastava@guavus.com>
>wrote:
>
>>Hi,
>>
>>I am using the default InputFormat class to read input from text files,
>>but the input file has some non-UTF-8 characters.
>>I guess that TextInputFormat is the default InputFormat class and that it
>>replaces these non-UTF-8 chars with "\uFFFD". If I do not want this
>>behavior and need the actual chars in my mapper, what is the correct
>>InputFormat class?
>>
>>
>>
>>Regards,
>>Ajay Srivastava
>

