From: Ajay Srivastava
To: "user@hadoop.apache.org"
Subject: Non utf-8 chars in input
Date: Tue, 11 Sep 2012 05:54:06 +0000
Message-ID: <8223D14F-C3F1-44F5-B2B7-622A181601AB@guavus.com>

Hi,

I am using the default InputFormat class to read input from text files, but the input file has some non-UTF-8 characters.

I guess that TextInputFormat is the default InputFormat class and that it replaces these non-UTF-8 characters with "\uFFFD". If I do not want this behavior and need the actual characters in my mapper, what is the correct InputFormat class to use?

Regards,
Ajay Srivastava
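
To make the behavior described above concrete, here is a minimal sketch in plain Java (no Hadoop classes) of how decoding bytes as UTF-8 turns a malformed byte into "\uFFFD", while decoding the same bytes with a single-byte charset keeps the character. The class name, the method names, and the choice of ISO-8859-1 as the file's real encoding are assumptions made for illustration only; the actual encoding of the input file is not stated in this message.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Hadoop.
class ReplacementCharDemo {

    // Decodes the first `length` bytes as UTF-8. The String constructor
    // substitutes U+FFFD for any malformed byte sequence.
    static String decodeAsUtf8(byte[] raw, int length) {
        return new String(raw, 0, length, StandardCharsets.UTF_8);
    }

    // Decodes the first `length` bytes as ISO-8859-1 (assumed encoding),
    // which maps every byte value to exactly one character.
    static String decodeAsLatin1(byte[] raw, int length) {
        return new String(raw, 0, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9: a valid ISO-8859-1 character,
        // but a malformed (truncated) sequence when read as UTF-8.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        String utf8 = decodeAsUtf8(raw, raw.length);
        String latin1 = decodeAsLatin1(raw, raw.length);

        System.out.println("UTF-8 decode contains U+FFFD: " + (utf8.indexOf('\uFFFD') >= 0));
        System.out.println("ISO-8859-1 decode keeps the char: " + latin1.equals("caf\u00E9"));
    }
}
```

If the same holds in a mapper, the raw bytes of each line may still be reachable through the Text value's getBytes() and getLength() methods and could be decoded with the file's real charset instead of calling toString(). That is an untested assumption here, not a confirmed answer to the question.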