From: Grant Ingersoll
To: java-user@lucene.apache.org
Subject: Re: Indexing Non-English text
Date: Tue, 4 Dec 2007 06:52:20 -0500

FileReader always decodes with the platform default character encoding, which depends on your local locale.
http://wiki.apache.org/lucene-java/IndexingOtherLanguages has some useful tips. Essentially, you need to make sure you control the encodings at all input points of your application. Lucene will do the appropriate thing internally.

On Dec 4, 2007, at 5:53 AM, Liaqat Ali wrote:

> Hi,
> I'm facing a problem while indexing a small .txt file with Lucene.
> The file which I want to index with Lucene is in the Urdu language
> (a variant of Arabic and Persian). But the index I get is in Unicode
> form, not in the real form (original Urdu text). This program works
> well for a file in English. This is the code I use for
> indexing:
>
> FileReader file = new FileReader ("urdoc.txt");
> BufferedReader buff = new BufferedReader(file);
> String line = buff.readLine();
> boolean eof = false;
> buff.close();
> String indexDir = "D:\\index";
> Analyzer analyzer = new StandardAnalyzer();
> boolean createFlag = true;
> IndexWriter writer =
>     new IndexWriter(indexDir, analyzer, createFlag);
> Document document = new Document();
> document.add(new Field("fieldname",line, Field.Store.YES,
>     Field.Index.TOKENIZED));
> writer.addDocument(document);
> writer.close();
>
> Kindly guide me on what I should do. Would I have to change this code,
> or is there something else you would suggest?
>
> Liaqat
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
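
To illustrate the point about controlling the encoding at the input side, here is a minimal sketch that swaps FileReader for an InputStreamReader with an explicit charset. It assumes urdoc.txt is saved as UTF-8; the original message does not say which encoding the file uses, so adjust the charset if that guess is wrong. The class and method names (EncodingAwareReader, readFirstLine) are made up for this example and are not part of Lucene.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Lucene.
class EncodingAwareReader {

    // Reads the first line of the file, decoding its bytes with the
    // charset the caller names instead of the platform default.
    static String readFirstLine(String path, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java EncodingAwareReader <file>");
            return;
        }
        // Assumption: the input file is UTF-8 encoded.
        String line = readFirstLine(args[0], StandardCharsets.UTF_8);
        System.out.println(line);
    }
}
```

The string returned here could then be passed to the Field constructor exactly as in the quoted code. Whether the Urdu text displays correctly afterwards also depends on the encoding used wherever the text is written back out (console, web page, and so on).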