From: "Provalov, Ivan"
To: "java-user@lucene.apache.org"
Date: Mon, 9 May 2011 17:32:56 -0400
Subject: Non-English Languages Search

We are planning to ingest some non-English content into our application. All content is OCR'ed, and there are a lot of misspellings and garbage terms because of this. Each document has one primary language, with some exceptions (e.g. a few English terms mixed in with primarily non-English document terms).

1. Does it make sense to mix two or more different Latin-based languages in the same index directory in Lucene (e.g. Spanish/French/English)?
2. What about mixing Latin and non-Latin languages? We ran tests on English and Chinese collections mixed together and didn't see any negative impact (precision/recall). Any other potential issues?
3. Any recommendations for an Urdu analyzer?

Thank you,
Ivan Provalov