lucene-java-user mailing list archives

From "Provalov, Ivan" <Ivan.Prova...@cengage.com>
Subject Non-English Languages Search
Date Mon, 09 May 2011 21:32:56 GMT
We are planning to ingest some non-English content into our application.  All of the content is
OCR'ed, so it contains a lot of misspellings and garbage terms.  Each document has
one primary language, with some exceptions (e.g. a few English terms mixed into a primarily
non-English document).

1. Does it make sense to mix two or more Latin-script languages in the same Lucene index
directory (e.g. Spanish/French/English)?
2. What about mixing Latin-script and non-Latin-script languages?  We ran tests on English and
Chinese collections mixed together and didn't see any negative impact on precision or recall.
Are there any other potential issues?
3. Any recommendations for an Urdu analyzer?

Thank you,

Ivan Provalov