lucene-java-user mailing list archives

From Robert Muir <>
Subject Re: Non-English Languages Search
Date Wed, 11 May 2011 13:24:14 GMT
On Mon, May 9, 2011 at 5:32 PM, Provalov, Ivan
<> wrote:
> We are planning to ingest some non-English content into our application.  All content
> is OCR'ed, and there are a lot of misspellings and garbage terms because of this.  Each
> document has one primary language, with some exceptions (e.g. a few English terms mixed
> in with primarily non-English document terms).

sounds like you should talk to Tom Burton-West!

> 1. Does it make sense to mix two or more different Latin-based languages in the same
> index directory in Lucene (e.g. Spanish/French/English)?

I think it depends on the application. If the user is specifying the
language via the UI somehow, then it's probably simplest to just use a
separate index for each collection.
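A minimal sketch of that per-language setup, assuming the UI hands you an ISO language code: the `lang` parameter, the `index-<lang>` directory naming, and the particular analyzer choices below are all hypothetical illustrations, not anything from this thread.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.es.SpanishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PerLanguageIndexes {
    // One analyzer per supported language; an unknown code could fall
    // back to StandardAnalyzer instead of failing the lookup.
    private static final Map<String, Analyzer> ANALYZERS = Map.of(
            "en", new EnglishAnalyzer(),
            "es", new SpanishAnalyzer(),
            "fr", new FrenchAnalyzer());

    // Open a separate index directory for the language the user selected,
    // so each collection is analyzed and scored independently.
    static IndexWriter writerFor(String lang) throws IOException {
        Directory dir = FSDirectory.open(Paths.get("index-" + lang));
        return new IndexWriter(dir, new IndexWriterConfig(ANALYZERS.get(lang)));
    }
}
```

Keeping the indexes separate also sidesteps the skewed-statistics question entirely, since term and document frequencies are computed per index.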

> 2. What about mixing Latin and non-Latin languages?  We ran tests on English and Chinese
> collections mixed together and didn't see any negative impact (precision/recall).  Any
> other potential issues?

Right, none of the terms would overlap here... the only "issue" would
be a skewed maxDoc, but that is probably not a big deal at all. But
what's the benefit of mixing them?

> 3. Any recommendations for an Urdu analyzer?

You can always start with StandardAnalyzer, as it will tokenize the text...
you might be able to make use of resources such as
as a stoplist.
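A sketch of that starting point, assuming a recent Lucene where StandardAnalyzer accepts a CharArraySet of stopwords; the three Urdu words below ("ka", "ki", "aur") are only illustrative samples, not a real stoplist:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class UrduAnalyzerSketch {

    // Collect the tokens an analyzer produces for a piece of text.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample stopwords; a real Urdu stoplist would come
        // from an external resource.
        CharArraySet urduStops =
                new CharArraySet(Arrays.asList("کا", "کی", "اور"), true);

        // StandardAnalyzer segments on Unicode word boundaries, so Urdu
        // text tokenizes reasonably without a language-specific analyzer.
        Analyzer analyzer = new StandardAnalyzer(urduStops);
        System.out.println(tokens(analyzer, "اردو اور لسانیات"));
    }
}
```

The same pattern works for other languages that lack a dedicated analyzer in Lucene: start from StandardAnalyzer's Unicode-aware tokenization and layer a stoplist on top.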

